After upgrading to v0.11.22 nodes no longer sync

exhuma · August 27, 2015, 12:41pm

As the title states, I recently upgraded my nodes. This was due to a DNS lookup problem which has been solved by compiling against go1.5.

Unfortunately, now my nodes no longer sync. The web-interface shows them disconnected on both sides. And I don’t know how to debug this. At least the “Global Discovery” says “OK”.

I’ve tried using STTRACE=beacon,discover as I did with the DNS issue. However, I can’t derive anything useful from that output. Also, looking at the other possible trace flags, I don’t see anything that could be used to debug the connection between nodes.

How do I do that? I’m curious as to why they don’t connect. Don’t they see each other? Is something going wrong during the connection handshake? I can’t tell.

calmh · August 27, 2015, 1:26pm

v0.11.22 is built with Go 1.4 again, as Go 1.5 has a pretty bad bug that affects all discovery. So the DNS issue you suffered from earlier is back again, I’m afraid.

Local and global discovery should work though.

exhuma · August 27, 2015, 1:34pm

Hmmm… I just realised that I still have the entries in my /etc/hosts file. I forgot to remove them after the upgrade. This is also why I did not see anything suspicious in the log after upgrading. As you say, the DNS issue should be back and I should have seen something in the logs. But as I still had the entries in the hosts file, it kept on working.

I also double-checked the IP and it is unchanged.

However, the nodes still show up as “disconnected” on both ends.

calmh · August 27, 2015, 1:38pm

STTRACE=net will show you all connection activity. It may have something useful.
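For anyone scripting this, here is a minimal sketch of starting Syncthing with trace facilities enabled. It assumes the binary is on the PATH as `syncthing`; the helper names `trace_env` and `run_with_trace` are made up for illustration. STTRACE takes a comma-separated list, as in the `beacon,discover` example earlier in the thread.

```python
import os
import subprocess


def trace_env(facilities, base_env=None):
    """Return a copy of the environment with STTRACE set to the given facilities."""
    env = dict(os.environ if base_env is None else base_env)
    # STTRACE is a comma-separated list, e.g. "beacon,discover" or just "net".
    env["STTRACE"] = ",".join(facilities)
    return env


def run_with_trace(facilities, binary="syncthing"):
    """Start Syncthing in the foreground with tracing on (assumes `binary` is on PATH)."""
    return subprocess.run([binary], env=trace_env(facilities))
```

Calling `run_with_trace(["net"])` should be equivalent to running `STTRACE=net syncthing` from a shell; the trace output goes wherever Syncthing normally logs.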

exhuma · August 27, 2015, 1:51pm

I have no idea why, but it just ended up connecting. I only restarted the service to enable other trace facilities. Maybe it did not restart properly beforehand (when I made the upgrade)? I used the restart from the web-interface. Although, the web interface did show the proper version number…

This is strange.

The wall of text which follows is all wild guesswork…

A colleague just pointed something out. First, you need to know that I have 3 machines in the network (home server, home desktop and mobile laptop), connected in a full mesh.

All of them are behind NAT: two in my home network, one behind another NAT (the office). I have access to the port forwarding configuration on my home NAT. I cannot configure forwards on the office one, as my colleague also uses ST. Unless I chose another port, but I guess that would open up other points of failure… maybe I’ll do that in the future.

So essentially, the machine from the office can connect to the machines in the home network but the reverse is not possible.

My wild guess is that the home server tried to connect to the laptop in the office but failed. Then, after the restart, the office laptop came up first and managed to connect to the home server.

I’m not sure if there’s some kind of race condition at work or if it’s just sheer bad luck…

It’s unfortunate that I don’t have any logs from the rest of the day, so I’m afraid it will remain guesswork.
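One way to take some of the guesswork out of the "who can reach whom" question is to test plain TCP reachability from each side. The following is only a rough sketch: `can_reach` is a hypothetical helper, and port 22000 is assumed to be the sync listen port (adjust it if the listen address was changed). It says nothing about the Syncthing handshake itself, only whether a TCP connection can be opened.

```python
import socket


def can_reach(host, port=22000, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

Run from the office laptop against the home address, this would be expected to return True if the port forward works; run from home against the office address, it would be expected to return False, matching the asymmetry described above.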

AudriusButkevicius · August 27, 2015, 5:03pm

If disconnected, both sides always attempt to connect, every 60 seconds by default.
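A toy sketch of what that retry behaviour implies: each round, every side tries to dial, and the link is up as soon as any one dial succeeds, so the start-up order should not matter for long. This is not Syncthing's actual code; `connect_with_retries` and the dialer callables are hypothetical stand-ins for real connection attempts, and only the 60-second interval is taken from the post above.

```python
import time


def connect_with_retries(dialers, interval=60.0, max_rounds=5, sleep=time.sleep):
    """Try every dialer each round; return the name of the first that succeeds, else None."""
    for round_no in range(max_rounds):
        for name, dial in dialers.items():
            if dial():
                return name
        if round_no < max_rounds - 1:
            sleep(interval)  # wait before the next round of attempts
    return None
```

With a home-to-office dialer that always fails (no port forward at the office) and an office-to-home dialer that succeeds, the function returns the office-to-home name in the first round in which that dial works, regardless of which side was started first.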