All active clients are able to connect to our central server however some of the windows clients are randomly connecting and randomly not. Logs on all clients at the time this happens throw several:
[xxxxx] 2014/11/25 13:09:01.072040 main.go:1036: DEBUG: dial tcp 31.54.87.xxx:33143: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
Is there any more debug info we can get than this? I know for a fact that one of the devices isn’t even behind a firewall and is connected directly to the router in the office but my device is unable to connect to it but the central server has a connection to it.
It seems like the connection process works when it wants… Any ideas?
EDIT: I have tried clearing the config on my own device and reconnected it to our central server which has introduced the other clients but it’s still not connecting only to our central server
Hi @calmh, There is no firewall in the way as stated in my post. In the office all clients connect directly to the router via wireless.
I’ve just turned on another device in my house and that IS able to connect to a device in that office that my laptop is unable to. (everything was fine on v.10.6)
Nothing has changed between v0.10.6 and v0.10.7 related to the connections so the issue is at your end.
The issue you might be having is that the global discovery wins over local discovery and then UPnP somehow doesn’t work from inside the network. I say this because the IP and the port from the log does not look local, it looks external.
If all of the devices are on the same subnet, I suggest you disable global discovery and see if that helps, or perhaps tweak UPnP settings, as it seems that the UPnP mappings are somehow timing out? Is your lease length larger than the renewal period?
My Home Network -> My Laptop
My Home Network -> Another Laptop
DC Network -> Our Firewall -> Central File Server
One Business Building Network (No firewall) -> L’s Laptop
My Laptop can connect to our Central File Server
My Laptop can connect to Another Laptop
Another Laptop can connect to our Central File Server
Another Laptop can connect to My Laptop
Another Laptop can connect to L’s Laptop
L’s Laptop can connect to our Central File Server
L’s Laptop can connect to Another Laptop
As you can see My Laptop cannot connect to L’s laptop and L’s laptop can’t connect to My Laptop yet there is no reason why it shouldn’t
I’ve just managed to get the log from L’s laptop and as you can see there are lots and lots of the same error message trying to connect to My Laptop.
[XXXXX] 2014/11/25 13:55:14.545432 main.go:802: DEBUG: Renewed UPnP port mapping for external port 33143 on device ‘BT Business Hub 3.0 Media Gateway’ (192.168.1.254).
[XXXXX] 2014/11/25 13:57:01.783213 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-XXXXXXX-QUUYQ4G-3L4GGD6-XXXXXXX-xxxxxxx-XXXXXXX 81.98.65.XXX:22000
[XXXXX] 2014/11/25 13:57:22.785549 main.go:1036: DEBUG: dial tcp 81.98.65.XXX:22000: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond.
[XXXXX] 2014/11/25 13:58:22.786370 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-XXXXXXX-QUUYQ4G-3L4GGD6-XXXXXXX-xxxxxxx-XXXXXXX 81.98.65.XXX:22000
[XXXXX] 2014/11/25 13:58:43.791283 main.go:1036: DEBUG: dial tcp 81.98.65.XXX:22000: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond.
[XXXXX] 2014/11/25 14:01:45.869149 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-XXXXXXX-QUUYQ4G-3L4GGD6-XXXXXXX-xxxxxxx-XXXXXXX 81.98.65.XXX:22000
[XXXXX] 2014/11/25 14:02:06.872646 main.go:1036: DEBUG: dial tcp 81.98.65.XXX:22000: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond.
[XXXXX] 2014/11/25 14:03:36.965190 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-XXXXXXX-QUUYQ4G-3L4GGD6-XXXXXXX-xxxxxxx-XXXXXXX 81.98.65.XXX:22000
[XXXXX] 2014/11/25 14:03:57.965725 main.go:1036: DEBUG: dial tcp 81.98.65.XXX:22000: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond.
[XXXXX] 2014/11/25 14:04:57.974657 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-XXXXXXX-QUUYQ4G-3L4GGD6-XXXXXXX-xxxxxxx-XXXXXXX 81.98.65.XXX:22000
[XXXXX] 2014/11/25 14:05:18.979081 main.go:1036: DEBUG: dial tcp 81.98.65.XXX:22000: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond.
[XXXXX] 2014/11/25 14:08:04.198283 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-XXXXXXX-QUUYQ4G-3L4GGD6-XXXXXXX-xxxxxxx-XXXXXXX 81.98.65.XXX:22000
[XXXXX] 2014/11/25 14:08:25.201402 main.go:1036: DEBUG: dial tcp 81.98.65.XXX:22000: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond.
One thing I have noticed whilst trying to debug is that Syncthing tries connecting to each device one by one, it’s like it begins trying to connect to device 1, wait for success or failure then moves to the next device. Why doesn’t it just try connecting to them all at once?
It does that exactly, concurrently, which doesn’t mean parallel.
The fact that all of them have the sleep timer of the same length makes it look like it does one after another.
Have you got any limiter settings enabled anywhere? I had similar failures due to crazy low limiter settings.
If not, can you actually telnet from one device to another?
I suggest you shut down syncthing after it acquired a UPnP mapping, do a nc -l 22000 and then on the other end do echo test | nc <external ip> <the upnp port> to see if you can actually communicate.
Right I’ve managed to find another bug whilst investigating this.
I just added 2 new devices in another network, and My Laptop isn’t connecting to one of them. What’s weird is the log shows connecting to New Device #1’s ID but has actually connected to New Device #2 and hence isn’t trying to connect to that id anymore!
It seems as though all of the errors in here are coming from UPnP. I’ve found that restarting Syncthing using the webGUI doesn’t correctly re-map UPnP but if I close the syncthing process and start it up again it works as expected.
EDIT: Also Syncthing is definitely waiting for a success or failure before attempting connection to the next device. It’s proven by the log print out, it shows the debug DIAL then nothing else until a failure/success
EDIT 2: Yes I had tried telnetting to no avail. There’s definitely something going wrong with the UPnP mapping which is what’s causing these issues - even though all UPnP events report NO errors
EDIT 3: I can see Syncthing trying to connect to the non-UPnP IP/Port in the logs several times… It’s like the Discovery server isn’t passing the correct information or is receiving the wrong information to begin with
Well if your UPnP says Yes I've mapped a port for you, what error do you want us to report?
I’ve checked the code, and you are right, it does not dial concurrently, it dials and listens concurrently, but dials each connection one by one.
I think (not sure) that the announce server holds last 1800 seconds worth of announcements and provides all of them to the client.
Then the client dials them one by one to see which one works.
Restarting syncthing via web UI has the same effect as restarting syncthing process, it has no difference.
The monitor process (which does the restarting) does nothing apart from starting the child process and holds no state.
Well, perhaps as you restarted the server your UPnP device refused to issue a port mapping, which caused syncthing to assume there is no UPnP device and advertise it’s native port.
I think it would be possible to do everything concurrently, and you can make a pull request on Github, but I don’t see how this solves your problem or any problem actually.
Concurrently is definitely the preferred method so Syncthing doesn’t have to go through 5 failed connections before getting a valid connection and starting to sync.
Why is the discovery service reporting the right external port for the other device on that network which I can connect to? It makes no sense
That’s the mappings you get from the announce server.
Why are the mappings like this, ask your UPnP device.
I guess one of the nodes just fails to get a mapping advertising an incorrrect port.
I can see that all devices in 81.255.255.133 are failing, as well as one device in 217.255.255.31
81.255.255.133 is failing for internal and external clients (firewall issue I have no access to - waiting for them to allow the ports)
The thing is - our server is getting the correct mapping for YVFI7PH whereas my Laptop is not, that mapping came from my laptop. I’ve enabled discovery logging on our server now in order to get the mapping that receives but this device isn’t currently online so I’m just waiting for it to come back on.
Last line (always offset 00000030):
51 8a c9 85 is the 4 bytes representing an IP in hex, 51 hex is 81 in dec, etc, etc.
00 00 55 f0 is the 4 byte (16 bit) integer representing the port, 55f0 in hex is 22000 in dec
Basically, if one of your machines made false announcements, you should probably restart it and make sure UPnP works.
Then give an hour or two for the announce cache to clear, or reconfigure the IP address of the node from dynamic to some correct address.
Is there a way to force the announce cache to be cleared?
I have a feeling that there is an issue with BT Hubs UPnP function… Problems seem to be arising from clients connected to either BT Home Hub or BT Business Hub and thanks for explaining which parts of the debug information I need to use.
So the cache entries are only valid for an hour, and given they are not re-advertised, they expire.
But it seems that entries for the same IP with a new port should be updated instantly, given the announcement arrives:
https://github.com/syncthing/discosrv/blob/master/main.go#L345