Certain clients unable to connect to eachother

Rewt0r · November 25, 2014, 1:13pm

Our setup:

1 central file server
7 Windows devices (at the moment more to come)

All active clients are able to connect to our central server however some of the windows clients are randomly connecting and randomly not. Logs on all clients at the time this happens throw several:

[xxxxx] 2014/11/25 13:09:01.072040 main.go:1036: DEBUG: dial tcp 31.54.87.xxx:33143: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

Is there any more debug info we can get than this? I know for a fact that one of the devices isn’t even behind a firewall and is connected directly to the router in the office but my device is unable to connect to it but the central server has a connection to it.

It seems like the connection process works when it wants… Any ideas?

EDIT: I have tried clearing the config on my own device and reconnected it to our central server which has introduced the other clients but it’s still not connecting only to our central server

calmh · November 25, 2014, 1:21pm

In nine cases out of ten that’s a firewall in the way.

(Probably the tenth also, but I can’t prove that.)

Rewt0r · November 25, 2014, 1:25pm

Hi @calmh, There is no firewall in the way as stated in my post. In the office all clients connect directly to the router via wireless.

I’ve just turned on another device in my house and that IS able to connect to a device in that office that my laptop is unable to. (everything was fine on v.10.6)

AudriusButkevicius · November 25, 2014, 1:50pm

Nothing has changed between v0.10.6 and v0.10.7 related to the connections so the issue is at your end. The issue you might be having is that the global discovery wins over local discovery and then UPnP somehow doesn’t work from inside the network. I say this because the IP and the port from the log does not look local, it looks external.

If all of the devices are on the same subnet, I suggest you disable global discovery and see if that helps, or perhaps tweak UPnP settings, as it seems that the UPnP mappings are somehow timing out? Is your lease length larger than the renewal period?

Rewt0r · November 25, 2014, 2:02pm

Hi @AudriusButkevicius,

Let me explain a bit clearer…

My Home Network -> My Laptop My Home Network -> Another Laptop DC Network -> Our Firewall -> Central File Server One Business Building Network (No firewall) -> L’s Laptop

My Laptop can connect to our Central File Server My Laptop can connect to Another Laptop Another Laptop can connect to our Central File Server Another Laptop can connect to My Laptop Another Laptop can connect to L’s Laptop L’s Laptop can connect to our Central File Server L’s Laptop can connect to Another Laptop

As you can see My Laptop cannot connect to L’s laptop and L’s laptop can’t connect to My Laptop yet there is no reason why it shouldn’t

Rewt0r · November 25, 2014, 2:36pm

I’ve just managed to get the log from L’s laptop and as you can see there are lots and lots of the same error message trying to connect to My Laptop.

[XXXXX] 2014/11/25 13:55:14.545432 main.go:802: DEBUG: Renewed UPnP port mapping for external port 33143 on device ‘BT Business Hub 3.0 Media Gateway’ (192.168.1.254). [XXXXX] 2014/11/25 13:57:01.783213 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-XXXXXXX-QUUYQ4G-3L4GGD6-XXXXXXX-xxxxxxx-XXXXXXX 81.98.65.XXX:22000 [XXXXX] 2014/11/25 13:57:22.785549 main.go:1036: DEBUG: dial tcp 81.98.65.XXX:22000: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond. [XXXXX] 2014/11/25 13:58:22.786370 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-XXXXXXX-QUUYQ4G-3L4GGD6-XXXXXXX-xxxxxxx-XXXXXXX 81.98.65.XXX:22000 [XXXXX] 2014/11/25 13:58:43.791283 main.go:1036: DEBUG: dial tcp 81.98.65.XXX:22000: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond. [XXXXX] 2014/11/25 14:01:45.869149 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-XXXXXXX-QUUYQ4G-3L4GGD6-XXXXXXX-xxxxxxx-XXXXXXX 81.98.65.XXX:22000 [XXXXX] 2014/11/25 14:02:06.872646 main.go:1036: DEBUG: dial tcp 81.98.65.XXX:22000: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond. [XXXXX] 2014/11/25 14:03:36.965190 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-XXXXXXX-QUUYQ4G-3L4GGD6-XXXXXXX-xxxxxxx-XXXXXXX 81.98.65.XXX:22000 [XXXXX] 2014/11/25 14:03:57.965725 main.go:1036: DEBUG: dial tcp 81.98.65.XXX:22000: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond. [XXXXX] 2014/11/25 14:04:57.974657 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-XXXXXXX-QUUYQ4G-3L4GGD6-XXXXXXX-xxxxxxx-XXXXXXX 81.98.65.XXX:22000 [XXXXX] 2014/11/25 14:05:18.979081 main.go:1036: DEBUG: dial tcp 81.98.65.XXX:22000: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond. [XXXXX] 2014/11/25 14:08:04.198283 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-XXXXXXX-QUUYQ4G-3L4GGD6-XXXXXXX-xxxxxxx-XXXXXXX 81.98.65.XXX:22000 [XXXXX] 2014/11/25 14:08:25.201402 main.go:1036: DEBUG: dial tcp 81.98.65.XXX:22000: ConnectEx tcp: A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond.

Rewt0r · November 25, 2014, 3:14pm

One thing I have noticed whilst trying to debug is that Syncthing tries connecting to each device one by one, it’s like it begins trying to connect to device 1, wait for success or failure then moves to the next device. Why doesn’t it just try connecting to them all at once?

AudriusButkevicius · November 25, 2014, 4:11pm

It does that exactly, concurrently, which doesn’t mean parallel. The fact that all of them have the sleep timer of the same length makes it look like it does one after another.

Have you got any limiter settings enabled anywhere? I had similar failures due to crazy low limiter settings.

If not, can you actually telnet from one device to another? I suggest you shut down syncthing after it acquired a UPnP mapping, do a nc -l 22000 and then on the other end do echo test | nc <external ip> <the upnp port> to see if you can actually communicate.

Rewt0r · November 25, 2014, 4:12pm

Right I’ve managed to find another bug whilst investigating this.

I just added 2 new devices in another network, and My Laptop isn’t connecting to one of them. What’s weird is the log shows connecting to New Device #1’s ID but has actually connected to New Device #2 and hence isn’t trying to connect to that id anymore!

It seems as though all of the errors in here are coming from UPnP. I’ve found that restarting Syncthing using the webGUI doesn’t correctly re-map UPnP but if I close the syncthing process and start it up again it works as expected.

EDIT: Also Syncthing is definitely waiting for a success or failure before attempting connection to the next device. It’s proven by the log print out, it shows the debug DIAL then nothing else until a failure/success

EDIT 2: Yes I had tried telnetting to no avail. There’s definitely something going wrong with the UPnP mapping which is what’s causing these issues - even though all UPnP events report NO errors

EDIT 3: I can see Syncthing trying to connect to the non-UPnP IP/Port in the logs several times… It’s like the Discovery server isn’t passing the correct information or is receiving the wrong information to begin with

AudriusButkevicius · November 25, 2014, 4:28pm

Well if your UPnP says Yes I've mapped a port for you, what error do you want us to report? I’ve checked the code, and you are right, it does not dial concurrently, it dials and listens concurrently, but dials each connection one by one.

I think (not sure) that the announce server holds last 1800 seconds worth of announcements and provides all of them to the client. Then the client dials them one by one to see which one works.

Restarting syncthing via web UI has the same effect as restarting syncthing process, it has no difference. The monitor process (which does the restarting) does nothing apart from starting the child process and holds no state.

Rewt0r · November 25, 2014, 4:33pm

See this from the log on MY machine:

[XXXXXXX] 2014/11/25 16:29:47.388358 main.go:1022: DEBUG: dial XXXXXXX-XXXXXXX-SMMRWBB-WCLBC7F-XXXXXXX-XXXXXXX-R5AU6IV-XXXXXXX 217.43.229.XXX:22000

BUT it should actually be connecting to this like our central file server has:

217.43.229.xxx:55361

Would it also be possible to make the connections concurrent as well?

AudriusButkevicius · November 25, 2014, 4:36pm

Well, perhaps as you restarted the server your UPnP device refused to issue a port mapping, which caused syncthing to assume there is no UPnP device and advertise it’s native port.

I think it would be possible to do everything concurrently, and you can make a pull request on Github, but I don’t see how this solves your problem or any problem actually.

Rewt0r · November 25, 2014, 4:42pm

Concurrently is definitely the preferred method so Syncthing doesn’t have to go through 5 failed connections before getting a valid connection and starting to sync.

Why is the discovery service reporting the right external port for the other device on that network which I can connect to? It makes no sense

AudriusButkevicius · November 25, 2014, 4:43pm

I don’t know, run with STTRACE=discovery to see what are the actual lookups returning.

Rewt0r · November 25, 2014, 4:49pm

Already have been doing that, the output doesn’t seem to show any IPs just hex output?

[xxxxx] 2014/11/25 16:45:08.200834 discover.go:434: DEBUG: discover: read external: 00000000 9d 79 bc 39 00 00 00 20 98 e2 9b fe aa 4e 67 3a |.y.9… …Ng:| 00000010 16 2d e9 df 28 52 98 87 36 be 18 c3 f6 54 c0 a5 |.-…(R…6…T…| 00000020 4e e3 b7 d4 b4 93 44 90 00 00 00 01 00 00 00 04 |N…D…| 00000030 51 62 41 c1 00 00 6e 70 00 00 00 00 |QbA…np…| [xxxxx] 2014/11/25 16:45:08.200834 discover.go:286: DEBUG: discover: external lookup check: [81.255.255.193:28272] [xxxxx] 2014/11/25 16:45:12.202996 discover.go:434: DEBUG: discover: read external: 00000000 9d 79 bc 39 00 00 00 20 d1 a5 12 72 59 fd bf 14 |.y.9… …rY…| 00000010 cf 7c 05 c3 a9 e1 58 7b 6e b4 6a 15 2d 0c 96 70 |.|…X{n.j.-…p| 00000020 1e 9e 65 15 64 28 c0 79 00 00 00 01 00 00 00 04 |…e.d(.y…| 00000030 55 0a fa 92 00 00 6e 70 00 00 00 00 |U…np…| [xxxxx] 2014/11/25 16:45:17.466880 discover.go:434: DEBUG: discover: read external: 00000000 9d 79 bc 39 00 00 00 20 43 bb fe d6 f9 b5 a7 de |.y.9… C…| 00000010 36 f2 62 ad a8 ee 44 20 60 04 3a 54 79 e1 1c 0c |6.b…D .:Ty...| 00000020 52 7c c2 42 bd 7c 24 2e 00 00 00 01 00 00 00 04 |R|.B.|$.........| 00000030 51 8a c9 85 00 00 55 f0 00 00 00 00 |Q.....U.....| [xxxxx] 2014/11/25 16:45:17.779066 discover.go:434: DEBUG: discover: read external: 00000000 9d 79 bc 39 00 00 00 20 78 77 e8 38 4c 54 c2 ec |.y.9... xw.8LT..| 00000010 dc 0e 04 0f a0 23 89 71 26 f3 e1 ad 60 d6 03 37 |.....#.q&...…7| 00000020 62 ed 5e 81 7e ca c1 64 00 00 00 01 00 00 00 04 |b.^.~…d…| 00000030 51 8a c9 85 00 00 55 f0 00 00 00 00 |Q…U…| [xxxxx] 2014/11/25 16:45:18.087195 discover.go:434: DEBUG: discover: read external: 00000000 9d 79 bc 39 00 00 00 20 7d d8 ad 87 b6 d7 08 62 |.y.9… }…b| 00000010 96 56 37 e6 c8 d0 65 bf c3 7f 52 6e dd e3 50 2f |.V7…e…Rn…P/| 00000020 a1 db 7e 79 09 db d9 05 00 00 00 01 00 00 00 04 |…~y…| 00000030 d9 2b e5 1f 00 00 c1 8c 00 00 00 00 |.+…| [xxxxx] 2014/11/25 16:45:18.351976 discover.go:434: DEBUG: discover: read external: 00000000 9d 79 bc 39 00 00 00 20 a8 21 b3 19 be 9a da 15 |.y.9… .!..| 00000010 b2 cc d4 77 67 2e 96 0a 79 7a dc 1a 12 28 4d c8 |…wg…yz…(M.| 00000020 b5 ca 90 4d 4c cf fb d7 00 00 00 01 00 00 00 04 |…ML…| 00000030 1f 36 57 bb 00 00 7a af 00 00 00 00 |.6W…z…| [] 2014/11/25 16:45:18.592759 discover.go:434: DEBUG: discover: read external: 00000000 9d 79 bc 39 00 00 00 20 c5 4a 8f bc f1 81 3c 04 |.y.9… .J…<.| 00000010 49 8c 8d 82 1b 09 61 17 fb 83 82 f9 d1 2f c0 13 |I…a…/…| 00000020 31 e8 29 e4 54 0d 4c b2 00 00 00 01 00 00 00 04 |1.).T.L…| 00000030 d9 2b e5 1f 00 00 55 f0 00 00 00 00 |.+…U…| [xxxxx] 2014/11/25 16:45:18.906054 discover.go:434: DEBUG: discover: read external: 00000000 9d 79 bc 39 00 00 00 20 e0 76 32 31 d5 77 60 39 |.y.9… .v21.w`9| 00000010 e3 01 8a 4a ee f0 51 81 27 3f fb 8d 85 b8 e8 71 |…J…Q.’?..q| 00000020 8b 40 53 4d aa d4 c3 ed 00 00 00 01 00 00 00 04 |.@SM…| 00000030 51 8a c9 85 00 00 55 f0 00 00 00 00 |Q…U…|

AudriusButkevicius · November 25, 2014, 6:35pm

TDRJX7V-> 81.255.255.193:28272
2GSRE4S-> 85.255.255.146:28272
IO575VX-> 81.255.255.133:22000
PB36QOC-> 81.255.255.133:22000
PXMK3B5-> 217.255.255.31:49548
VAQ3GGN-> 31.255.255.187:31407
YVFI7PH-> 217.255.255.31:22000
4B3DEMO-> 81.255.255.133:22000

That’s the mappings you get from the announce server. Why are the mappings like this, ask your UPnP device. I guess one of the nodes just fails to get a mapping advertising an incorrrect port.

I can see that all devices in 81.255.255.133 are failing, as well as one device in 217.255.255.31

Rewt0r · November 25, 2014, 7:09pm

81.255.255.133 is failing for internal and external clients (firewall issue I have no access to - waiting for them to allow the ports)

The thing is - our server is getting the correct mapping for YVFI7PH whereas my Laptop is not, that mapping came from my laptop. I’ve enabled discovery logging on our server now in order to get the mapping that receives but this device isn’t currently online so I’m just waiting for it to come back on.

Also how did you convert the IPs into that?

AudriusButkevicius · November 25, 2014, 8:30pm

Last line (always offset 00000030): 51 8a c9 85 is the 4 bytes representing an IP in hex, 51 hex is 81 in dec, etc, etc. 00 00 55 f0 is the 4 byte (16 bit) integer representing the port, 55f0 in hex is 22000 in dec

Basically, if one of your machines made false announcements, you should probably restart it and make sure UPnP works. Then give an hour or two for the announce cache to clear, or reconfigure the IP address of the node from dynamic to some correct address.

Rewt0r · November 25, 2014, 8:43pm

Is there a way to force the announce cache to be cleared?

I have a feeling that there is an issue with BT Hubs UPnP function… Problems seem to be arising from clients connected to either BT Home Hub or BT Business Hub and thanks for explaining which parts of the debug information I need to use.

AudriusButkevicius · November 25, 2014, 9:43pm

So the cache entries are only valid for an hour, and given they are not re-advertised, they expire. But it seems that entries for the same IP with a new port should be updated instantly, given the announcement arrives: https://github.com/syncthing/discosrv/blob/master/main.go#L345