Basically, the problem is that the two devices appear connected, but syncing is stalled, and this state seems to go for hours. Moreover, the first device shows the other as “connected” even after shutting Syncthing down on the remote side.
The only workaround that seems to work is to restart Syncthing on the first device. This will reset the connection, and only then the remote device can actually connect for real and continue the synchronisation.
Do you have any idea what the culprit may be here? Both devices run the same version of Syncthing, and they are located on the same local network.
I have no idea, but full logs and a goroutine trace (kill -QUIT or using the online profiler as described … previously … somewhere) would be useful in narrowing it down, assuming something has deadlocked somehow.
Just to confirm, as the problem has currently got fixed by restarting Syncthing on the first device, so I will likely need to wait for it to reoccur to get any more information.
Will the normal logs be of any use, or do I need to enable debug logging (e.g. connections)?
Will STPROFILER (i.e. http://localhost:9090/debug/pprof) be enough for this purpose, or do I need to use something else? I can use kill -QUIT on the second device, but not on the first one (due to it being Windows).
I have the standard logfiles from both devices. I have filtered all lines that mention the other device in each of them.
D4DZUGP:
[D4DZU] 2021/07/26 06:04:26 INFO: Established secure connection to 6I7GNCY at 10.0.0.2:22000-10.0.0.5:53529/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[D4DZU] 2021/07/26 06:04:26 INFO: Device 6I7GNCY client is "syncthing v1.18.1-dev.5.g1e19ed20-tomasz86-v1.18.0" named "6I7GNCY" at 10.0.0.2:22000-10.0.0.5:53529/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[D4DZU] 2021/07/26 06:13:23 INFO: Connection to 6I7GNCY at 10.0.0.2:22000-10.0.0.5:53529/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 closed: reading length: read tcp 10.0.0.2:22000->10.0.0.5:53529: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
[D4DZU] 2021/07/26 06:31:51 INFO: Established secure connection to 6I7GNCY at 10.0.0.2:22000-10.0.0.5:53891/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[D4DZU] 2021/07/26 06:31:51 INFO: Device 6I7GNCY client is "syncthing v1.18.1-dev.5.g1e19ed20-tomasz86-v1.18.0" named "6I7GNCY" at 10.0.0.2:22000-10.0.0.5:53891/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[D4DZU] 2021/07/26 22:26:15 INFO: Pausing 6I7GNCY
[D4DZU] 2021/07/26 22:26:34 INFO: Resuming 6I7GNCY
[D4DZU] 2021/07/26 22:28:46 INFO: Established secure connection to 6I7GNCY at 10.0.0.2:22000-10.0.0.5:54416/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[D4DZU] 2021/07/26 22:28:46 INFO: Replacing old connection 10.0.0.2:22000-10.0.0.5:53891/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 with 10.0.0.2:22000-10.0.0.5:54416/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 for 6I7GNCY
[D4DZU] 2021/07/26 22:32:56 INFO: Pausing 6I7GNCY
[D4DZU] 2021/07/26 22:32:58 INFO: Resuming 6I7GNCY
[D4DZU] 2021/07/26 22:33:35 INFO: Pausing 6I7GNCY
[D4DZU] 2021/07/26 22:33:47 INFO: Resuming 6I7GNCY
[D4DZU] 2021/07/26 22:34:16 INFO: Pausing 6I7GNCY
[D4DZU] 2021/07/26 22:34:21 INFO: Resuming 6I7GNCY
[D4DZU] 2021/07/26 22:35:17 INFO: Pausing 6I7GNCY
[D4DZU] 2021/07/26 22:35:25 INFO: Resuming 6I7GNCY
[D4DZU] 2021/07/26 22:35:33 INFO: Pausing 6I7GNCY
[D4DZU] 2021/07/26 22:35:35 INFO: Resuming 6I7GNCY
[D4DZU] 2021/07/26 22:36:11 INFO: Pausing 6I7GNCY
[D4DZU] 2021/07/26 22:36:13 INFO: Resuming 6I7GNCY
[D4DZU] 2021/07/26 22:37:11 INFO: Pausing 6I7GNCY
[D4DZU] 2021/07/26 22:37:14 INFO: Resuming 6I7GNCY
[D4DZU] 2021/07/26 22:37:53 INFO: Pausing 6I7GNCY
[D4DZU] 2021/07/26 22:38:11 INFO: Resuming 6I7GNCY
[D4DZU] 2021/07/26 22:47:12.093773 model.go:2845: INFO: Pausing 6I7GNCY
[D4DZU] 2021/07/26 22:47:15.406051 model.go:2854: INFO: Resuming 6I7GNCY
[D4DZU] 22:49:21 INFO: Connection to 6I7GNCY at 10.0.0.2:22000-10.0.0.5:53891/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 closed: device is paused
[D4DZU] 22:49:21 INFO: Device 6I7GNCY client is "syncthing v1.18.1-dev.5.g1e19ed20-tomasz86-v1.18.0" named "6I7GNCY" at 10.0.0.2:22000-10.0.0.5:54416/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[D4DZU] 22:49:21 INFO: Connection to 6I7GNCY at 10.0.0.2:22000-10.0.0.5:54416/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 closed: reading length: read tcp 10.0.0.2:22000->10.0.0.5:54416: wsarecv: An existing connection was forcibly closed by the remote host.
[D4DZU] 2021/07/26 22:49:34 INFO: Device 6I7GNCY is "6I7GNCY" at [dynamic]
[D4DZU] 2021/07/26 22:49:36 INFO: Established secure connection to 6I7GNCY at [::]:22000-10.0.0.5:22000/quic-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[D4DZU] 2021/07/26 22:49:36 INFO: Device 6I7GNCY client is "syncthing v1.18.1-dev.5.g1e19ed20-tomasz86-v1.18.0" named "6I7GNCY" at [::]:22000-10.0.0.5:22000/quic-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[D4DZU] 2021/07/26 22:49:38 INFO: Established secure connection to 6I7GNCY at 10.0.0.2:22000-10.0.0.5:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[D4DZU] 2021/07/26 22:49:38 INFO: Replacing old connection [::]:22000-10.0.0.5:22000/quic-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 with 10.0.0.2:22000-10.0.0.5:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 for 6I7GNCY
[D4DZU] 2021/07/26 22:49:38 INFO: Connection to 6I7GNCY at [::]:22000-10.0.0.5:22000/quic-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 closed: replacing connection
[D4DZU] 2021/07/26 22:49:38 INFO: Device 6I7GNCY client is "syncthing v1.18.1-dev.5.g1e19ed20-tomasz86-v1.18.0" named "6I7GNCY" at 10.0.0.2:22000-10.0.0.5:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[D4DZU] 2021/07/26 22:49:38 INFO: Connected to already connected device 6I7GNCY (existing: 10.0.0.2:22000-10.0.0.5:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 new: 10.0.0.2:22000-10.0.0.5:54753/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256)
[D4DZU] 2021/07/26 23:14:05 INFO: Connection to 6I7GNCY at 10.0.0.2:22000-10.0.0.5:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 closed: closed by remote: Syncthing is being stopped
6I7GNCY
[6I7GN] 2021/07/26 06:04:31 INFO: Device D4DZUGP is "D4DZUGP" at [dynamic]
[6I7GN] 2021/07/26 06:04:33 INFO: Established secure connection to D4DZUGP at 10.0.0.5:53529-10.0.0.2:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[6I7GN] 2021/07/26 06:04:33 INFO: Device D4DZUGP client is "syncthing v1.18.1-dev.5.g1e19ed20-tomasz86-v1.18.0" named "D4DZUGP" at 10.0.0.5:53529-10.0.0.2:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[6I7GN] 2021/07/26 06:11:44 INFO: Connection to D4DZUGP at 10.0.0.5:53529-10.0.0.2:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 closed: Syncthing is being stopped
[6I7GN] 2021/07/26 06:31:56 INFO: Device D4DZUGP is "D4DZUGP" at [dynamic]
[6I7GN] 2021/07/26 06:31:58 INFO: Established secure connection to D4DZUGP at 10.0.0.5:53891-10.0.0.2:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[6I7GN] 2021/07/26 06:31:58 INFO: Device D4DZUGP client is "syncthing v1.18.1-dev.5.g1e19ed20-tomasz86-v1.18.0" named "D4DZUGP" at 10.0.0.5:53891-10.0.0.2:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[6I7GN] 2021/07/26 22:26:24 INFO: Connection to D4DZUGP at 10.0.0.5:53891-10.0.0.2:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 closed: reading length: read tcp 10.0.0.5:53891->10.0.0.2:22000: read: connection reset by peer
[6I7GN] 2021/07/26 22:28:55 INFO: Established secure connection to D4DZUGP at 10.0.0.5:54416-10.0.0.2:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[6I7GN] 2021/07/26 22:28:55 INFO: Device D4DZUGP client is "syncthing v1.18.1-dev.5.g1e19ed20-tomasz86-v1.18.0" named "D4DZUGP" at 10.0.0.5:54416-10.0.0.2:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[6I7GN] 2021/07/26 22:32:46 INFO: Pausing D4DZUGP
[6I7GN] 2021/07/26 22:32:46 INFO: Connection to D4DZUGP at 10.0.0.5:54416-10.0.0.2:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 closed: device is paused
[6I7GN] 2021/07/26 22:32:48 INFO: Resuming D4DZUGP
[6I7GN] 2021/07/26 22:36:24 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:37:00 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:38:50 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:39:30 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:41:13 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:41:55 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:43:28 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:44:12 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:45:35 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:46:08 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:47:32 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:48:20 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:48:32 INFO: Device D4DZUGP is "D4DZUGP" at [dynamic]
[6I7GN] 2021/07/26 22:49:14 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:49:39 INFO: Failed to exchange Hello messages with D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: deadline exceeded
[6I7GN] 2021/07/26 22:49:45 INFO: Established secure connection to D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[6I7GN] 2021/07/26 22:49:45 INFO: Device D4DZUGP client is "syncthing v1.18.1-dev.5.g1e19ed20-tomasz86-v1.18.0" named "D4DZUGP" at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[6I7GN] 2021/07/26 22:49:46 INFO: Established secure connection to D4DZUGP at 10.0.0.5:22000-10.0.0.2:22000/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[6I7GN] 2021/07/26 22:49:46 INFO: Replacing old connection [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 with 10.0.0.5:22000-10.0.0.2:22000/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 for D4DZUGP
[6I7GN] 2021/07/26 22:49:46 INFO: Connection to D4DZUGP at [::]:22000-10.0.0.2:22000/quic-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 closed: replacing connection
[6I7GN] 2021/07/26 22:49:46 INFO: Device D4DZUGP client is "syncthing v1.18.1-dev.5.g1e19ed20-tomasz86-v1.18.0" named "D4DZUGP" at 10.0.0.5:22000-10.0.0.2:22000/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
[6I7GN] 2021/07/26 22:49:47 INFO: Connected to already connected device D4DZUGP (existing: 10.0.0.5:22000-10.0.0.2:22000/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 new: 10.0.0.5:54753-10.0.0.2:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256)
[6I7GN] 2021/07/26 22:49:47 INFO: Failed to exchange Hello messages with D4DZUGP at [fe80::5e3c:27ff:fe37:988d%wlan0]:22000-[fe80::9868:3753:49ff:a06%wlan0]:22000/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: write tcp [fe80::5e3c:27ff:fe37:988d%wlan0]:22000->[fe80::9868:3753:49ff:a06%wlan0]:22000: write: connection reset by peer
[6I7GN] 2021/07/26 23:14:13 INFO: Connection to D4DZUGP at 10.0.0.5:22000-10.0.0.2:22000/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 closed: Syncthing is being stopped
However, they show nothing going on between 2021/07/26 06:31 (when I believe 6I7GNCY was turned on) and 2021/07/26 22:26 (when I noticed the issue and started trying to fix it). Files were stuck at least since 2021/07/26 10:37, when I deleted a file on D4DZUGP, but the deletion was not synced to 6I7GNCY. It was listed in the Out of Sync Items on D4DZUGP, but 6I7GNCY kept showing everything Up to Date. However, the folder state didn’t match, which made me realise that synchronisation was indeed stuck. Then at 2021/07/26 22:26 I started trying pausing and resuming devices to get them to work again, which didn’t help, so in the end I had to restart Syncthing on D4DZUGP to force it to finally reconnect for real and synchronise the files.
No, D4DZUGP is my fastest desktop PC, and 6I7GNCY is a 4-core midrange Android tablet. It usually sits connected to a power source with no activity during the day, except for Syncthing running in the background. It also syncs just a few folders, most of which rarely change.
For the record, the Web GUI was available all the time during the stall, and it was very responsive (on both devices). Also, restarting Syncthing on 6I7GNCY did nothing, and it was only after D4DZUGP restarted that the two devices resumed synchronisation, so at this moment I suspect D4DZUGP rather than 6I7GNCY to be the culprit here (and D4DZUGP definitely lacks no CPU, RAM or I/O).
There’s lots of low-level errors about connetions timing out, so there seems to be some networking issues. That there is no no logs about closed connections when pausing is unexpected: It looks like it stalls on closing the connection. The Replacing old... messages is followed by waiting for the connection to close and there’s not more logging after that. A goroutine dump would help (which can be obtained through like you said in 2.).
The issue has just happened again. This time with a different device on the other side, but the main culprit seems to be D4DZUG again. HTNCZ5 is a 4-core Intel Atom netbook running Windows, with the database located on a fast internal SSD, so there should be no I/O slowdowns here either.
The rest of the symptoms are exactly the same as described previously. I have taken goroutine dumps from both devices. Please have a look at them when you have some free time .
Not sure if related, but I have just had Syncthing on D4DZUG completely stuck after removing a folder, and then trying to restart using the GUI. The GUI did show a message about Syncthing getting shut down, but the process was hanging in the background. I had to kill it, and then restart manually to bring the program back to life.
Then, once it did restart, the previously “removed” folder has come back, and I had to remove it again, this time successfully. I think that the issue may be related, because when I was removing the folder for the first time, D4DZUG appeared to be stuck in the same state as before again (i.e. files stuck in transfer, etc.).
Is this true for both cases? I was indeed trying to delete a folder in the latter (although I don’t remember if it was paused or not), but in the former there were no folder deletions involved whatsoever.