I have Syncthing 1.4.2 running on eight servers, arranged as four pairs, syncing a single folder across two data centers; each server syncs with its peer in the other data center. The folders hold files stored by a Minio instance running on the same servers, and other servers in the data centers write and delete those files through Minio. With some of these server pairs, Syncthing seems to hit a wall during synchronization and simply stops syncing altogether. Tracking down the cause has proved elusive, as I cannot find anything in the logs that seems out of the ordinary or particularly consistent.
Here’s what I’ve observed:
Scanning seems to run to completion; the log shows an entry about this after each scan.
Syncing seems to progress until it hits a puller error, which can be “generic error”, “no such file”, or “no connected device has the required version of this file”.
After the puller error, Syncthing appears to remain stuck in a loop; the log shows progress emitter messages with “nothing new”, and connection manager messages show “reconnect loop” every minute. During this time, the watcher still seems to run and log changes, but no data goes across the wire.
The thread count goes up from around 65-70 during idle to anywhere between 100 and 800 threads, and then stays there.
The folders do not show any status change: they could be stuck on “syncing (99%)”, “syncing (100%)”, or “failed items”, and the status simply doesn’t budge. The last scan timestamp also remains unchanged during this time, and always coincides with the time when the thread count climbs and stays stuck. The download/upload rate also remains at 0 bytes/second except for the occasional ping.
Syncthing remains in this state until I pause and unpause the shared folder, at which time it repeats the above scenario. It can remain there for a weekend or more if I choose to let it.
I first started noticing this issue with version 1.3.0, and tried resetting the indices and deltas, which didn’t help. With that version the folders would often show a negative data to sync total whenever sync stopped. Turning off the watcher in either 1.3.0 or 1.4.2 seems to only delay the issue.
Can anyone here help diagnose this issue? I’d like to at least know which debugging facility could provide the smoking gun that will nail the culprit.
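For anyone wanting to line up the thread-count jump with log timestamps, here is a minimal sketch that reads the thread count of a process on Linux. It assumes the servers expose `/proc/<pid>/status` with a `Threads:` line (standard on Linux); the helper name `parseThreads` and the command-line usage are made up for illustration and are not part of Syncthing.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// parseThreads extracts the number on the "Threads:" line from the text of a
// Linux /proc/<pid>/status file. (Hypothetical helper, not a Syncthing API.)
func parseThreads(status string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "Threads:") {
			value := strings.TrimSpace(strings.TrimPrefix(line, "Threads:"))
			return strconv.Atoi(value)
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("no Threads: line found")
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: threads <pid>")
		os.Exit(2)
	}
	// Read the status file of the given process ID.
	data, err := os.ReadFile("/proc/" + os.Args[1] + "/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	n, err := parseThreads(string(data))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Print a timestamped sample so it can be matched against the Syncthing log.
	fmt.Printf("%s threads=%d\n", time.Now().Format(time.RFC3339), n)
}
```

Run once a minute (from cron, say) with the Syncthing process ID as the argument, this gives a timeline showing when the count leaves the idle range of 65-70.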
I suspect you should in general end up with tons of sync conflicts, because it seems the files are being modified concurrently on both sides, and it looks like they are changing faster than Syncthing is able to sync them.
Wow, you’ve whipped up a reply storm here, I didn’t expect that.
I think I could compile my own version; I have Go 1.13.1 installed on my machine, though I don’t know if the project has switched to a new version.
In our use case sync conflicts are not an issue. We have set things up so that for every context (i.e., a subset of files belonging to a particular entity) exactly one server gets to write the files, and the other servers assigned to the context only get to read them. This results in writes occurring on both sides simultaneously, but writes for any one context only occur on one side.
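A minimal sketch of that ownership rule, to make it concrete. The `owners` type, the `canWrite` method, and the context and server names are all invented for illustration; they do not describe the real setup.

```go
package main

import "fmt"

// owners maps a context name to the single server allowed to write its files.
// All names used here are hypothetical.
type owners map[string]string

// canWrite reports whether server is the designated writer for context.
// Every other server assigned to the context may only read.
func (o owners) canWrite(server, context string) bool {
	writer, ok := o[context]
	return ok && writer == server
}

func main() {
	o := owners{
		"context-a": "dc1-server1", // written only in data center 1
		"context-b": "dc2-server1", // written only in data center 2
	}
	fmt.Println(o.canWrite("dc1-server1", "context-a"))
	fmt.Println(o.canWrite("dc2-server1", "context-a"))
}
```

With a mapping like this, both data centers see writes at the same time, but no single file ever has two writers.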
If a patch comes out, I’ll build and try it out.
The build seems to run well so far on the two servers where I installed it. I noticed that one of the servers had a lot of threads running for some unknown reason, so I took a stack trace of it.
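For reading a large trace like that, a rough sketch of a summarizer may help: it counts goroutines by their top stack frame. It assumes the trace is in the Go runtime's usual text format (a `goroutine N [state]:` header followed by the frames); the function name `countTopFrames` is made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"sort"
	"strings"
)

// countTopFrames scans a goroutine dump in the Go runtime's text format and
// counts how many goroutines share the same top frame (function name only).
func countTopFrames(dump string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(dump))
	expectFrame := false
	for sc.Scan() {
		line := sc.Text()
		switch {
		case strings.HasPrefix(line, "goroutine ") && strings.HasSuffix(line, "]:"):
			// Header line; the next non-empty line is the top frame.
			expectFrame = true
		case expectFrame && line != "":
			// Drop the argument list: "pkg.Func(0x1, 0x2)" becomes "pkg.Func".
			if i := strings.LastIndex(line, "("); i > 0 {
				line = line[:i]
			}
			counts[line]++
			expectFrame = false
		}
	}
	return counts
}

func main() {
	// Read the whole dump from standard input.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	counts := countTopFrames(string(data))
	names := make([]string, 0, len(counts))
	for name := range counts {
		names = append(names, name)
	}
	// Most common top frame first; ties are broken alphabetically.
	sort.Slice(names, func(i, j int) bool {
		if counts[names[i]] != counts[names[j]] {
			return counts[names[i]] > counts[names[j]]
		}
		return names[i] < names[j]
	})
	for _, name := range names {
		fmt.Printf("%6d  %s\n", counts[name], name)
	}
}
```

Piping the saved trace into this on standard input prints one line per distinct top frame, so a pile-up of hundreds of identical goroutines stands out immediately.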
Thanks for testing. That thread count is a direct effect of the PR and something that’s already under discussion in there (under a slightly different context mostly but still). It’s a ton of that stuff:
Thanks to all for your help on this. I installed the same build I compiled (the one making the file recheck async) on another pair of servers, and it seems to behave properly there as well. I’ll hold off on making any other changes until at least the next release candidate, though I’m glad that the issue is now addressed.