Sync stops working, no progress made

I have Syncthing 1.4.2 running on eight servers in four pairs, syncing a single folder across two data centers; each server syncs to its peer in the other data center. The folders hold files written by an instance of Minio running on the same servers, and other servers in the data centers write and delete those files through Minio. With some of these server pairs, Syncthing seems to hit a wall during synchronization and simply stops syncing altogether. Tracking down the cause has proved extremely difficult, as I cannot find anything in the logs that seems out of the ordinary or particularly consistent.

Here’s what I’ve observed:

  • Scanning seems to run to completion; the log shows an entry to that effect after each scan.
  • Syncing seems to progress until it hits a puller error, which can be “generic error”, “no such file”, or “no connected device has the required version of this file”.
  • After the puller error, Syncthing appears to remain stuck in a loop: the progress emitter logs “nothing new”, and the connection manager logs a “reconnect loop” every minute. During this time the watcher still seems to run and log changes, but no data goes across the wire.
  • The thread count climbs from around 65–70 at idle to anywhere between 100 and 800 threads, and then stays there.
  • The folders do not show any status change: they can be stuck on “syncing (99%)”, “syncing (100%)”, or “failed items”, and simply don’t budge. The last scan timestamp also remains unchanged during this time, and always coincides with the time the thread count climbs and gets stuck. The download/upload rate also remains at 0 bytes/second except for the occasional ping.
  • Syncthing remains in this state until I pause and unpause the shared folder, at which time it repeats the above scenario. It can remain there for a weekend or more if I choose to let it.

I first started noticing this issue with version 1.3.0 and tried resetting the indices and deltas, which didn’t help. With that version the folders would often show a negative “data to sync” total whenever sync stopped. Turning off the watcher in either 1.3.0 or 1.4.2 seems only to delay the issue.

Can anyone here help diagnose this issue? I’d like to at least know which debugging facility could provide the smoking gun that will nail the culprit.

Thanks in advance.

To reduce the noise, ideally keep only one of the folders that gets stuck.

Wait for it to get stuck, and once it’s in that state send it a SIGQUIT signal, and capture the stacktraces it dumps to stdout.
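Something like this should do it, assuming syncthing’s stdout ends up in the journal (the process and unit names here are examples; adjust them for your setup):

```shell
# Send SIGQUIT to the running syncthing process; the Go runtime reacts
# by dumping the stack traces of all goroutines to stderr/stdout and exiting.
kill -QUIT "$(pidof syncthing)"

# If stdout is captured by systemd, pull the dump back out of the journal
# (unit name is an example; use your actual service name):
journalctl -u syncthing.service --since '-5 min' > stacktraces.txt
```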

Upload them somewhere and post the link here.

Logs from the model facility are the interesting ones here, I guess, but if it really gets stuck, then I suspect the stacktraces will be more useful.

I captured the following stack traces:

I’ll try to capture log entries from the model facility.

Can you get the stacktraces from stdout rather than the service manager?

systemd seems to garble the logs so that you can’t run them through tools that help analyse them.

Nvm, nothing a bit of sed can’t fix.
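For reference, the cleanup was roughly this (the journald prefix format below is an assumption; adjust the pattern to whatever your journal actually emits):

```shell
# journald prefixes every line with a timestamp, hostname and unit tag, e.g.
#   "Apr 28 10:00:00 host1 syncthing[1234]: goroutine 7 [select]:"
# Stripping everything up to and including the "unit[pid]: " tag restores a
# dump that stack-analysis tools can parse.
sed -E 's/^.*syncthing\[[0-9]+\]: //' garbled.log > stacktraces.txt
```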

Do you have the logs to accompany the traces by any chance?

Ideally with some logging facility enabled, so we can see where it gets logged and where the error happens.

Are you able to build your own version to test something out?

I’ve raised a bug here:

This is only hit under very specific conditions.

I suspect you should in general end up with tons of sync conflicts, because it seems the files are being modified concurrently on both sides, and it looks like they are changing faster than Syncthing is able to sync them.

Wow, you’ve whipped up a reply storm here, I didn’t expect that.

I think I could compile my own version; I have Go 1.13.1 installed on my machine, though I don’t know if the project has since moved to a newer version.

In our use case sync conflicts are not an issue. We have set things up so that for every context (i.e., a subset of files belonging to a particular entity) exactly one server gets to write the files, and the other servers assigned to that context only read them. Writes do occur on both sides simultaneously, but the writes for any one context occur only on one side.

If a patch comes out, I’ll build and try it out.

The logs for the respective stack traces:

You can try the build from and see if that fixes it.

I’ve checked out the PR, built it, and deployed it on one of the server pairs. We’ll see how it goes.

There was a build of the branch available on the build server, but it’s not really relevant anymore, I guess.

The build seems to run well so far on the two servers where I installed it. I noticed that one of the servers had a lot of threads running for some unknown reason, so I took a stack trace of it.

I forgot to enable debugging on it, sorry about that.

Screenshot of thread count chart

Thanks for testing. That thread count is a direct effect of the PR and something that’s already under discussion in there (under a slightly different context mostly, but still). It’s a ton of this stuff:

github.com/syncthing/syncthing/lib/model.(*folder).doInSync(0xc00dfa0000, 0xc01229e5d0, 0xc0087b7901, 0xc0087b7a70)
        /home/cfernandez/cd-dev/syncthing/lib/model/folder.go:240 +0x10b
github.com/syncthing/syncthing/lib/model.(*folder).Scan(0xc00dfa0000, 0xc0087b7a70, 0x1, 0x1, 0xffffffffffffffff, 0xc00894f930)
        /home/cfernandez/cd-dev/syncthing/lib/model/folder.go:229 +0xa2
github.com/syncthing/syncthing/lib/model.(*folder).ForceRescan(0xc00dfa0000, 0xc00886a620, 0x1a, 0x0, 0x5ea84c52, 0x4cce914736aec73b, 0xc00886a640, 0x2, 0x2, 0x47070b, ...)
        /home/cfernandez/cd-dev/syncthing/lib/model/folder.go:850 +0x16f
github.com/syncthing/syncthing/lib/model.(*model).recheckFile(0xc0001c8900, 0x971f543d33349438, 0x497e357db463d3b3, 0x2f4c7629dd086526, 0xc33c5653cb43318d, 0x1117fc0, 0xc000209cd0, 0xc01286fb20, 0xb, 0xc008272fc0, ...)
        /home/cfernandez/cd-dev/syncthing/lib/model/model.go:1639 +0x2db
created by github.com/syncthing/syncthing/lib/model.(*model).Request
        /home/cfernandez/cd-dev/syncthing/lib/model/model.go:1568 +0x1a55
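For anyone triaging a dump like this, a quick way to see which goroutine states dominate it (the file name is just an example):

```shell
# Each goroutine in a Go stack dump starts with a header line of the form
# "goroutine N [state]:". Tallying the bracketed states shows at a glance
# what most goroutines are blocked on.
awk '/^goroutine [0-9]+ \[/ {print $3}' stacktraces.txt | sort | uniq -c | sort -rn
```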

Thanks to all for your help on this. I installed the same build I compiled (the one making the file recheck async) on another pair of servers and it seems to behave properly there as well. I’ll hold off on doing any other changes until at least the next release candidate, though I’m glad that the issue is now addressed.

I’ve updated the original PR with a new implementation that should keep the number of routines lower, and not block on unnecessary things.


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.