Folders stuck trying to sync existing files

Firstly, just to give more context, this is the affected Device A.

This is the other side, i.e. Device B.

Only 1 folder seemed to be affected by the issue. The folder itself is shared between multiple devices, but only the share between these two devices had the problem.

What is interesting is that the folder state was exactly the same on all devices.

However, the number of the stuck files was larger than the actual folder state.

This is interesting. The steps 1-5 did not help. However, after restarting A with -reset-deltas, the issue seemed to have resolved itself for a moment, but then this came up.

New changes to the folder got stuck again, but this time it was not only A, but also other devices that had these files stuck trying to push them to B. Device B still marked everything as “Up to Date”.

Now, I am not a fan of resetting deltas/database on B, as the hardware is slow and there are tons of folders, but there was no choice, so I did it. After that, everything got quite messy. While the problematic folder shared between B and A seemed to get fixed, B itself got stuck trying to send stuff to other devices with no progress. I am not sure what exactly the problem was, but I restarted Syncthing on B once again, and now it seems to be pushing indexes to the rest of the cluster slowly. It may take a few hours until everything stabilises.

I will report back again later once I can say for sure what the situation actually looks like.

Edit: The situation seems to have normalised, at least for now. I will write back if the problem re-occurs.

The problem has manifested itself yet again after “upgrading” Syncthing on 1 device. Not really upgrading, as I just switched from x86-32 to x86-64, but the binary has been replaced nevertheless.

It seems that after doing the upgrade, at least one of the other connected devices gets stuck in this state. The files themselves are old and have not changed at all. It is only the state that is broken, as the folders are in fact 100% in sync.

I really need to figure out a way to reproduce this in a clean environment…

As you have sendFullIndexOnUpgrade enabled, there’s some chance that you are affected by something fixed in pull request #7336, “lib/db: Fix and improve removing entries from global (ref #6501)” by imsodin (syncthing/syncthing on GitHub). That’s not certain at all though - no promises but a small sliver of hope :slight_smile:

That would be awesome indeed. I suspect that there are races involved, which makes it very hard to reproduce.

Do you mean that using this option could cause the issue? I am asking because I did not have it set before when the problem appeared for the first time. I have actually enabled it thinking that it may prevent this specific behaviour, but obviously it is not really working, so there must be more to it.

I am not saying that it does. It’s really important to stress that I still don’t know what causes these issues. It is not that I am uncertain about one or several possible causes; I don’t know of any cause. I just found a bug that is related to resetting indexes, which happens on upgrade with sendFullIndexOnUpgrade. That makes it possible that it causes the issue you see, but I don’t know of, e.g., a sequence of events that would trigger that bug and result in the issue you see. Without a reproducer all I can say is: let’s fix the bug and hope this issue won’t come up again afterwards. If it doesn’t, it was likely related; otherwise it wasn’t.

No, I understand that the bug may be unrelated :wink:. I just wanted to confirm and add the information that I had this problem both before and after enabling sendFullIndexOnUpgrade.

One thing that I am thinking about is that however hard I try, I cannot reproduce this in my test configuration. The problem is that the test config uses just 3 devices and 1 folder, while in real life I get these issues in a network of ~10 devices and tons of folders. I may need to add more folders and more devices to my test config in order to actually get something meaningful. The connection quality also differs a lot, as the test config runs on my local computer, while the real devices are located in different countries, use patchy networks, etc.

I would just like to give a quick update. I have now set up a network of 9 instances of Syncthing, all connected with each other and sharing the same folder. However, I am still unable to reproduce the issue. I have tried upgrading one instance at a time, and then all of them at the same time, but everything has always eventually stabilised with no errors.

I guess that I will have to wait for v1.14.0 and see what happens during the next upgrade in my real network :fearful:.

I am sorry for another bump, but yet one more device/folder has got stuck in the same way. The difference is that this time I did not upgrade/change the Syncthing binary on any of the involved devices. This would mean that the problem may have nothing to do with the upgrade process at all, but rather something else causes it.

One possibly important note in this case is that the device in question is used only sporadically, i.e. usually turned on every few days for just a few minutes to sync the files, and then turned off completely. Also, all the folders are set to “Receive Only”. There are no nested folders or other non-standard configurations involved.

This is how the situation looks in detail.

  1. Device A (mentioned above) - all “Up to Date”.

  2. Device B and Device C - both trying to push the same already synced files to Device A.

I have also queried the REST API, and here are the results.

rest-A.txt (1.0 KB) rest-B.txt (1.0 KB) rest-C.txt (1.0 KB)

The actual differences between the three are as follows, in the order of Device A, B, and C.

Is there anything in this information that could help find the actual culprit behind this behaviour?
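For anyone who wants to repeat this kind of comparison, here is a minimal Python sketch. It assumes the per-folder status is read from the REST endpoint /rest/db/status with the X-API-Key header; the base URLs, API keys and folder ID you pass in are placeholders, and the helper names (fetch_db_status, differing_fields) are made up for illustration. It is not how the attached files were produced, just one way to spot which fields differ between devices.

```python
import json
import urllib.parse
import urllib.request


def fetch_db_status(base_url, api_key, folder_id):
    """Fetch /rest/db/status for one folder from one Syncthing instance.

    base_url, api_key and folder_id are placeholders supplied by the caller,
    e.g. "http://127.0.0.1:8384" and the API key from that instance's GUI settings.
    """
    query = urllib.parse.urlencode({"folder": folder_id})
    request = urllib.request.Request(
        f"{base_url}/rest/db/status?{query}",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def differing_fields(statuses):
    """Return {field: {device: value}} for every field that is not identical
    on all devices. `statuses` maps a device label to its status dict."""
    all_fields = set()
    for status in statuses.values():
        all_fields.update(status)

    result = {}
    for field in sorted(all_fields):
        values = {device: status.get(field) for device, status in statuses.items()}
        # Serialise values so that nested lists/dicts can be compared too.
        distinct = {json.dumps(value, sort_keys=True) for value in values.values()}
        if len(distinct) > 1:
            result[field] = values
    return result
```

Calling differing_fields({"A": status_a, "B": status_b, "C": status_c}) with the three responses returns only the fields worth looking at; an empty result means the devices report identical folder state.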

I am new to Syncthing, so I have only been running v1.13.x, but I wonder if you have encountered the same issue as I did. Please try going into Remove Device on EACH computer, un-share the problem folder and Save. Then go back into Remove Device, re-share the problem folder and Save. This fixed the issue 100% for me, and I’ve had no further problems with files syncing since. I have had to do this for every new device I’ve set up, whether that is a Linux PC, a Windows PC or even an Android device.

As all three devices show the same output (apart from the differences pointed out, which are not relevant for the “syncing algorithm”), this is a case of device A not sending indexes, or of B and C consistently “losing” those indexes. The problem, as usual, is that the key information is how we ended up in this state, for which there are few if any pointers after the fact. In any case, don’t feel sorry for repeated reports; they’re definitely valuable, especially with the description of the circumstances. Maybe with time a pattern emerges or some hint triggers an idea that leads to the solution. I assume a delta index reset on device A will get rid of the issue for now.

I am thinking about enabling some debugging options and simply running with them until the problem occurs again. Is STTRACE=db,model enough, or should I add something else to the mix?

That’s enough. db is the main one where I see the chance of getting valuable info. I am sure you don’t need the reminder, but for the potential benefit of someone else reading this: using these debug tracing settings in production is not something to do in general, as it will produce a ton of logging (easily GBs).

Yes, I have enabled STTRACE=db,model on one device only, for now. This particular device has very little traffic, so the log files should not be that large. However, this device has also experienced this issue multiple times so far, so I am quite certain that it will do it again, probably quite soon.
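As a rough illustration of how the tracing can be switched on for a single run without touching the system-wide environment, here is a small Python sketch. STTRACE and the facility names db and model come from this thread; the helper name tracing_env is made up, and the sketch assumes a syncthing binary on the PATH.

```python
import os


def tracing_env(facilities, base_env=None):
    """Return a copy of the environment with STTRACE set to the given
    comma-separated debug facilities, e.g. ["db", "model"]."""
    env = dict(os.environ if base_env is None else base_env)
    env["STTRACE"] = ",".join(facilities)
    return env
```

It could then be passed to the process launcher, e.g. subprocess.run(["syncthing"], env=tracing_env(["db", "model"])), so that only this one instance produces the verbose logs.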

Just a quick update. I have now upgraded all devices to v1.14.0 (self-compiled). I did have to reset the databases, etc., but right now everything is 100%* in sync.

I have also additionally enabled STTRACE=db,model,versioner on my main machine, which has tons of folders, so the log files will be massive, but I am running Syncthing with

syncthing -logfile=default -logflags=3 -log-max-old-files=9

so the logs should take only ~100 MB maximum. Hopefully, I will be able to spot the issue before the older logfiles are deleted. I’m now going to keep observing the situation and come back with some debug logs if the files get stuck in sync again.
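The ~100 MB figure can be checked with a bit of arithmetic. The sketch below assumes that the per-file size limit is left at 10 MiB (the -log-max-size option is not set in the command above, so that value is an assumption here) and that the current log file counts in addition to the 9 rotated ones. The function name max_log_bytes is made up for illustration.

```python
def max_log_bytes(max_old_files, max_size_bytes=10 * 1024 * 1024):
    """Upper bound on disk usage of rotated logs: the old files plus the
    one currently being written, each capped at max_size_bytes.

    max_size_bytes defaults to 10 MiB, which is an assumption about the
    per-file limit, not something stated in the thread.
    """
    return (max_old_files + 1) * max_size_bytes


# 9 old files + 1 current file, 10 MiB each
total = max_log_bytes(9)
print(total / (1024 * 1024), "MiB")
```

With a busy machine and STTRACE enabled, that budget may only cover a short time window, so it is worth copying the logs away as soon as the problem shows up.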


* Except for https://forum.syncthing.net/t/folder-stuck-in-sync-and-non-matching-local-and-global-states/16431, but this is a non-issue really.

I regularly get these errors (or something similar) but I’ve stopped cleaning the database as it was affecting folder / devices that were ok and causing vast amounts of disk thrashing when restarting Syncthing. Now, I unshare the folders from the affected device, let the db clear out the data until it shows ‘unshared’ and then re link it.

Most of the time the local ‘device’ says the remote device is wrong, and usually it’s a remote folder that has sync conflicts, but any remaining files that are still wrong are probably caused by file / directory renames and those are the hardest to clear without the db scrubbing the data.

Everything was working fine for more than a week, but I have encountered the problem again today, and I’m almost sure that this is somehow related to having nested folders. I had actually tried to get rid of all nested folders some time ago, but one was still left in place. Today, I decided to get rid of the last one too.

The situation initially looked somewhat like this.

Device A
  Folder1
    Folder2
Device B
  Folder1

Then, I paused Folder1 on Device B, and shared Folder2 from Device A to Device B. I accepted the folder on Device B and let it sync (or rather index the existing files). Then, while still paused, I removed Folder1 from Device B. After that, I noticed that Device B was stuck trying to sync the files from Folder2 to Device A.

The final state with the files being stuck looked like this.

Device A
  Folder2
Device B
  Folder2

Unfortunately, I didn’t have any debug logs enabled on Device B to provide more information, but I have a feeling that I may not get these errors anymore now, once the nested folders are completely gone.
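For anyone wanting to check their own setup for nested folders like the Folder1/Folder2 case above, here is a minimal Python sketch. It takes a mapping of folder IDs to paths that you collect yourself (the paths in the usage note are hypothetical) and reports every pair where one folder lives inside another. The function name nested_pairs is made up, and the sketch only compares path strings; it does not read the Syncthing configuration or resolve symlinks.

```python
from pathlib import PurePosixPath


def nested_pairs(folders):
    """Return a sorted list of (outer_id, inner_id) pairs where the path of
    inner_id lies somewhere below the path of outer_id.

    `folders` maps a folder ID to its path on one device.
    """
    pairs = []
    for outer_id, outer_path in folders.items():
        for inner_id, inner_path in folders.items():
            if outer_id == inner_id:
                continue
            # Compare whole path components, so /data/Folder1 is not
            # treated as a parent of /data/Folder10.
            if PurePosixPath(outer_path) in PurePosixPath(inner_path).parents:
                pairs.append((outer_id, inner_id))
    return sorted(pairs)
```

For the initial layout on Device A, nested_pairs({"Folder1": "/data/Folder1", "Folder2": "/data/Folder1/Folder2"}) reports Folder2 as nested inside Folder1, while the final layout with two sibling folders gives an empty list.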

I believe that I can safely say now that the problem is completely gone. The issue was likely caused by having nested folders shared between multiple devices. After restructuring my configuration and getting rid of all nested folders, there are no more files stuck in sync.

I’m still going to do some testing sometime in the future in order to try to find the actual culprit, as I do have a few ideas in mind about what could cause this, but for now I think that we can call the issue solved (or at least worked around…).
