Huge issue with Syncthing data scrambling

m1cha · August 18, 2019, 9:50am

Hello everyone, I have a huge issue with Syncthing and data scrambling.

A brief overview of the systems involved in this issue:

PC, Manjaro, Syncthing 1.2.0
Home Server, Arch Linux, Syncthing 1.2.0
Notebook, Arch Linux, Syncthing 1.2.1 (!)
Several other devices (Android phones and Linux laptops) that I am pretty sure have nothing to do with this issue

On August 15, 2019 I did a full system update including Syncthing 1.2.0 -> 1.2.1 on my notebook. Context: I was not at home so there was no connectivity to the other devices.

After a reboot I noticed 100% CPU load for some time but ignored it, as this sometimes happens after a Syncthing update while it is rebuilding its index.

When the CPU load did not drop after quite some time, I checked the Syncthing web interface. It showed “Scanning” on the first folders, and all folders below showed the state “Unknown”. The systemd journal did not show anything I could identify as abnormal at first glance.

I stopped Syncthing as I had other things to do at that moment. The next day when I started my notebook again there was 100% CPU load. I stopped Syncthing again as I was at a customer site and had no time for debugging. When I came back home, I started my notebook and simply let it run for an hour. After that hour it was still running at 100% CPU load and the web interface still showed “Scanning” on the first folders and “Unknown” on the others. This definitely did not look normal.

I checked the folders on my notebook and at first glance they looked OK. But when I checked the folders on my home PC and my home server I discovered something really bad. All Syncthing folders had lots and lots of directories and files from all other Syncthing directories scrambled into them. Files and directories that had definitely been present in only one Syncthing directory before now appeared in almost all other directories as well.

I immediately stopped Syncthing on my notebook. The synchronization craze stopped and all systems came to a stable state (apart from the “Out of Sync” messages on many folders…).

This is the current state. On my other computers and phones the synchronized folders are all scrambled up badly, on my notebook only a couple of directories and files seem to have been scrambled.

I am pretty sure my notebook with Syncthing 1.2.1 is the source of the issue. I have never had issues with Syncthing before, and certainly none as bad as this.

Fortunately I have backups of most of the data so I will eventually be fine, but at the moment this is a huge issue as the synchronized data is on the scale of terabytes across about 30 directories. Also, I do not know how to proceed with my notebook. Will the craze continue when I update my other devices? Has anyone heard of such an issue before? What is the cause? How can I debug it? How can I avoid this issue in the future?

Maybe someone has an idea on any of the questions.

imsodin · August 18, 2019, 10:35am

First let me state this explicitly to avoid any misunderstandings: I am genuinely interested to find the cause. While I am currently thinking (and hoping) that there’s a problem with your setup, not with Syncthing in general, I still want to know what happened, because if it was Syncthing’s problem, that’s obviously very bad and needs fixing asap.

What did that “full system update” entail? Any changes to disks/mountpoints/…

The “Unknown” bit might be a known deadlock problem, so it might be unrelated.

Possibly, as the incorrect information has already propagated from the notebook. After restoring the data from backup, the safest thing would be to delete the databases, but that would obviously mean lots of hashing. Another approach is, after restoring, to set a known good peer (restored from backup) to send-only and bring up the other devices one by one. In between bringing them up, let things settle and use the override button on the known good device if necessary (this will override any “bad state” with the known good state).
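
For a send-only folder, the override can also be requested through Syncthing's REST API (`POST /rest/db/override`) instead of the GUI button. The following is a minimal sketch, assuming the default GUI address `http://127.0.0.1:8384`. The folder ID and API key are placeholders, and `build_override_request` is a hypothetical helper name. It only builds the request and does not send it.

```python
import urllib.parse
import urllib.request


def build_override_request(folder_id, api_key, base_url="http://127.0.0.1:8384"):
    """Build (but do not send) a POST to /rest/db/override for one folder.

    base_url is an assumption (the default GUI address); adjust as needed.
    """
    query = urllib.parse.urlencode({"folder": folder_id})
    return urllib.request.Request(
        f"{base_url}/rest/db/override?{query}",
        method="POST",
        headers={"X-API-Key": api_key},
    )


if __name__ == "__main__":
    # Placeholder values, not taken from the thread.
    req = build_override_request("example-folder-id", "YOUR-API-KEY")
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req)  # only on the known good, send-only device
```

Sending the request is left commented out on purpose, since an override on the wrong device would push the bad state instead of the good one.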

Not that I am aware of and I’d like to know too. The answer to the last question obviously depends on the former.

As to how to debug:

Save full logs since the upgrade on all devices. Make copies of the config files and database (best to just copy ~/.config/syncthing). If your paths aren’t confidential, you can remove the device IDs and share the config files here. Knowing your setup would remove a lot of guesswork and make support a lot easier. Same for logs.
If you have regular backups, check whether the config changed during the upgrade on the notebook.
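
Removing device IDs by hand from a large config or log is error-prone, so here is a minimal sketch of one way to do it. It assumes device IDs appear in their full form, eight groups of seven base32 characters separated by hyphens. Shortened IDs, as they may appear in logs, are not covered. `redact_device_ids` is a hypothetical helper, not a Syncthing tool.

```python
import re

# Assumption: full device IDs look like 8 hyphen-separated groups of
# 7 base32 characters (A-Z, 2-7).
DEVICE_ID = re.compile(r"\b[A-Z2-7]{7}(?:-[A-Z2-7]{7}){7}\b")


def redact_device_ids(text):
    """Replace each distinct device ID with a stable placeholder.

    The same ID always maps to the same placeholder (DEVICE-1, DEVICE-2, ...),
    so the relationships between devices stay readable after redaction.
    """
    seen = {}

    def repl(match):
        return seen.setdefault(match.group(0), f"DEVICE-{len(seen) + 1}")

    return DEVICE_ID.sub(repl, text)
```

Running this over a copy of config.xml (never the live file) keeps folder paths and sharing relationships intact while hiding the IDs themselves.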

As to pure speculation:
This sounds like a problem with nested folders (folder as in Syncthing share), i.e. somehow the filesystem path structure or the configured paths in Syncthing on the notebook changed during the upgrade. I am not saying that’s the case, but pay close attention to this please.
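
One way to check for nesting is to compare the folder paths in config.xml against each other. The sketch below assumes that folders are `<folder>` elements with `id` and `path` attributes and that the paths are absolute POSIX paths (a leading `~` is not expanded). `find_nested_folders` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def find_nested_folders(config_xml):
    """Return (outer_id, inner_id) pairs where one folder path contains another."""
    root = ET.fromstring(config_xml)
    folders = [
        (f.get("id"), PurePosixPath(f.get("path")))
        for f in root.iter("folder")
        if f.get("path")  # skip folder elements without a usable path
    ]
    nested = []
    for outer_id, outer in folders:
        for inner_id, inner in folders:
            # "outer in inner.parents" is true only for strict ancestors.
            if outer_id != inner_id and outer in inner.parents:
                nested.append((outer_id, inner_id))
    return nested
```

Running it on the config from before and after the upgrade, if both are available, would show whether the set of nested pairs changed.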

m1cha · August 18, 2019, 12:38pm

Thank you very much for your time. I am also hoping this is an issue with my setup but I am still failing to find the reason as I am not aware of any changes besides the update. But let me respond to your questions in detail:

Regarding mount points: the only ones affecting my home directory, where all my Syncthing directories reside, belong to the 2 drives (SSD and HDD) with full disk encryption that are unlocked at boot → /dev/mapper

/dev/mapper/fast on / type btrfs (rw,relatime,ssd,discard,space_cache,subvolid=5,subvol=/)
/dev/mapper/large on /home/micha/data/large type btrfs (rw,nosuid,nodev,relatime,space_cache,subvolid=5,subvol=/,x-gvfs-show)

Let’s first get the config file out of the way. Here is an anonymized copy of the config.xml on my notebook:

config-notebook.xml (32.2 KB)

I just checked the logs again and found that there is 74 MB of log data since the update, with lots of error messages in between. I still have to anonymize the logs, but meanwhile I found interesting log messages in between the log entries I saw before.

After a couple of reboots version 1.2.1 gives me first some “Completed initial scan of …” and some “… folder anonymized-dir has mismatching index ID for us …” log entries and then this:

syncthing-journal-errors.txt (95.8 KB)

Sadly this is not an option as I do not back up all files, especially not in the home directory.

As you can see I have a nested Syncthing folder. My home directory is a Syncthing folder and all other Syncthing folders reside in subdirectories of it. But the Syncthing folder for the home directory has been paused on my devices for months since Syncthing does not behave nicely with my setup. I have an exclude file in the home directory which looks like this:

!/.stinclude
#include .stinclude
*

This excludes everything except for the files and directories listed in the .stinclude file. The .stinclude file looks like this:

!/.aliases
!/.zshrc
!/Templates
!/.nanorc
!/go
!/.config/autokey/data

Basically this works fine except for the fact that Syncthing still scans ALL directories in the home directory, notifies e.g. about missing permissions in specific subdirectories, and uses lots of system resources to do so. This is why I paused the home directory sync months ago on all systems. This could be a feature request but is nothing which has to be discussed here.
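
To make the intent of the two files concrete, here is a deliberately simplified sketch of how they combine. It assumes that the first matching pattern decides, that a leading `!` means "do not ignore", and that a leading `/` anchors a pattern at the folder root. Only rooted literal patterns and the catch-all `*` are handled, so this is not a reimplementation of Syncthing's matcher. `expand` and `is_ignored` are hypothetical helpers.

```python
STIGNORE = ["!/.stinclude", "#include .stinclude", "*"]
STINCLUDE = [
    "!/.aliases", "!/.zshrc", "!/Templates",
    "!/.nanorc", "!/go", "!/.config/autokey/data",
]


def expand(lines, includes):
    """Inline '#include NAME' directives using a name -> lines mapping."""
    out = []
    for line in lines:
        if line.startswith("#include "):
            out.extend(expand(includes[line.split(None, 1)[1]], includes))
        elif line.strip():
            out.append(line.strip())
    return out


def is_ignored(path, patterns):
    """Assumed semantics: first match wins; '!' negates; '/' roots the pattern."""
    for pat in patterns:
        negated = pat.startswith("!")
        body = pat[1:] if negated else pat
        if body == "*":
            matched = True
        elif body.startswith("/"):
            root = body[1:]
            matched = path == root or path.startswith(root + "/")
        else:
            continue  # other pattern forms are outside this sketch
        if matched:
            return not negated
    return False


if __name__ == "__main__":
    patterns = expand(STIGNORE, {".stinclude": STINCLUDE})
    for p in [".zshrc", ".config/autokey/data/x.json", ".config", "Downloads/a.iso"]:
        print(p, "ignored" if is_ignored(p, patterns) else "synced")
```

In this model the parent `.config` is itself caught by `*`, yet `.config/autokey/data` below it must still be reached. That may be related to why every directory gets scanned, though this is an assumption and not something confirmed in the thread.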

AudriusButkevicius · August 18, 2019, 3:51pm

Strangely this is not the first time we are hearing about this so it seems there is some bug somewhere.

imsodin · August 20, 2019, 8:04am

The log entries seem fine: After upgrade we drop the delta index information, to initiate a full exchange of metadata.

The latter is a crash while writing to the database, caused by:

Aug 15 22:10:28 micha-notebook syncthing[769]: panic: open /home/micha/.config/syncthing/index-v0.14.0.db/519980.ldb: too many open files

That might be due to db size (fixed in pull request #5967 in syncthing/syncthing, “lib/db: Use different defaults for larger databases (fixes #5966)” by calmh) or something entirely different. The trace is truncated at the end - the full one might contain relevant info.
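
To get a rough feel for whether the database size could plausibly hit the open-file limit, one can compare the number of `.ldb` table files with the limit. This is a minimal sketch with hypothetical helper names. Note that it reports the limit of the Python process running it, which may differ from the limit the Syncthing service runs under (for example when started by systemd).

```python
import resource
from pathlib import Path


def count_table_files(db_dir):
    """Count LevelDB table files (*.ldb) directly inside db_dir."""
    return sum(1 for p in Path(db_dir).glob("*.ldb") if p.is_file())


def open_file_limits():
    """Return the (soft, hard) RLIMIT_NOFILE of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    # Database path as it appears in the panic message above.
    db = Path.home() / ".config/syncthing/index-v0.14.0.db"
    soft, hard = open_file_limits()
    print(f"{count_table_files(db)} .ldb files, soft limit {soft}, hard limit {hard}")
```

A file count close to or above the soft limit would be consistent with the db-size explanation, but it would not rule out other causes.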

It is definitely bad (crashing in the middle of scanning), but I’d expect no problem (it just didn’t commit the information to db) or db corruption (i.e. failure to start next time around).

That setup should be unproblematic and even if it weren’t, the folder is paused.

Can you detect any pattern there?

I am currently entirely out of ideas of what might have happened. For Syncthing to scan and then actually transfer files, the folder(s) must have looked at different data than they should have (probably the parent directory, from your description). That would imply they got a wrong filesystem path, either internally or externally (the latter doesn’t seem to be the case), and that’s where my ideas end.

AudriusButkevicius · August 20, 2019, 11:30am

This should have resulted in deletions too tho, did that happen?

I guess as described above, a pattern for what got merged to what would be useful.
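
One possible way to look for such a pattern is to count, for each pair of Syncthing folders, how many files are new in one folder and already existed under the same relative path in the backup of the other. This is a minimal sketch with hypothetical helper names. It matches on relative paths only, so files that landed at a different depth would need a comparison by file name or content hash instead.

```python
import os
from collections import Counter


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def merge_matrix(backup, current):
    """Count files per (source, destination) folder pair.

    backup and current map a folder label to its set of relative paths.
    A file counts for (source, destination) if it is new in destination
    and existed under the same relative path in the backup of source.
    """
    counts = Counter()
    for dest, now in current.items():
        added = now - backup.get(dest, set())
        for source, before in backup.items():
            if source != dest:
                shared = len(added & before)
                if shared:
                    counts[(source, dest)] += shared
    return counts
```

Building both mappings with `list_files` for each folder in the mounted backup and in the current state, and then printing `merge_matrix(...).most_common()`, would show which folders leaked into which.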

m1cha · September 14, 2019, 10:57pm

Hi, thank you again for your support. I put off the task of analyzing the data because of travels and work but I would still like to provide you with the requested information. So here we go:

You’re right. Here is the complete stack trace: syncthing-journal-errors2.txt (249.1 KB)

I mounted the backup and diffed it against the state after the partial synchronization. (Not everything will be consistent, because I interrupted the synchronization after realizing my data was being corrupted.)

This is what I found:

I was not able to spot a pattern.
There were directories and files from almost every other Syncthing directory in every Syncthing directory.
There were lots of duplicate files, and not only specific files.
Some files were deleted, but not nearly as many as were copied (maybe because I interrupted the synchronization process).
Some of the deleted files appeared in other Syncthing directories.

I am still on hold with the current state if anyone would like to inspect it more closely. If it helps I can offer to do a remote debug session with one of you.

I am still putting off the task of restoring the hundreds of gigabytes of data and merging them with the changes that have occurred since then, because I did not immediately take the time to resolve the problem…

imsodin · September 16, 2019, 6:40am

The stacktrace looks sane, so the cause of the panic seems to really be with the database.

Please make your database (privately) available (e.g. upload it somewhere and send the link per PM). Same for logs, regardless of how huge they might be, before and after the panic.
