A couple of days ago I discovered corrupted files within Syncthing-replicated folders. Some plain-text files had their contents replaced with seemingly random binary gibberish. Replication was of course disabled immediately.
My setup is best understood as a central fileserver acting as a hub, with spokes going out to all “clients”. I do understand that this is not exactly how things work, but the mental model has value. The central server has mirrored disks with enough terabytes to be configured to keep plenty of copies of older file versions, while the other devices have no .stversions/ folders at all. It is an environment with mixed operating systems, generally running the latest packaged version of Syncthing, which likely was not the cause of the files getting trashed.
Running find . | grep '.stversions/.*~20250414' | wc -l tells me there are likely fewer than 86 corrupted files. A manageable amount to resolve, but before starting I ask: has anyone seen, or does anyone have, scripts or tools that help recover a batch of backed-up files from .stversions/? Maybe that question is pretty much answered in my next section.
I have a strong theory about why my files got corrupted, but am not fully sure. Thus it would be interesting, at least in theory, to have a way to determine the source of the corrupted files. The Data corruption thread claims it should be possible to discover the origin of data corruption using some API. Should this be understood as the modifiedBy property returned by a GET on the /rest/db/file endpoint?
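For reference, this is roughly the lookup I have in mind; a minimal sketch which assumes the API is reachable on the default 127.0.0.1:8384 address with an API key, and that modifiedBy is present in the JSON returned for the file:
# Sketch only: the address, the $API_KEY variable, the folder id and the
# crude grep extraction are assumptions, not something I have verified.
curl --silent -H "X-API-Key: $API_KEY" \
    "http://127.0.0.1:8384/rest/db/file?folder=default&file=path/to/file.txt" |
    grep -o '"modifiedBy": *"[^"]*"'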
As the database does contain the origin of files, would it be feasible and/or desirable to add the first seven characters of the device identifier to their filenames under .stversions/, similar to how conflicting files get named? If not as core functionality, then at least through a custom versioning script, which I believe currently cannot do so?
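To make the idea concrete, here is a minimal sketch of what I picture such a custom (external) versioning script could look like. The %FOLDER_PATH% and %FILE_PATH% placeholders are, as far as I understand, what the external versioner substitutes into the configured command; the origin device id is not passed to the script today, which is exactly the gap, so ORIGIN below is just a hard-coded stand-in:
#!/bin/sh -eu
# Hypothetical external versioning script; would be configured in Syncthing
# as something like:  /path/to/version-with-origin.sh %FOLDER_PATH% %FILE_PATH%
# It moves the file being replaced into .stversions/ with a timestamp suffix.
# Note: the built-in versioners insert the ~tag before the file extension;
# this sketch simply appends it.
_folder_path="$1"   # absolute path of the Syncthing folder
_file_path="$2"     # path of the file, relative to the folder root
ORIGIN='unknown'    # would ideally be the first seven chars of the origin device id
_stamp="$( date +%Y%m%d-%H%M%S )"
_dest="$_folder_path/.stversions/$_file_path.$ORIGIN~$_stamp"
mkdir -p "$( dirname "$_dest" )"
mv "$_folder_path/$_file_path" "$_dest"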
Have you tried using the GUI? You can filter versions by date, and there is also a search box to look for specific filenames. You can restore files in bulk too.
Thanks for your answer and suggestion. But fiddling with files using something as opaque as a browser GUI seems like a recipe for data loss in a situation like this. I prefer to understand what is going on.
I ended up with this little script:
#!/bin/sh -eu
#
# Takes a syncthing folder id as its first argument, and then a list of backup
# files from under `.stversions/`. Queries Syncthing for information about the
# origin of the current file version and asks the user whether to restore that
# file or not.
#
# Assumes Syncthing is configured to expose its API on a Unix domain socket,
# among other things. Do read and understand the code before even considering
# running it. It's short after all. Less than 50 lines!
_folder="$1"; shift
_gui="$( grep 'gui enabled' -A 2 "$HOME/.config/syncthing/config.xml" )"
_sock="$( echo "$_gui" | sed -n 's/.*address>\(.*\)<.address>/\1/p' )"
_api_key="$( echo "$_gui" | sed -n 's/.*apikey>\(.*\)<.apikey>/\1/p' )"
unset _gui
n=1
for _backup in "$@"; do
clear
_file="$( echo "$_backup" | sed 's/~[0-9]\{8\}-[0-9]\{6\}//' )"
_output="$( curl --silent -X GET -H "X-API-Key: $_api_key" \
--unix-socket "$_sock" "http://_/rest/db/file?folder=$_folder" \
--url-query "file=$_file" |
sed -n 's/.*modifiedBy.*"\(.*\)",/\1 /p'; file "$_file" )"
_answer=''
(
echo "$_output" | tr -d '\n';
echo '';
cat ".stversions/$_backup"
) | less
[ -e ".stversions/$_backup" ] || continue
printf '[%d/%d] Restore %s from %s (y/N) ? ' "$n" "$#" "$_file" "$_backup"
read -r _answer
case "${_answer:-}" in
'y' | 'Y')
echo "Restoring $_file"
mv ".stversions/$_backup" "$_file"
# The next line did not exist when I ran this, but it should have.
# touch "$_file"
;;
*)
echo "Okay, keeping it"
;;
esac
unset _answer
n=$(( n + 1 ))
done
unset _backup _file _folder _output
When running it, I got the impression that everything worked. However, for some reason Syncthing isn’t picking up the changes, not even after fully killing it, restarting, and waiting a couple of hours after rescanning completed. It seems I need to explicitly touch every restored file. The moment I run touch, Syncthing picks the file up and replicates it, but until then the API endpoint keeps returning the outdated timestamp for the modified property rather than the correct value.
This sure seems like a bug to me. I could understand if rename() calls are for some reason tricky to detect, but the full scan after a restart should surely detect the change. Right?
This morning I realized that there is a /rest/folder/versions endpoint which returns a huge list of files when called with GET (it is not documented to take any selection parameters), or can be used to restore file(s) with POST. Would using this API have any advantage over simply doing mv …~YYYYMMDD-HHMMSS …? Would that force-trigger detection of the file change even if the scanner, or whatever it is called, fails to do so?
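For my own notes, the two calls I am contemplating would look roughly like this, reusing the variable names from my script above. This is a sketch: I have only skimmed the documentation and have not verified the exact shape of the POST body (I assume a JSON object mapping file paths to the version timestamp returned by the GET call):
# List every archived version known for the folder (expect a lot of output).
curl --silent -H "X-API-Key: $_api_key" --unix-socket "$_sock" \
    "http://_/rest/folder/versions?folder=$_folder"

# Restore one file to one specific version (assumed body format, see above).
curl --silent -X POST -H "X-API-Key: $_api_key" --unix-socket "$_sock" \
    -H 'Content-Type: application/json' \
    --data '{"some/file.txt": "2025-04-14T10:11:12+02:00"}' \
    "http://_/rest/folder/versions?folder=$_folder"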
Thanks for clarifying how scanning works. I was indeed under the misconception that all files were rehashed every rescanIntervalS. I read that “The only way to force a rehash of everything would be to reset the database”, which definitely does not seem like something I would wish to do.
My corrupted files came from fsck, which means the mentioned metadata properties appear to have remained constant. Still, the corrupted files were detected as new and got replicated. However, after restoring, they do not get picked up, and I fail to understand why. It can’t be explained by inotify, as Syncthing was not running at the time fsck was run. If I may speculate wildly, could changed inodes perhaps be a factor?
Both the laptop which corrupted the files and the server which fails to detect the restored ones run syncthing 1.19.2~ds1-1+b4, which admittedly is quite old even if it is the “latest packaged” version in the Debian world.
Understanding that this is not considered a bug that my debugging of the current state could help with, I’ll update the mtime of all restored files in a few hours.
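The plan is something along these lines; a sketch that assumes the restored paths were collected in a file, one per line relative to the folder root. My script above does not actually write such a list, so restored.txt is a hypothetical input I will have to reconstruct:
# Bump mtime on every restored file so the scanner notices the change.
# restored.txt is a hypothetical list of restored paths, one per line.
while IFS= read -r _f; do
    [ -e "$_f" ] && touch "$_f"
done < restored.txt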
I would say the key issue with restoring fsck-corrupted files really boils down to:
My level of understanding is far from sound, but the feeling I get from reading the code is that even files restored using the API would fail to replicate, right?
From what I can see, walkRegular() returns nil without sending any hash request when the file is considered unchanged. Most relevant here is the call to mustRescan(). This is in the code path of the scanner.
When restoring a file, the exact code path traversed depends on the versioning strategy used, but it essentially comes down to restoreFile() being called. This function seemingly just restores the contents and metadata, relying on the scanner to pick up the change, something which will not happen if fsck merely corrupted the contents and not the metadata. Isn’t there some code missing in RestoreFolderVersions() if Syncthing is fully serious about being Safe From Data Loss? Some kind of side-channel message to set FlagLocalMustRescan for the restored file? Possibly just an added runner.ScheduleForceRescan(file) inside the loop?
Unless I’m mistaken, the current behaviour would suffice for cases where users, like me, don’t trust unknown software and do the restoration themselves. However, since an API and a web user interface are provided, that functionality should preferably be rock-solid.
Am I onto something or completely wrong? Would my understanding be close enough that a patch as outlined in this post would be appreciated, or would another way of fixing the bug be preferred?
Could this thread be moved into the development category by a forum administrator? I don’t really know what is possible with Discourse, but that seems like a better category in light of what has been discovered.
For there to be a versioned file to begin with, Syncthing must have noticed a difference and synced a file. Hence I think the versioned file will by definition differ somehow from the base file.
Yes, correct. Please see the first sentence of my initial post in this thread.
However, for some reason that does not lead to the following being true:
Please see how my restoration has still not been detected after I ran mv ".stversions/$_backup" "$_file", which to the best of my understanding is equivalent to the operation Syncthing does.
This thread is about what has actually happened to me, not about a hypothetical case.
Why the change got detected in the first place is the biggest mystery to me. Maybe it’s a subtle difference in how the filesystems treat something? The laptop runs ext2 (which obviously ought to have been ext3 or ext4, but that’s beside the point); the server uses btrfs. I would very much like to find out why the data corruption started to spread.
Another imaginable theory is that the laptop was multiple versions behind and thus skipped a few when updating, but no: out of the 86 changes of that day, 45 were corrupted files, and some of those had not otherwise seen changes for over a year.
Regardless of whether you think an explicit call to ScheduleForceRescan() is required, I’d say my experience proves that it likely is, at least in some poorly understood but real corner case.
An interesting question to ask is: would defensive coding hurt here? The main difference is that during the next scan an early boolean comparison would schedule a rehash rather than waiting for the struct comparison, right? That said, my understanding is that only tests forcefully trigger files for rescan right now, so maybe that functionality just isn’t considered production grade?
So Syncthing restores mtime back to the value of the backup, while mv preserves that same value. Or, to be more precise, restoreFile() uses a thin wrapper to call os.Chtimes() with the same sourceMtime value that the original file had, a value corresponding to struct timespec st_mtim, which is documented not to be updated by rename().
That sure seems functionally identical to me.
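A quick way to convince oneself that a plain rename leaves st_mtim alone (GNU stat shown; the flags differ on BSDs):
printf 'x\n' > demo-a
stat -c '%n %Y' demo-a    # note the mtime (seconds since the epoch)
sleep 2
mv demo-a demo-b          # a rename(2) under the hood
stat -c '%n %Y' demo-b    # same mtime as before the rename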
We can probably disregard trashcan and external for this thread, as they work very differently and would never be bitten by this bug.
That bit corresponds to the FlagLocalMustRescan flag, right? Did I misunderstand more than runner.ScheduleForceRescan() being the way to set it?