I am wondering how data corruption is handled by Syncthing.
I have two systems, both on TrueNAS SCALE. So that means a ZFS filesystem is being used.
One system has mirrored drives, which means ZFS automatically detects and corrects corrupt data. The other system, however, has only a single drive, which means ZFS can detect corrupt data but is unable to correct it.
Now, here comes Syncthing. What happens if data becomes corrupt on the system with a single drive?
Does it:
a. Find out from ZFS that the data is corrupt and request a copy from another node?
b. Detect that the change to the data is corruption, and prevent it from being synced back to the other node?
c. Not detect any change in the data, and just act like nothing is new?
d. Something else?
Thanks. I am hoping to figure out how these things work to protect the integrity of my data over long periods of time. I know ZFS will do this with mirrored drives, but wanted to check with you if/how using Syncthing might impact this.
Well, the first thing to know is that a checksumming filesystem like ZFS will never “hand out” data it knows is corrupt. Any application trying to read a data block that fails checksumming within ZFS (with ZFS unable to correct it) will get an I/O error, so syncthing will not be able to read the file (which in turn will result in syncthing logging this as an error).
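Not syncthing’s actual code, but a minimal Go sketch of what that failure path looks like from an application’s point of view (the path and error handling are just illustrative): the read itself returns an error, and the application can only log it, since the filesystem never hands over the corrupt bytes.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// readAll reads a file the way any application would. If the filesystem
// (e.g. ZFS) refuses to hand out a block that fails its checksum, the
// read returns an I/O error; the corrupt bytes are never delivered.
func readAll(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	data, err := io.ReadAll(f)
	if err != nil {
		// This is where an uncorrectable ZFS checksum failure surfaces,
		// typically as an EIO from the read syscall.
		return nil, fmt.Errorf("reading %s: %w", path, err)
	}
	return data, nil
}

func main() {
	if _, err := readAll("/tank/data/somefile"); err != nil {
		// An application like syncthing can only log the error and move on.
		fmt.Fprintln(os.Stderr, "scan error:", err)
	}
}
```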
If we assume that syncthing is fed with incorrect data nevertheless, the outcome depends on the specific scenario.
For example, if we just assume that silent “bit rot” happens (within a file’s content), then nothing will happen at first. On a realistic note, the drive’s internal sector verification will detect this (all modern hard and flash drives checksum their sectors), but even if it didn’t, ZFS would detect it. If ZFS also doesn’t detect it, then… nothing happens (at first). It’s not like syncthing reads your entire dataset every minute. If no metadata has changed, syncthing will not know that the data has changed at all. The OS also cannot possibly send filesystem notifications for an event that is effectively invisible.
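To illustrate why “nothing happens at first”, here is a rough Go sketch (not syncthing’s real scanner; the record type and field names are invented): a scanner typically compares cheap metadata like size and mtime against what it recorded last time, and only re-reads and re-hashes content when those differ. Silent bit rot changes neither, so the file is skipped.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// fileRecord is a stand-in for what a sync tool might remember about a
// file from its last scan (hypothetical struct, not syncthing's schema).
type fileRecord struct {
	Size    int64
	ModTime time.Time
}

// needsRehash mimics the cheap check a scanner does before the expensive
// work of re-reading and re-hashing file content.
func needsRehash(path string, prev fileRecord) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	// Silent bit rot changes neither size nor mtime, so this check passes
	// and the corrupted content is never read again.
	unchanged := info.Size() == prev.Size && info.ModTime().Equal(prev.ModTime)
	return !unchanged, nil
}

func main() {
	prev := fileRecord{Size: 1024, ModTime: time.Now().Add(-time.Hour)}
	rehash, err := needsRehash("/tank/data/somefile", prev)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("needs rehash:", rehash)
}
```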
If some other syncthing node now requests the corrupted data, the next point of detection is syncthing’s own checksumming: the database has recorded the hashes of all files, and the receiver will expect the file (technically its blocks) to match that checksum. If it doesn’t match, the receiver will refuse to accept the corrupt data, and the sync will not happen. This doesn’t correct the error on the source*, but it won’t propagate further.
*AFAIK syncthing has no logic to detect this scenario and “re-download” existing data from elsewhere because of this checksum error.
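As a hedged Go sketch of the receiver-side check described above (syncthing does use SHA-256 block hashes, but the function and names here are illustrative, not its internals): the expected hash comes from the database, the received block is hashed, and a mismatch means the block is simply rejected.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
)

// verifyBlock shows the receiver-side check in miniature: the database
// already knows the expected hash for each block of the file, and any
// block pulled from another device must match it.
func verifyBlock(data []byte, expected [sha256.Size]byte) error {
	got := sha256.Sum256(data)
	if !bytes.Equal(got[:], expected[:]) {
		return errors.New("block hash mismatch: refusing corrupt data")
	}
	return nil
}

func main() {
	want := sha256.Sum256([]byte("original block content"))

	// A corrupted block arriving from the source device fails the check
	// and is simply not written; the corruption does not propagate.
	if err := verifyBlock([]byte("bit-rotted block content"), want); err != nil {
		fmt.Println(err)
	}
}
```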
IMHO this also isn’t really something to worry about much. I know that many people do, but simple bit-rot errors are already detected by your hardware and filesystem. The real risk is losing your entire dataset, because a sector error in a critical non-redundant metadata section is always deadly. Your hardware may also fail entirely, or a software error in some program may corrupt data in a way that the filesystem cannot recognize as an error. Nothing substitutes for backups, backups and more backups.
Ok, thanks for the thoughtful answer. I appreciate it. It looks like Syncthing’s use of file checksums gives an added layer of security as well, so that’s appreciated.
Interesting that when Syncthing detects checksum errors in its data, it does not request another copy.
Your larger perspective is also much appreciated, so thanks for going beyond what I asked.
This is effectively a side effect of how syncthing works. It’s a pull model, not a push model. If node A requests data from node B and that data fails checksums, that’s an error on node A (not B). All devices (nodes) in syncthing are essentially equal and independent, so an error that happens on A doesn’t concern B (in fact, B doesn’t even know about this error). Yet in our hypothetical scenario B is the “cause” of the error on A.
Syncthing would need additional logic such that either each syncthing node performs a live verification on all outgoing data (which is expensive performance-wise, and highly unlikely to ever be useful), or A would have to “report back” the error to B, which would then have to verify the data to reproduce A’s error. This is a lot of logic that simply isn’t there.
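Here’s a small, hypothetical Go sketch of that pull model (the peer interface and names are invented for illustration, not syncthing’s protocol API): the requesting node asks a peer for a block and verifies it locally, so a hash mismatch surfaces as an error on the puller only, and the sending node never hears about it.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// peer is a hypothetical interface for requesting a block from another
// device.
type peer interface {
	Name() string
	RequestBlock(file string, offset int64) ([]byte, error)
}

// pullBlock illustrates the pull model: the *requesting* side asks for
// data and verifies it locally. A hash mismatch becomes a local error on
// this node; the sending node is never told, because verification is
// entirely the puller's concern.
func pullBlock(p peer, file string, offset int64, want [sha256.Size]byte) ([]byte, error) {
	data, err := p.RequestBlock(file, offset)
	if err != nil {
		return nil, fmt.Errorf("request from %s failed: %w", p.Name(), err)
	}
	got := sha256.Sum256(data)
	if !bytes.Equal(got[:], want[:]) {
		// Error on the puller (node A); the source (node B) doesn't know.
		return nil, fmt.Errorf("block from %s failed verification", p.Name())
	}
	return data, nil
}

// fakePeer returns the same bytes for every request (test stand-in).
type fakePeer struct{ data []byte }

func (f fakePeer) Name() string                                { return "node-B" }
func (f fakePeer) RequestBlock(string, int64) ([]byte, error) { return f.data, nil }

func main() {
	want := sha256.Sum256([]byte("good data"))
	corruptedSource := fakePeer{data: []byte("bad data")}
	if _, err := pullBlock(corruptedSource, "somefile", 0, want); err != nil {
		fmt.Println(err) // the failure is logged here, on the pulling node
	}
}
```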
Yeah. And in the case of corruption + ZFS, we’ll never see data that doesn’t match the hash - we’ll get correct data, or we’ll get an I/O error.
Syncthing would have no way of knowing whether the checksum “difference” was due to a file system error or a changed file. Absent a “read error” thrown from the file system, syncthing would reasonably assume the file was changed (perhaps while syncthing wasn’t running), and a rescan would propagate the “new changes” to the rest of the peers.
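A short Go sketch of that ambiguity (names are illustrative, not syncthing internals): when a rescan actually re-reads a file, the only signals it gets are a read error or bytes that hash differently, and the latter is indistinguishable from a legitimate edit.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
)

// rescanFile shows the only two signals a scanner gets from the
// filesystem: a read error, or bytes. If the bytes hash differently from
// what the database recorded, that is indistinguishable from a normal
// edit and would be announced to peers as a new version.
func rescanFile(path string, prevHash [sha256.Size]byte) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		// A checksumming filesystem like ZFS surfaces corruption here.
		return "", fmt.Errorf("read error, file skipped: %w", err)
	}
	if sha256.Sum256(data) != prevHash {
		// Could be a real edit, could be silent corruption: the scanner
		// cannot tell, so it treats it as a change to propagate.
		return "changed: announcing new version to peers", nil
	}
	return "unchanged", nil
}

func main() {
	prev := sha256.Sum256([]byte("previous content"))
	result, err := rescanFile("/tank/data/somefile", prev)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(result)
}
```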
The file system really has to be responsible for the integrity of the data. Otherwise many things start to fall apart. Not only syncthing.
It would be interesting if syncthing, in the event of a read error, forced a re-download of the file and, if successful, deleted the file with the read error and moved the newly downloaded copy into its place.
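That behaviour doesn’t exist today, but as a purely hypothetical Go sketch of the suggestion (every function name here is invented): on a read error, pull a verified copy from a peer into a temporary file and atomically rename it over the broken one.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// fetchFromPeers is a stand-in for "download a verified copy of the file
// from another device"; purely hypothetical, no such helper exists.
func fetchFromPeers(name string) ([]byte, error) {
	return nil, fmt.Errorf("not implemented: would pull %q from a healthy peer", name)
}

// repairOnReadError sketches the suggested behaviour: if reading the local
// copy fails, fetch a verified replacement, write it next to the original,
// and atomically rename it into place.
func repairOnReadError(path string) error {
	if _, err := os.ReadFile(path); err == nil {
		return nil // local copy is readable, nothing to do
	}
	data, err := fetchFromPeers(filepath.Base(path))
	if err != nil {
		return err
	}
	tmp := path + ".repair.tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	// Rename replaces the unreadable file with the freshly downloaded copy.
	return os.Rename(tmp, path)
}

func main() {
	if err := repairOnReadError("/tank/data/somefile"); err != nil {
		fmt.Fprintln(os.Stderr, "repair failed:", err)
	}
}
```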