Bit Rot Prevention Scheme

I believe syncthing could protect against bit rot. Consider the following scheme, which could be used as a pre-processing step for any synchronization tool.

On each device, store a record of the timestamp and checksum for each file. This information will be used to detect bit rot. Consider the following example.

Suppose device A has a corrupted file. Then device A could verify that the file is corrupt by checking against its records (the timestamp would be the same but the checksum would be different). To restore the file, we would then just retrieve it from device B.
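A minimal sketch of that per-file check in Go; the Record type, file names, and where the records are stored are all hypothetical, not any existing tool’s API:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"time"
)

// Record is the per-file metadata persisted between scans (hypothetical).
type Record struct {
	ModTime  time.Time
	Checksum string
}

// checkFile compares a file against its stored record. If the modification
// time is unchanged but the checksum differs, we assume the file has rotted
// rather than been edited.
func checkFile(path string, rec Record) (rotted bool, err error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	if !info.ModTime().Equal(rec.ModTime) {
		return false, nil // intentional change; update the record instead
	}
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return false, err
	}
	return hex.EncodeToString(h.Sum(nil)) != rec.Checksum, nil
}

func main() {
	// In practice rec would be loaded from the locally stored records.
	rec := Record{ModTime: time.Now().Add(-48 * time.Hour), Checksum: "0000"}
	rotted, err := checkFile("photo.raw", rec)
	if err != nil {
		panic(err)
	}
	if rotted {
		fmt.Println("photo.raw appears corrupted; restore it from device B")
	}
}
```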

Obviously, this scheme would take longer to execute, but it wouldn’t need to run as frequently, especially since most synchronization tools don’t synchronize files with the same timestamp, i.e., the corrupted file wouldn’t be propagated unless the file was changed before the next bit rot scan.

How do we determine the difference between bit rot and intentional change?

Also, if you are concerned with bit rot, why not use tools that are meant to check for bit rot?

The fact that you can do something doesn’t mean you should do it.

We could provide tunneling capabilities over the sync/relay etc protocols, but I don’t think we should.

The timestamp.

The reason why we should is that detecting bit rot and fixing bit rot are two different things. Combining a bit rot detection scheme with a synchronization scheme allows us to replace the rotten files with one of the “good” redundant copies in the network.

There are unfortunately lots of cases where the contents change but not the timestamp, enough so that we’ve had to spend a fair amount of code detecting this on the fly: memory-mapped files, low-precision timestamps, metadata editors, secure containers, etc.

That’s not to say we couldn’t have a mode where you say “I’m certain the data shouldn’t have changed; check it all and undo all differences”. We kind of already have that in receive-only mode; it’s just the scan part that’s lacking. If another tool did that and deleted the bad files, Syncthing would redownload them after someone clicked “revert”.


Hmm, that does seem to be an unfortunate limitation. Overall though, bit rot would be a rare occurrence; if we were able to limit the scope of this tool to files we know shouldn’t change, e.g., static files like a raw photo, then we would still be ahead.

One possible mitigation would be to only raise a bit rot error if the timestamp has been recorded into our records and synced across multiple devices (we can’t do anything if we don’t have a backup anyway), say, only checking for bit rot errors after 24 hours. There may be other timestamp limitations that I’m unaware of, but I imagine this would address a number of them.

Example:

  • Assumption for this example system:
    • All files are synchronized within 24 hours.
    • Our system checks for bit rot errors on files that have a timestamp older than 24 hours.

Then if file01 gets corrupted at hour 25 (or later) we would be able to catch this error.

This system obviously wouldn’t catch bit rot errors that occurred close to the file’s timestamp, but it should still handle the lion’s share of errors.
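A minimal sketch of that age gate, assuming the 24-hour window from the example (the helper and file names here are hypothetical):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// minAge is the grace period before a file becomes eligible for bit rot
// checks, giving the normal sync cycle time to replicate it first.
const minAge = 24 * time.Hour

// eligible reports whether a file is old enough to be checked for bit rot.
// Files modified within the last minAge are skipped, since a mismatch there
// is more likely to be an intentional edit that hasn't finished syncing.
func eligible(path string, now time.Time) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	return now.Sub(info.ModTime()) > minAge, nil
}

func main() {
	ok, err := eligible("file01", time.Now())
	if err != nil {
		panic(err)
	}
	fmt.Println("eligible for bit rot check:", ok)
}
```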

There is another feature request to do a full always-rehash scan, to catch modifications when none of the metadata has changed. If we did that, it could be used for bitrot detection by combining it with a receive-only folder mode. So that’s where I’d suggest starting.


It would be awesome if syncthing could claim some element of bit rot protection. Let me know if I should participate in a GitHub discussion or something.

When bit rot occurs and there is only one source file, Reed-Solomon coding can be used to repair the rotted bits. Backblaze and MinIO use this to recover from rot.
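For illustration, a minimal sketch using the github.com/klauspost/reedsolomon Go library (the erasure coding library MinIO builds on, as far as I know); the 4 + 2 shard layout and the sample data are arbitrary:

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 4 data shards + 2 parity shards: up to 2 lost (or known-bad) shards
	// can be reconstructed from the remaining ones.
	enc, err := reedsolomon.New(4, 2)
	if err != nil {
		panic(err)
	}

	data := []byte("the file contents we want to protect against bit rot")
	shards, err := enc.Split(data)
	if err != nil {
		panic(err)
	}
	if err := enc.Encode(shards); err != nil {
		panic(err)
	}

	// Simulate rot: a shard detected as bad is dropped (set to nil).
	shards[1] = nil

	// Rebuild the missing shard from the survivors.
	if err := enc.Reconstruct(shards); err != nil {
		panic(err)
	}
	ok, _ := enc.Verify(shards)
	fmt.Println("all shards verify after reconstruction:", ok)

	// Reassemble the original data.
	var buf bytes.Buffer
	if err := enc.Join(&buf, shards, len(data)); err != nil {
		panic(err)
	}
	fmt.Println("recovered:", buf.String())
}
```

The catch is that the parity data has to exist before any rot happens, so this protects a single copy on its own rather than relying on the redundant copies other devices already hold.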

I also think this would be a cool feature, especially since many of the items I’ve synced are finished projects - photos, movies, music, software packages, disk images, etc. - that rarely change. Due to CPU usage, I’d prefer a user-settable interval, and I’d probably choose to re-check every 2-3 months (probably up to 6 months, depending on the exact content).

I would also agree that it would be best per-folder, so that you could re-check important files more often, as well as not having to suddenly re-hash several TB of data at once.

I would propose the ability to mark a folder as an “archive”. If a folder has been marked as archive:

  • New files can be added to the folder and they will be synced to other devices without prompting
  • If a file is modified or deleted, syncthing will prompt the user for approval before syncing changes to other devices.

This would prevent bit rot from propagating to backups, as well as preventing things like unintentional deletions or ransomware encryptions from propagating. No need to do any fancy bit rot detection stuff, just allow the user to express intent that files in a given directory are not supposed to change.
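A rough sketch of the decision logic such an “archive” flag might imply during a scan; the flag, types, and approval step are all hypothetical, nothing like this exists in Syncthing today:

```go
package main

import "fmt"

// Change classifies what the scanner found for a file (hypothetical).
type Change int

const (
	Added Change = iota
	Modified
	Deleted
)

// FolderCfg is a stand-in for a folder's configuration.
type FolderCfg struct {
	Archive bool // proposed "archive" flag
}

// handleChange decides whether a detected change is announced to other
// devices immediately or held until the user approves it.
func handleChange(cfg FolderCfg, file string, ch Change) string {
	if cfg.Archive && (ch == Modified || ch == Deleted) {
		return fmt.Sprintf("%s: change held, awaiting user approval", file)
	}
	return fmt.Sprintf("%s: change synced to other devices", file)
}

func main() {
	cfg := FolderCfg{Archive: true}
	fmt.Println(handleChange(cfg, "photo.raw", Added))    // new file: synced
	fmt.Println(handleChange(cfg, "photo.raw", Modified)) // modification: held
}
```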

Replying here as it’s the newer thread, but note the quote^ from the other similar post. This quote is identical to my dilemma; as I understand it, they’ve described the problem well. My server has a robust file system (btrfs 3-copy ~raid in my case) and backup scheme; the devices that require full write access (the laptops of the group, and for some shares the phones) simply don’t support the filesystem-level redundancy & repair.

It’s not clear to me how to address this, in the general case, if some devices with write access to a share do not have the same level of robustness. User interaction / hoops seem required given how timestamps are not always meaningful.

A few thoughts I’ve had. I’d agree with shifting the burden to the user over syncthing if at all possible, but it does seem unavoidable for syncthing to be involved in some way:

  • a [new] third folder type: create/delete only. All files are read-only, allowing a checksum failure to always be interpreted as “not modified, therefore corrupt”. This works close to perfectly for large use cases such as media backup. On the rare occasions with edits, it’s simple enough to create new files and delete the “modified” ones.
  • file-level read-only. The user has to mark a file as writable. I’m not proposing a “checkout” workflow, just changing the attribute to make it writable globally, as a synced value. Sure, checkout might be nice, but it would be a much more complicated feature, I would think? This change couples nicely with my variation of “archive”
  • files in a share become “archive” (== read-only) after some time window. If syncthing used the file’s read-only attribute to ascertain that any checksum failure is corruption, then this could actually be a batch job on the server (chmod’ing old files to read-only; see the sketch after this list) and syncthing’s responsibility is limited?
    • (Linux respects this, AFAICT - IIRC Windows has ways for the user to ignore it? I’d be fine with that limitation)
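That batch job could be as simple as the following stand-alone sketch, not part of Syncthing; the share path and 30-day window are just placeholders:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// markOldFilesReadOnly walks root and clears the write bits on regular files
// that haven't been modified within the given window.
func markOldFilesReadOnly(root string, window time.Duration) error {
	cutoff := time.Now().Add(-window)
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if info.ModTime().Before(cutoff) {
			// Remove all write permission bits, keep the rest of the mode.
			return os.Chmod(path, info.Mode().Perm()&^0o222)
		}
		return nil
	})
}

func main() {
	if err := markOldFilesReadOnly("/srv/share", 30*24*time.Hour); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```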

As I consider risk to older but critical data, I’m growing very concerned about this :frowning:

A good backup scheme is definitely the best line of defense against bit rot, so you’re already ahead in that regard… :grinning:

A blog post from Backblaze says it all: The 2022 Backup Survey: 54% Report Data Loss With Only 10% Backing Up Daily

Many who do have backups rarely ever verify that their backups are actually any good. While a backup archive is being written, there are all kinds of things that can result in a corrupted archive, including flaky RAM, flaky storage, and even a flaky network card.

Here’s what I do:

  • A 3-2-1-1-0 backup scheme: 3 copies, 2 different types of storage media, 1 off-site copy, 1 offline/air-gapped copy, 0 detectable errors when testing backup archives.
  • For critical data that needs to be retained long term but isn’t being updated anymore, and where the file format doesn’t already include built-in error detection, I’ll generate SHA256 hashes (see the sketch after this list).
  • Like you, I also use BTRFS. It’s my preferred filesystem for my NAS, external HDD/SSD, and even SD cards (with a quality 128GB SD card costing less than $20, it’s very convenient to leave one plugged in on a laptop for local backups).
  • On active storage, I use a combination of btrfs scrub and smartd to help decide when it’s time to rotate out a HDD/SSD (to minimize e-waste, I repurpose “retired” but still functioning storage for offline archives).
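A small sketch of how such a hash manifest can be generated; it writes sha256sum-compatible lines, and the “archive” directory is just an example path:

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// writeManifest hashes every regular file under root and writes "HASH  PATH"
// lines, so the result can later be verified with `sha256sum -c`.
func writeManifest(root string, out io.Writer) error {
	w := bufio.NewWriter(out)
	defer w.Flush()
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		h := sha256.New()
		if _, err := io.Copy(h, f); err != nil {
			return err
		}
		_, err = fmt.Fprintf(w, "%x  %s\n", h.Sum(nil), path)
		return err
	})
}

func main() {
	if err := writeManifest("archive", os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```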

While I use Syncthing to get data to my NAS, a different program handles backups.


I’m honestly not really sure what the role of Syncthing would be here.

I care a lot about preventing bit rot, and I do that by having hardware that can help (ECC memory is essential here) and storage that ensures integrity (ZFS; I don’t know modern btrfs but I’ll assume it’s equivalent for the sake of argument).

With that, there will be no permanent changes to data on disk due to bit flips on the SSD or whatever, because it will get detected and fixed on read or scrub. (If they even happened to begin with, as there’s ECC and stuff at the disk layer, too, and CRCs on the SATA/NVME protocol, etc…) There should be no random/intermittent bit errors on the data after it’s been read (and hash-verified by ZFS) because ECC.

On top of this, of course, Syncthing already hashes the read data and rejects incorrect blocks. Reading data that’s different from the recorded hash will trigger a re-hash of the file though, as we assume it’s changed without the timestamp etc. updating. That would be bad if you had permanent on-disk bitrot, but with a proper filesystem you won’t have that, so… It’s not like Syncthing periodically re-hashes all files to actively push out any random bit flips to other devices…

It seems to me that Syncthing is the wrong layer entirely for mitigations.


Git-annex sounds like a better choice for detecting bit-rot and managing archival storage.