Bit Rot Prevention Scheme

I believe Syncthing could protect against bit rot. Consider the following scheme, which could be used as a pre-processing step for any synchronization tool.

On each device, store a record of the timestamp and checksum for each file. This information can be used to detect bit rot. Consider the following example.

Suppose device A has a corrupted file. Device A could verify that the file is corrupt by checking it against its records (the timestamp would be the same but the checksum would be different). To restore the file, we would then just retrieve it from device B.
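A minimal sketch in Go (since Syncthing itself is written in Go) of what such a pre-processing step could look like. The record file name, its JSON layout, and the use of SHA-256 are my own illustrative choices, not anything Syncthing does today:

```go
// bitrotcheck walks a tree, compares each file against a stored
// timestamp+checksum record, and flags files whose content changed while
// the modification time did not (the "same timestamp, different checksum"
// case described above).
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"time"
)

type record struct {
	ModTime time.Time `json:"modTime"`
	SHA256  string    `json:"sha256"`
}

func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// checkTree updates the records and returns the paths that look corrupted.
func checkTree(root string, records map[string]record) ([]string, error) {
	var suspect []string
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		sum, err := hashFile(path)
		if err != nil {
			return err
		}
		old, seen := records[path]
		switch {
		case seen && old.ModTime.Equal(info.ModTime()) && old.SHA256 != sum:
			// Same timestamp, different checksum: likely bit rot.
			suspect = append(suspect, path)
		default:
			// New file or intentional change: (re)record it.
			records[path] = record{ModTime: info.ModTime(), SHA256: sum}
		}
		return nil
	})
	return suspect, err
}

func main() {
	records := map[string]record{}
	if data, err := os.ReadFile("bitrot-records.json"); err == nil {
		_ = json.Unmarshal(data, &records)
	}
	suspect, err := checkTree(".", records)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	for _, p := range suspect {
		fmt.Println("possible bit rot:", p) // candidate for restoring from device B
	}
	if data, err := json.MarshalIndent(records, "", "  "); err == nil {
		_ = os.WriteFile("bitrot-records.json", data, 0o644)
	}
}
```

Anything printed as "possible bit rot" would then be a candidate for re-fetching from another device rather than syncing outwards.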

Obviously, this scheme would take longer to execute, but it wouldn't need to run as frequently, especially since most synchronization tools don't synchronize files with the same timestamp, i.e., the corrupted file wouldn't be propagated unless it was changed before the next scan for bit rot.

How do we determine the difference between bit rot and intentional change?

Also, if you are concerned with bit rot, why not use tools that are meant to check for bit rot?

The fact that you can do something doesn't mean you should do it.

We could provide tunneling capabilities over the sync/relay etc. protocols, but I don't think we should.

The timestamp.

The reason we should is that detecting bit rot and fixing bit rot are two different things. Combining a bit rot detection scheme with a synchronization scheme allows us to replace the rotten files with one of the "good" redundant copies in the network.

There are unfortunately lots of cases where the contents change but not the timestamp, enough so that we've had to expend a fair amount of code to detect this on the fly: memory-mapped files, low-precision timestamps, metadata editors, secure containers, etc.

That's not to say we couldn't have a mode where you say "I'm certain the data shouldn't have changed; check it all and undo all differences". We kind of already have that in receive-only mode, it's just the scan part that's lacking. If another tool did that and deleted the bad files, Syncthing would redownload them after someone clicked "revert".


Hmm, that does seem to be an unfortunate limitation. Overall though, bit rot would be a rare occurrence; if we were able to limit the scope of this tool to files we can confidently tell are corrupt, e.g. static files like a raw photo, then we would still be ahead.

One possible mitigation would be to only raise a bit rot error if the timestamp has been recorded into our records and synced across multiple devices (we can't do anything if we don't have a backup anyway); say, only check for bit rot errors after 24 hours. There may be other timestamp limitations that I'm unaware of, but I imagine this would address a number of them.

Example:

  • Assumptions for this example system:
    • All files are synchronized within 24 hours.
    • Our system checks for bit rot errors only on files whose timestamp is older than 24 hours.

Then if file01 gets corrupted at hour 25 (or later) we would be able to catch this error.

This system obviously wouldn't catch bit rot errors that occurred close to the file's timestamp, but it should still handle the lion's share of errors.
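In terms of the earlier sketch, the 24-hour rule would just be one extra guard before a mismatch is reported; the grace window is a value I'm assuming would be tunable:

```go
// eligibleForBitRotCheck reports whether a file is old enough that it
// should already have been synchronized and recorded on other devices,
// so a checksum mismatch can be treated as corruption rather than an
// edit that simply hasn't been scanned yet.
func eligibleForBitRotCheck(info os.FileInfo, grace time.Duration) bool {
	return time.Since(info.ModTime()) > grace
}

// In checkTree, skip anything younger than the window:
//   if !eligibleForBitRotCheck(info, 24*time.Hour) { return nil }
```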

There is another feature request to do a full always-rehash scan, to catch modifications when none of the metadata has changed. If we did that, it could be used for bit rot detection by combining it with a receive-only folder mode. So that's where I'd suggest starting.


It would be awesome if Syncthing could claim some element of bit rot protection. Let me know if I should participate in a GitHub discussion or something.

When bit rot occurs and there is only one source file, Reed-Solomon coding can be used to repair the rotted bits. Backblaze and MinIO use this to recover from rot.
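For illustration, a rough Go sketch of that kind of single-copy repair; the use of the github.com/klauspost/reedsolomon library and the 10-data/3-parity layout are my own assumptions, and nothing like this exists in Syncthing today:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 10 data shards + 3 parity shards: any 3 shards can be lost or rotten.
	enc, err := reedsolomon.New(10, 3)
	if err != nil {
		log.Fatal(err)
	}

	original := bytes.Repeat([]byte("archival data that never changes "), 1000)

	// Split the content into data shards and compute the parity shards.
	shards, err := enc.Split(original)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate rot in shard 4; marking it missing lets it be rebuilt.
	shards[4] = nil

	// Rebuild the damaged shard from the surviving data and parity shards.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	if ok, err := enc.Verify(shards); !ok || err != nil {
		log.Fatal("repair failed:", err)
	}

	// Reassemble and confirm the original bytes are back.
	var out bytes.Buffer
	if err := enc.Join(&out, shards, len(original)); err != nil {
		log.Fatal(err)
	}
	fmt.Println("restored:", bytes.Equal(out.Bytes(), original))
}
```

The parity shards have to be stored alongside the data (about 30% overhead in this layout), which is the trade-off those object stores accept.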

I also think this would be a cool feature, especially since many of the items I've synced are finished projects (photos, movies, music, software packages, disk images, etc.) that rarely change. Due to CPU usage, I'd prefer a user-settable interval, and I'd probably choose to re-check every 2-3 months (perhaps up to 6 months, depending on the exact content).

I would also agree that it would be best per-folder, so that you could re-check important files more often, as well as not having to suddenly re-hash several TB of data at once.

I would propose the ability to mark a folder as an "archive". If a folder has been marked as archive:

  • New files can be added to the folder, and they will be synced to other devices without prompting.
  • If a file is modified or deleted, Syncthing will prompt the user for approval before syncing the change to other devices.

This would prevent bit rot from propagating to backups, as well as prevent things like unintentional deletions or ransomware encryption from propagating. No need to do any fancy bit rot detection stuff; just allow the user to express intent that files in a given directory are not supposed to change.

Replying here as it's the newer thread, but note the quote from the other similar post. That quote is identical to my dilemma; as I understand it, they've described the problem well. My server has a robust filesystem (btrfs 3-copy ~RAID in my case) and a backup scheme, but the devices that require full write access (the laptops of the group, and for some shares the phones) simply don't support that filesystem-level redundancy & repair.

It's not clear to me how to address this in the general case, if some devices with write access to a share do not have the same level of robustness. User interaction / hoops seem required, given that timestamps are not always meaningful.

A few thoughts I've had. I'd agree with shifting the burden to the user rather than Syncthing if at all possible, but some involvement from Syncthing does seem unavoidable:

  • A [new] third folder type: create/delete only. All files are read-only, allowing a checksum failure to always be interpreted as "not modified, therefore corrupt". This works close to perfectly for large use cases such as media backup. On the rare occasions with edits, it's simple enough to create a new file and delete the "modified" one.
  • File-level read-only. The user has to mark a file writable. I do not propose "checkout", just changing attributes to make it writable globally, as a synced value. Sure, checkout might be nice, but it would be a much more complicated feature, I would think? This change couples nicely with my variation of "archive".
  • Files in a share become "archive" (== read-only) after some time window. If Syncthing used the file's read-only attribute to ascertain that any checksum failure is corruption, then this could actually be a batch job on the server (chmod'ing old files to read-only, see the sketch after this list) and Syncthing's responsibility would be limited?
    • (Linux respects this, AFAICT; IIRC Windows has ways for the user to ignore it? I'd be fine with that limitation.)
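A minimal sketch of that server-side batch job in Go; the archive path and the 30-day window are made-up values:

```go
// Clear the write bits on any file whose modification time is older than
// the window, so later content changes can be interpreted as corruption.
package main

import (
	"io/fs"
	"log"
	"os"
	"path/filepath"
	"time"
)

func main() {
	const archiveRoot = "/srv/media" // assumed archive location
	cutoff := time.Now().Add(-30 * 24 * time.Hour)

	err := filepath.WalkDir(archiveRoot, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if info.ModTime().Before(cutoff) && info.Mode().Perm()&0o222 != 0 {
			return os.Chmod(path, info.Mode().Perm()&^0o222)
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```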

As I consider the risk to older but critical data, I'm growing very concerned about this :frowning:

A good backup scheme is definitely the best line of defense against bit rot, so you're already ahead in that regard… :grinning:

A blog post from Backblaze says it all: The 2022 Backup Survey: 54% Report Data Loss With Only 10% Backing Up Daily

Many who do have backups rarely, if ever, verify that their backups are actually any good. While a backup archive is being written, there are all kinds of things that can result in a corrupted archive, including flaky RAM, flaky storage, and even a flaky network card.

Here's what I do:

  • A 3-2-1-1-0 backup scheme: 3 copies, 2 different types of storage media, 1 off-site copy, 1 offline/air-gapped copy, 0 detectable errors when testing backup archives.
  • For critical data that needs to be retained long term but isn't being updated anymore, and where the file format doesn't already include built-in error detection, I'll generate SHA256 hashes (see the sketch after this list).
  • Like you, I also use BTRFS. It's my preferred filesystem for my NAS, external HDDs/SSDs, and even SD cards (with a quality 128GB SD card costing less than $20, it's very convenient to leave one plugged in on a laptop for local backups).
  • On active storage, I use a combination of btrfs scrub and smartd to help decide when it's time to rotate out an HDD/SSD (to minimize e-waste, I repurpose "retired" but still functioning storage for offline archives).
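For the hash generation mentioned above, a small Go sketch (the manifest name and paths are arbitrary) that writes `<hash>  <path>` lines in the format `sha256sum -c` understands, so verifying years later needs no custom tooling:

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

func main() {
	out, err := os.Create("MANIFEST.sha256")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	w := bufio.NewWriter(out)
	defer w.Flush()

	// Hash every regular file under the current directory.
	err = filepath.WalkDir(".", func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || path == "MANIFEST.sha256" {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		h := sha256.New()
		if _, err := io.Copy(h, f); err != nil {
			return err
		}
		_, err = fmt.Fprintf(w, "%s  %s\n", hex.EncodeToString(h.Sum(nil)), path)
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Verification later is then just `sha256sum -c MANIFEST.sha256`.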

While I use Syncthing to get data to my NAS, a different program handles backups.


I'm honestly not really sure what the role of Syncthing would be here.

I care a lot about preventing bit rot, and I do that by having hardware that can help (ECC memory is essential here) and storage that ensures integrity (ZFS; I don't know modern btrfs but I'll assume it's equivalent for the sake of argument).

With that, there will be no permanent changes to data on disk due to bit flips on the SSD or whatever, because they will get detected and fixed on read or scrub. (If they even happened to begin with, as there's ECC and such at the disk layer too, and CRCs on the SATA/NVMe protocols, etc…) There should be no random/intermittent bit errors in the data after it's been read (and hash-verified by ZFS), thanks to ECC RAM.

On top of this, of course, Syncthing already hashes the read data and rejects incorrect blocks. Reading data that differs from the recorded hash will trigger a re-hash of the file though, as we assume it changed without the timestamp etc. updating. That would be bad if you had permanent on-disk bit rot, but with a proper filesystem you won't have that, so… It's not like Syncthing periodically re-hashes all files to actively push out random bit flips to other devices…

It seems to me that Syncthing is the wrong layer entirely for mitigations.


Git-annex sounds like a better choice for detecting bit-rot and managing archival storage.

Hello,

I've been a happy user of Syncthing for some years. I recently got some time for private projects and dug into the topic of bit rot. I'm familiar with the ideas of ZFS and BTRFS. I have a QNAP NAS at home with 4 disks in RAID6 and regular scrubbing. I also had a look into cloud filesystems, which likewise have redundancy and techniques for scrubbing the data to detect and fix bit rot. I also came across the topic of ECC RAM.

I stumbled across the "garage" project.

And I like their dedication to building a software solution that works without the need for special hardware. Directly on their entry page (you need to scroll down), one can read the hardware requirements:

Keeping requirements low: Build a cluster with whatever second-hand machines are available

And at the same time they claim to be

Highly resilient to network failures, network latency, disk failures, sysadmin failures

Of course, this project is not directly comparable to Syncthing, but I can see a few common things:

  • Distributed, connected devices that automatically exchange static/rarely changing data, which is protected from bit rot and accidents by redundancy. Of course, the "data protection" part is only halfway true for Syncthing, and Syncthing targets a different set of use cases, but nevertheless I was wondering if a "simple" bit rot protection could be done with Syncthing as well. I stumbled across this discussion and understood that the biggest issue with including it in Syncthing is the problem of detecting bit rot.

Or, to be more precise: differentiating whether a change was made intentionally or caused by bit rot.

But what if this could be solved externally? Let's just assume for a moment that this would be possible with external tool support. Then Syncthing would need a small API that could be used to report file, directory or metadata corruption, and Syncthing could simply restore the corrupted parts by downloading them from other nodes. If I wanted to implement this, where would I need to start within Syncthing?
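Before any new API exists, an external detector could arguably already approximate this against a receive-only folder, along the lines of the "delete the bad file and revert" idea mentioned earlier in the thread. A rough Go sketch against Syncthing's REST API (I'm assuming the /rest/db/scan and /rest/db/revert endpoints here; the folder ID, file path and API key are placeholders):

```go
package main

import (
	"log"
	"net/http"
	"net/url"
	"os"
)

const (
	apiBase  = "http://127.0.0.1:8384" // default GUI/API address
	apiKey   = "YOUR-API-KEY"          // from the GUI settings
	folderID = "my-archive"            // placeholder folder ID
)

func post(path string, query url.Values) error {
	req, err := http.NewRequest(http.MethodPost, apiBase+path+"?"+query.Encode(), nil)
	if err != nil {
		return err
	}
	req.Header.Set("X-API-Key", apiKey)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	// Path of the corrupted file, relative to the folder root; the program
	// is assumed to run from that root so os.Remove resolves correctly.
	corrupted := "photos/2019/img_0042.raw"

	// Remove the rotten local copy. In a receive-only folder this shows up
	// as a local change instead of being propagated to other devices.
	if err := os.Remove(corrupted); err != nil {
		log.Fatal(err)
	}

	// Rescan just the affected path...
	if err := post("/rest/db/scan", url.Values{"folder": {folderID}, "sub": {corrupted}}); err != nil {
		log.Fatal(err)
	}
	// ...then revert local changes so the good copy is fetched again.
	if err := post("/rest/db/revert", url.Values{"folder": {folderID}}); err != nil {
		log.Fatal(err)
	}
}
```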

I have an idea for how to get the external detection running even on limited devices like a smartphone. It involves using the help of the filesystem directly. Sadly, I could not find anything in this direction so far, so one would need to make a small but important extension to existing filesystem implementations. It should be a small change for any filesystem that already supports extended attributes (xattrs), and it seems to me that every major, modern filesystem supports them directly or indirectly. The idea is that Syncthing uses an xattr to additionally store the latest computed hash for a file within the filesystem itself. It uses a name for that xattr that tells the filesystem to delete it whenever an intentional change is made to the file via the filesystem API. This way, Syncthing can detect any intentional change by the absence of the otherwise existing xattr.
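To make this concrete, a small Go sketch of the checking side on Linux using golang.org/x/sys/unix; the xattr name is made up, and the crucial "xattr disappears on intentional writes" behaviour is exactly the filesystem/overlay extension proposed above, not something any mainstream filesystem does today:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

const attrName = "user.syncthing.content-hash" // made-up xattr name

func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// recordHash is what the scanner would do after hashing a file: stash the
// hash in the xattr that the (proposed) overlay clears on intentional writes.
func recordHash(path string) error {
	sum, err := hashFile(path)
	if err != nil {
		return err
	}
	return unix.Setxattr(path, attrName, []byte(sum), 0)
}

// classify is what a later check would conclude about the file.
func classify(path string) (string, error) {
	buf := make([]byte, 128)
	n, err := unix.Getxattr(path, attrName, buf)
	if err == unix.ENODATA { // xattr gone: cleared by an intentional write
		return "intentional change (or never recorded)", nil
	}
	if err != nil {
		return "", err
	}
	sum, err := hashFile(path)
	if err != nil {
		return "", err
	}
	if sum == string(buf[:n]) {
		return "unchanged", nil
	}
	// Content differs but the marker was never cleared: corruption suspected.
	return "corruption suspected", nil
}

func main() {
	if len(os.Args) != 3 {
		log.Fatal("usage: xattr-bitrot record|check <file>")
	}
	path := os.Args[2]
	switch os.Args[1] {
	case "record":
		if err := recordHash(path); err != nil {
			log.Fatal(err)
		}
	case "check":
		verdict, err := classify(path)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(path+":", verdict)
	}
}
```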

To me, this seems to be a rather small change to any existing filesystem implementation. The performance overhead should also not be noticeable when done correctly.

One could start with a simple decorator or overlay filesystem that passes through all requests to the mounted real filesystem and additionally clears the relevant xattrs when any modification is made.

Of course, getting this running on a regular smartphone without root access would take some time, because of the filesystem extension needed. But hey, I could imagine that other tools that watch the filesystem for changes could also make use of this. Initially we would use the decorator/overlay filesystem; eventually, we would use the filesystem directly if it supports it.

OK, this actually seems to be quite an ambitious goal. But what about just the general idea of providing an API that allows flagging files, directories or metadata as corrupted and thus to be replaced as soon as possible? It would be a starting point for allowing any kind of external detector tool. What do you think?


I created a small proof of concept. It's not fully working yet, but it demonstrates the idea with the overlay FS and the automatic removal of the xattrs in case of an intended change. And it shows that an integration into the existing Syncthing code would actually not be so difficult. It even has another advantage: it doesn't need the inotify mechanism anymore, as all accesses and thus all modifications are handled directly in the FUSE interface implementation.

BTW: I know that there is a project "syncthingfuse" that could be interesting in this context as well, as it's also using a FUSE filesystem. But this project is outdated, probably because of too much maintenance effort due to a lot of duplicated code from the original Syncthing codebase. I would like to avoid this by implementing the changes directly in the Syncthing codebase.

Thanks to the person who liked my previous post. It would be cool if more people were interested…


Unfortunately I must agree with this statement. Bit rot detection (and recovery) is not a single apple to be peeled. Modern CoW filesystems, like the robust ZFS and btrfs (despite its errata), are a better piece of software at the right place in the software stack to detect and (hopefully) repair bit rot.

Some other co-existing tool could be written to create a known-good, cryptographically safe hash of every file, store the hash table, rescan once in a while and diff the hashes. But that is a userspace implementation, and hopefully not done in some crappy "better modern language" with 1 million dependencies.

Enterprise-grade systems and software do this for installed operating system integrity checking, to detect whether the OS was modified (by malware). And then you can lock things down.
