File Integrity Verification

This feature request arose from this thread:

The idea is that the user has the option to manually trigger a checksum-based file verification process, in case they suspect something went wrong with the local data.

After pressing the new “Verify all files” button in the UI, Syncthing would:

  • warn the user that any changes to files during that process may be lost
  • hash every file and check it against the checksum in the index
  • re-download all files where there is a checksum mismatch

Read-only nodes could execute this function periodically via an API call in a cron job, which would then also report broken and fixed files. This could also mitigate bit rot on filesystems without built-in integrity checking.
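
For illustration, a verification pass along those lines could look roughly like the sketch below. It assumes a hypothetical checksum index mapping relative paths to expected SHA-256 digests; Syncthing’s real index stores per-block hashes in its internal database, so this is only a model of the idea, not its actual data layout.

```python
# Rough sketch of the proposed verification pass. The checksums.json index
# (path -> expected SHA-256 digest) is a made-up stand-in for Syncthing's
# internal block-hash database.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_folder(root: Path, index_file: Path) -> list:
    """Return relative paths whose current hash differs from the index."""
    expected = json.loads(index_file.read_text())
    mismatches = []
    for rel_path, digest in expected.items():
        file_path = root / rel_path
        if not file_path.is_file() or sha256_of(file_path) != digest:
            mismatches.append(rel_path)  # candidates for re-download
    return mismatches

if __name__ == "__main__":
    for bad in verify_folder(Path("/data/shared"), Path("checksums.json")):
        print("mismatch, would re-download:", bad)
```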


Issue #1315 on GitHub already discusses this, and the idea was dismissed with two main points:

  • (A) This is something the filesystem should handle.
  • (B) There is no way to detect if changes were made by the user and therefore should be synced, or if the data should be corrected.

Also, Audrius Butkevicius’ comment from the thread where this came up:

I don’t think verify feature makes sense. If the file changed we’d know (mtime and size changes), if it hasn’t changed, what’s the point of verifying? What are we trying to catch here? Bad drives? That’s not really syncthing’s problem.


I would like to comment on the two points made on GitHub, as well as on Audrius’ comment.

First, the easy one:
(B) Because this would be a manual feature, the user has to trigger it and can be reminded not to change any files during the process.

Now the tricky one:
(A) Yes, you are right. Theoretically, the filesystem is responsible for the integrity of the files. There are really great filesystems out there that can totally handle all these problems.

But let’s look at this from a practical perspective for a moment. For most users, Syncthing is a kind of DIY thing. They don’t trust cloud providers and their proprietary software enough to push all their data to some cloud. Or they are tech-savvy people who use Syncthing in their personal home network. Or just office colleagues who want to sync data.

All of these people have no idea about the fancy filesystems that would solve integrity problems. Those are way too complex and time-consuming to set up and maintain.

My hypothesis is that the majority of Syncthing users do not use self-healing filesystems, and therefore the argument “that this is the filesystem’s job” does not hold true in practice.

I’d like to suggest that the maintainers evaluate this feature carefully and give it due consideration. I urge you to decide what is best for the project and listen to the community - and that may also mean denying this feature request. I am not trying to push you into this. I am just contributing an idea.


Unless I am mistaken, that can already be done; it’s just not a single API call:

  1. Reset the database for the folder in question (https://docs.syncthing.net/rest/system-reset-post.html)
  2. Reset the folder to global state with the revert endpoint (https://docs.syncthing.net/rest/db-revert-post.html or button in GUI)

Now, step 1 isn’t available in the web UI, but I’d argue it shouldn’t be in general, as it’s not something that should be done unless there is a very good reason, and it can’t be added until there is a solution to https://github.com/syncthing/syncthing/issues/3876. However, I think it would be a sensible request to extend the -reset-database command line option to optionally take a parameter specifying the folder(s) to be reset.
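
To make that concrete, a minimal sketch of scripting those two calls against the documented endpoints could look like this; the address, API key and folder ID are placeholders for your own setup:

```python
# Sketch of the two REST calls above. ADDRESS, API_KEY and FOLDER_ID are
# placeholders; the API key can be found under Settings > General (or in
# config.xml).
import urllib.request

ADDRESS = "http://localhost:8384"
API_KEY = "your-api-key-here"
FOLDER_ID = "default"

def post(path: str) -> None:
    req = urllib.request.Request(
        ADDRESS + path,
        method="POST",
        headers={"X-API-Key": API_KEY},
    )
    with urllib.request.urlopen(req) as resp:
        print(path, "->", resp.status)

# 1. Reset the index database for just this folder. The reset restarts
#    Syncthing, so wait for it to come back up before step 2.
post("/rest/system/reset?folder=" + FOLDER_ID)

# 2. Revert the folder to the global state (as documented, this applies to
#    receive-only folders; the "Revert Local Changes" button does the same).
post("/rest/db/revert?folder=" + FOLDER_ID)
```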


The cumbersome way is available in the GUI: remove the folder and add it back (retaining the folder ID etc.).


I really love this idea. It would add more confidence knowing that data is bit-identically synced to one or more devices. And it would reveal or even prevent bit rot, or bit flips on storage without ECC protection. ZFS-style added data security, conveniently built in. Maybe you could add a per-folder setting: “Full rescan & rehash after 30 days”. Or, for data lying on an internet server, a per-device full rescan and rehash every 90 days. It’s something you could even manage remotely with Arigi. Rehash and rescan everything, wait x hours, and you can sleep better =).

I actually see the value of this more and more the longer I think about it. You first need to ensure that the folder is fully up to date, then stop any index exchange/syncing during the process, then scan first (thus picking up modifications that change the file modtime and/or size, but not bit rot and the like), then rehash and replace any detected changes with data from remotes; that way you can ensure the integrity of your data. However, it’s also a fairly complex and potentially data-destroying process (I am sure there is software out there that changes file contents without changing the modtime, and e.g. does a file modified through a hardlink get its modtime updated?), which I don’t want a user to run unless they really understand what they are doing. And compared to understanding and assessing the risk, doing (scripting) a few REST calls takes less time.

Yes. Modification time is a property of the file, not the link.

There is an everyday example of different file content with the same modification time. You could try the following:

  • Create a .XLS file with Microsoft Excel, and close it.
  • Create a copy of that file.
  • Open the file, edit something, and close WITHOUT saving. Now you have two different files, but with the SAME modification time.

I saw this for the first time 10 years ago with the 2000 version. By chance I was able to try it with the 2018 version and I see the same issue. Personally, I stopped using it 10 years ago.

This example creates the problem discussed in this post.
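
For anyone who wants to reproduce the effect without Excel, here is a generic sketch (with made-up file names): a copy gets one byte changed in place and its original timestamps restored, so size and modification time match while the hashes differ.

```python
# Generic reproduction of "same size and mtime, different content".
# File names are made up for the example.
import hashlib
import os
import shutil

def digest(name):
    with open(name, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

with open("original.xls", "wb") as f:
    f.write(b"A" * 1024)

shutil.copy2("original.xls", "copy.xls")      # copies content and timestamps

with open("copy.xls", "r+b") as f:            # change one byte, same size
    f.seek(10)
    f.write(b"B")

st = os.stat("original.xls")                  # restore the original timestamps,
os.utime("copy.xls", ns=(st.st_atime_ns, st.st_mtime_ns))  # as in the Excel example

print("same size :", os.path.getsize("original.xls") == os.path.getsize("copy.xls"))  # True
print("same mtime:", os.path.getmtime("original.xls") == os.path.getmtime("copy.xls"))  # True
print("same hash :", digest("original.xls") == digest("copy.xls"))  # False
```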


In my situation, where more than one person works on the same shared folder, it is not practical to prevent any changes to files during the process. I suggest splitting the process atomically, file by file, and completing it for each single file. And using, for example, file locking to lock the single file during the hashing process and the possible re-download.

You cannot be sure that the local file is the wrong one. I think this could be implemented by renaming the local file to a sync-conflict file and re-downloading the file.

Taking an example from the operation of RAID1, a sort of background re-hashing function could be implemented that recalculates the hash of every file. The speed of recalculation, or the maximum CPU load for that function, should be configurable. Or there could be an automatic selection based on the time within which I want all files to be rehashed, letting it adjust the speed and CPU load to achieve that.
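
As a rough illustration of that idea (nothing like this exists in Syncthing today), a background rehash could simply throttle itself to a configurable byte rate; MAX_BYTES_PER_SEC below stands in for the suggested speed setting:

```python
# Sketch of a throttled background rehash: hash every file under a folder,
# but never exceed a configurable throughput so CPU and disk stay mostly free.
import hashlib
import time
from pathlib import Path

MAX_BYTES_PER_SEC = 20 * 1024 * 1024   # stand-in for the "speed" setting
CHUNK = 1 << 20                        # read 1 MiB at a time

def rehash_throttled(root: Path) -> dict:
    hashes = {}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        h = hashlib.sha256()
        with path.open("rb") as f:
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                started = time.monotonic()
                h.update(chunk)
                # Sleep so each chunk takes at least its share of the byte budget.
                budget = len(chunk) / MAX_BYTES_PER_SEC
                spent = time.monotonic() - started
                if spent < budget:
                    time.sleep(budget - spent)
        hashes[str(path.relative_to(root))] = h.hexdigest()
    return hashes
```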

I tried this on Win 7 with Office 2010, and both files are identical and have the same modification time.

I retried it just now on Win 7 with Excel 2016 and the problem is there, but I had not remembered the steps correctly.

If you modify, close, and click “Don’t save”, the problem does not occur. BUT if you do not modify anything, that is, you only open and close Excel, the problem occurs. Try it.

I have no idea what you guys are talking about. :slight_smile: Does this have some kind of bearing on the file integrity verification idea?

Yes, because this simple operation creates a situation where I think Syncthing does not see the change, since the modification time is the same but the file contents and hash are different.

Apart from this example, can the “background hashing” proposal be useful?

If you mean that Excel changes some metadata in the file while retaining the timestamp, sure, that’s a thing that happens. Another typical example is music files, where editing the artist tag etc. often retains the modtime as well, and this can briefly confuse Syncthing.

Hence,

This is known.

For this reason, a procedure should be implemented to detect this case. I am thinking about:

  • In the case of a notification from the watcher, the hash should always be calculated, even when the modification time is the same. Perhaps optional.
  • “Background hashing” could be implemented. It could be activated via an option.

Piping in a bit late. I thought the idea in this post’s title was a good one.

But I haven’t seen anyone mention the same reason I have for “why”, even though it seems unlikely I’d be alone in it.

A couple of years ago I started looking into bittorrent-based file syncing for transferring the work product of remote video shoots. And more recently, I restarted a similar investigation for maintaining a local backup mirror (in addition to regular cloud backup), for 7 TB of data growing at 1.5x per year.

This is probably bias or prejudice, but I was under the impression that bittorrent was not a reliable protocol, because the two clients I had prior experience with, Transmission and Deluge, both needed a final, manual “verify” step to guarantee something as simple as downloading the latest Xubuntu ISO. (Because if you didn’t do that last step, maybe a quarter of the time it would be a corrupt - perhaps incomplete - ISO image.)

I’ve always been aware that this may not have been a protocol problem, but rather a client implementation problem. Either way, the perception was there, and it gave me extra pause when considering bittorrent-based solutions.

The point is, I’d wager I’m not alone in the perception that bittorrent is a “not 100% reliable” protocol, whether fair or not, and that Syncthing’s (or Resilio’s) guts might be similar enough in some way, as to have the same problem.

In my testing of Syncthing, involving scripted blake2 checksum compares, I noticed no such problems. Syncthing was solid and didn’t have the same corruption problems as the bittorrent clients.
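
The checks were roughly along these lines (a simplified sketch of the approach, not the exact script): hash every file under the source and the synced copy with BLAKE2 and report any path whose digests differ or that exists on only one side.

```python
# Simplified version of the comparison: hash two directory trees with BLAKE2
# and report mismatches. The paths are placeholders.
import hashlib
from pathlib import Path

def tree_digests(root: Path) -> dict:
    digests = {}
    for path in root.rglob("*"):
        if path.is_file():
            h = hashlib.blake2b()
            with path.open("rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[str(path.relative_to(root))] = h.hexdigest()
    return digests

def compare_trees(src: Path, dst: Path) -> None:
    a, b = tree_digests(src), tree_digests(dst)
    for rel in sorted(a.keys() | b.keys()):
        if rel not in a or rel not in b:
            print("only on one side:", rel)
        elif a[rel] != b[rel]:
            print("content differs :", rel)

compare_trees(Path("/data/source"), Path("/mnt/mirror"))
```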

I also use redundant checksummed storage, with redundant disks, controllers, and ECC memory. So I’m not very concerned about the risks mentioned in other comments, about flakey storage.

(And I agree that if that’s a concern, one shouldn’t rely on every [or any] application to verify reliable storage of its related user data, and there are at least two really good solutions to noticing [and optionally auto-repairing] bitrot and other corruption problems.)

For me, it just would have helped ease some incorrect initial prejudices early on in my research if there were an easy way to manually trigger a bit-for-bit (or checksum) verify. Or better yet, an easily configurable way to tell it to do so automatically and periodically.

I wouldn’t expect this alternative viewpoint to bump up such a feature request in the queue. (I could think of countless more important things to tackle first.) But, for what it’s worth.

FWIW, Syncthing has nothing to do with BitTorrent, protocol-wise or otherwise, and already uses SHA-256 hashing to verify that the correct data was transferred.

You may not trust that it does so correctly, but then you should not trust a verification feature built into Syncthing either.

Thanks for the feedback.

Interesting. It might be useful to point that out more prominently early in the documentation. (But since I don’t have the bandwidth or impetus to do so myself, take it for what it is.)

I can’t assume that just because I believed Syncthing was based on bittorrent for some reason, others might too. But the reasons I believed that don’t seem far-fetched: Resilio is derived from BitTorrent Sync - “BitTorrent” is right there in the name. Syncthing, judging by various internet tech forums, is widely viewed as a direct open-source competitor to BitTorrent Sync, has a very similar feature set and technology base, and addresses an identical (or at least nearly so) problem domain.

I’ve read everything on the doc site (granted, weeks ago) and I don’t recall this being addressed. A search for “bittorrent” on it yields only one result, in the FAQ section, which just says that Resilio and Syncthing are different - framed generically as proprietary code with unknown security properties vs. open source.

So… it might help to make it clearer that Syncthing is not based on the BitTorrent protocol. (And maybe even a table of how they are similar and how they differ, for technophiles.)

Just a thought. Maybe I’m the only one who was confused.

Post hoc ergo propter hoc. Or maybe just a non sequitur. Either way, flawed logic.

We all deal with software on a near-daily basis that has some broken functionality, usually minor. (And those bugs are usually marked low-severity if they’re not catastrophic and have easy, possibly obvious workarounds.) We learn pretty quickly how to work around or deal with them, if that’s possible and it’s otherwise worth it - often by cobbling multiple utilities together. Bugs are just the nature of extremely complex things involving the work of numerous people, and few people throw the baby out with the bathwater for something that’s otherwise uniquely useful.

Another direct, real-world refutation of that assertion - in fact, it couldn’t be more direct - is the example I just gave earlier about Transmission and Deluge. They both routinely resulted in corrupt files, but at least one (if not both - I don’t recall) had a visibly exposed option to “verify integrity”. It only took one corrupt download before I remembered seeing that option and figured out I should use that feature. It only took a few successes after that before I realized it was obviously some alternate code path that was reliable, and that I should always treat downloads as a two-step process. (In fact I only use it maybe twice a year, and even then it’s easy to remember to do so.)

I also remember my days of pirating mp3s on Napster, and how generally unreliable that was (though not necessarily due to flawed implementation). Anything in roughly the same orbit as p2p file-sharing protocols is automatically tainted (in my mind and my friends’ at the time, at least) with “inherently unreliable”. People are weird like that, and the same phenomenon (associating attributes among unrelated things, even across time and even if irrationally so) is why advertising works.

But that doesn’t mean an option explicitly worded as acknowledging a “flaw” (real or perceived) and specifically intended to overcome it shouldn’t be trusted.

Anyway. It’s probably not important to either of us to split hairs over semantics. I don’t have the bandwidth to improve the Syncthing documentation, nor submit a PR. (Let alone learn Go and the Syncthing codebase.) And I’m almost done with my coffee.

I’m fine just being a consumer of this and taking what I can get. I’ve already invested a fair amount of time cobbling together other tools and code to solve my own most pressing and much narrower problem of local mirroring while treating the destination as a random bucket of file-based, checksummed content that happens to have arbitrary metadata like paths, filenames, timestamps, xattrs, etc. (While striving, first and foremost before any other goal, to avoid transferring file contents over the wire.) But Syncthing will remain among the top of my “go-to” list of solutions to consider for future challenges I come across at work or home! (So I’m cheering for it. And this investment of time on the forum only tricks my brain into having a more vested interest in it, in spite of doing nothing to actually improve it. Kind of like a sports fan. :grinning:)

I see your reasoning, but I don’t want to point out that Syncthing doesn’t use the BitTorrent protocol. I also don’t want to point out that it doesn’t use the rsync protocol, or use ssh as a transport, or have unison-format indexes. There are many things Syncthing is not, and exhaustively listing them does us little good, in my opinion.
