File Integrity Verification

This feature request arose from this thread:

The idea is that the user has the option to manually trigger a checksum-based file verification process, in case the user suspects anything went wrong with the local data.

After pressing the new “Verify all files” button in the UI, Syncthing would:

  • warn the user that any changes to files during that process may be lost
  • hash every file and check it against the checksum in the index
  • re-download all files where there is a checksum mismatch
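
To make the idea concrete, here is a rough Go sketch of such a verification pass. The `storedHashFor` lookup is only a placeholder for the index (which in reality stores per-block hashes, not one hash per file), so treat this as an illustration rather than how Syncthing would actually implement it:

```go
package verify

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// VerifyFolder rehashes every file under root and returns the relative paths
// whose content no longer matches the checksum recorded in the index.
// storedHashFor stands in for an index lookup; the real index keeps
// per-block hashes, so a whole-file SHA-256 is a simplification.
func VerifyFolder(root string, storedHashFor func(rel string) (string, bool)) ([]string, error) {
	var mismatched []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil || d.IsDir() {
			return walkErr
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		want, ok := storedHashFor(rel)
		if !ok {
			return nil // not in the index, nothing to verify against
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		h := sha256.New()
		if _, err := io.Copy(h, f); err != nil {
			return err
		}
		if hex.EncodeToString(h.Sum(nil)) != want {
			mismatched = append(mismatched, rel) // candidate for re-download
		}
		return nil
	})
	return mismatched, err
}
```

Everything collected in `mismatched` would then be pulled again from remote devices instead of being announced as a local change.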

Read-only nodes could execute this function periodically via an API call in a cron job, which would then also report broken and fixed files. This could also mitigate bit rot on primitive filesystems.


Issue #1315 on GitHub already discusses this, and the idea was dismissed with the following main points:

  • (A) This is something the filesystem should handle.
  • (B) There is no way to detect if changes were made by the user and therefore should be synced, or if the data should be corrected.

Also, Audrius Butkevicius’ comment from the thread where this came up:

I don’t think verify feature makes sense. If the file changed we’d know (mtime and size changes), if it hasn’t changed, what’s the point of verifying? What are we trying to catch here? Bad drives? That’s not really syncthing’s problem.


I would like to comment on the two points made on GitHub, as well as on Audrius’ comment.

First, the easy one:
(B) Because this would be a manually triggered feature, the user has to start it and can be reminded not to change any files during the process.

Now the tricky one:
(A) Yes, you are right. Theoretically, the filesystem is responsible for the integrity of the files. There are really great filesystems out there that can totally handle all these problems.

But let’s look at this from a practical perspective for a moment. For most users, Syncthing is a kind of DIY thing. They don’t trust cloud providers and their proprietary software enough to push all their data to some cloud. Or they are tech-savvy people who use Syncthing for their personal home network. Or just office colleagues who want to sync data.

Most of these people have no idea about the fancy filesystems that would solve integrity problems. Such filesystems are way too complex and time-consuming to set up and maintain.

My hypothesis is that the majority of Syncthing users do not use self-healing filesystems, and therefore the argument that “this is the filesystem’s job” does not hold in practice.

I’d like to suggest that the maintainers evaluate this feature carefully and give it due consideration. I urge you to decide what is best for the project and listen to the community, and that may also mean denying this feature request. I am not trying to push you into this; I am just contributing an idea.

Unless I am mistaken, this can already be done; it’s just not a single API call:

  1. Reset the database for the folder in question (https://docs.syncthing.net/rest/system-reset-post.html)
  2. Reset the folder to global state with the revert endpoint (https://docs.syncthing.net/rest/db-revert-post.html or button in GUI)
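
For anyone who wants to script this, a minimal Go sketch of those two calls; the address, API key and folder ID are placeholders, and note that per the docs the revert endpoint only applies to receive-only folders:

```go
package main

import (
	"fmt"
	"net/http"
)

const (
	apiURL = "http://localhost:8384" // default GUI/REST address
	apiKey = "YOUR-API-KEY"          // Settings -> General, or config.xml
	folder = "your-folder-id"
)

// post sends an empty-body POST to a Syncthing REST endpoint.
func post(path string) error {
	req, err := http.NewRequest("POST", apiURL+path, nil)
	if err != nil {
		return err
	}
	req.Header.Set("X-API-Key", apiKey)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s: %s", path, resp.Status)
	}
	return nil
}

func main() {
	// 1. Drop the index data for this folder; Syncthing restarts and rescans.
	if err := post("/rest/system/reset?folder=" + folder); err != nil {
		panic(err)
	}
	// 2. Revert the folder to the global state (receive-only folders only).
	//    In a real script you would wait for the restart and rescan to
	//    finish before issuing this call.
	if err := post("/rest/db/revert?folder=" + folder); err != nil {
		panic(err)
	}
}
```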

Now, step 1 isn’t available in the web UI, but I’d argue it shouldn’t be in general: it’s not something that should be done without a very good reason, and it can’t be added until there is a solution to https://github.com/syncthing/syncthing/issues/3876. However, I think it would be a sensible request to extend the -reset-database command line option to optionally take a parameter specifying the folder(s) to be reset.


The cumbersome way is available in the GUI: remove the folder and add it back (retaining the folder ID etc.).


I really love this idea. It would add confidence that data is bit-identically synced to one or more devices. And it would reveal, or even prevent, bit rot and bit flips on storage without ECC protection. ZFS-style added data security, super convenient. Maybe you could add a per-folder setting: “Full rescan & rehash after 30 days”. Or, for data lying on an internet server, a per-device full rescan and rehash every 90 days. It’s something you could even manage remotely with Arigi. Rehash and rescan everything, wait x hours, and you can sleep better =).

I actually see the value of this more and more when thinking about it. You first need to ensure that the folder is fully up to date, then stop any index exchange/syncing during the process. Then you scan first, thereby picking up modifications that come with a change of file modtime and/or size, but not bit rot and the like. Then, by rehashing and replacing any detected changes with data from remotes, you can ensure the integrity of your data. However, it’s also a fairly complex and potentially data-destroying process (I am sure there is software out there that changes file contents without changing the modtime, and e.g. does a file modified through a hardlink get its modtime updated?), which I don’t want a user to do unless they really understand what they are doing. And compared to understanding and assessing the risk, doing (scripting) a few REST calls takes less time.
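
To illustrate, a rough sketch of such a scripted sequence; the pause, scan and status endpoints are documented, but whether this ordering alone is safe or sufficient is exactly the open question, and the constants are placeholders:

```go
package main

import (
	"encoding/json"
	"net/http"
	"time"
)

const (
	apiURL = "http://localhost:8384"
	apiKey = "YOUR-API-KEY"
	folder = "your-folder-id"
)

// call issues a request against the REST API and optionally decodes the
// JSON response into out.
func call(method, path string, out interface{}) error {
	req, err := http.NewRequest(method, apiURL+path, nil)
	if err != nil {
		return err
	}
	req.Header.Set("X-API-Key", apiKey)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if out != nil {
		return json.NewDecoder(resp.Body).Decode(out)
	}
	return nil
}

// waitIdle polls the folder status until Syncthing reports it as idle.
func waitIdle() {
	for {
		var st struct {
			State string `json:"state"`
		}
		if err := call("GET", "/rest/db/status?folder="+folder, &st); err == nil && st.State == "idle" {
			return
		}
		time.Sleep(5 * time.Second)
	}
}

func main() {
	call("POST", "/rest/system/pause", nil)           // stop exchanging data with all devices
	call("POST", "/rest/db/scan?folder="+folder, nil) // pick up ordinary changes first
	time.Sleep(2 * time.Second)                       // give the scan a moment to register
	waitIdle()                                        // wait for scanning to finish
	// ...then reset/revert as in the earlier post, and finally bring the
	// devices back with POST /rest/system/resume.
}
```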

Yes. Modification time is a property of the file, not the link.

Here is an ordinary example of different file content with the same modification time. You could try the following:

  • Create a .XLS file with Microsoft Excel, and close it.
  • Create a copy of that file.
  • Open the file, edit something, and close WITHOUT saving. Now you have two different files, but with the SAME modification time.

I saw this for the first time 10 years ago with the 2000 version. By chance I could try it with the 2018 version and I see the same issue. Personally, I stopped using it 10 years ago.

This example creates the problem described in this post.


In my situation, where more than one person works on the same shared folder, it is not practical to prevent any changes to files during the process. I suggest splitting that process atomically, file by file, and completing it for each single file, and using, for example, file locking to lock each file during the hashing process and the possible re-download.

You cannot be sure that the local file is the wrong one. I think this could be implemented by renaming the local file to a sync-conflict file and then re-downloading the file.

Taking an example from the operation of RAID 1, a sort of background re-hashing function could be implemented that recalculates the hash of all files. The speed of the recalculation, or the maximum CPU load of that function, should be configurable. Or there could be an automatic selection based on the time within which I want all files to be rehashed, so that it can adjust the speed and CPU load accordingly.
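
A minimal sketch of the throttling part of that idea, assuming a simple bytes-per-second cap; this is not how Syncthing schedules its hashing, just an illustration:

```go
package rehash

import (
	"crypto/sha256"
	"io"
	"os"
	"time"
)

// throttledReader wraps a reader and caps throughput at roughly bytesPerSec,
// so a background rehash does not saturate the disk or the CPU.
type throttledReader struct {
	r           io.Reader
	bytesPerSec int
}

func (t *throttledReader) Read(p []byte) (int, error) {
	if len(p) > t.bytesPerSec {
		p = p[:t.bytesPerSec]
	}
	start := time.Now()
	n, err := t.r.Read(p)
	// Sleep so that reading n bytes takes at least n/bytesPerSec seconds.
	wait := time.Duration(float64(n)/float64(t.bytesPerSec)*float64(time.Second)) - time.Since(start)
	if wait > 0 {
		time.Sleep(wait)
	}
	return n, err
}

// RehashFile computes the SHA-256 of a file at a limited rate.
func RehashFile(path string, bytesPerSec int) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, &throttledReader{r: f, bytesPerSec: bytesPerSec}); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}
```

The “rehash everything within N days” variant would then derive `bytesPerSec` from the total folder size divided by the allowed time.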

I tried this on Win 7 with Office 2010, and both files are identical and have the same modification time.

I retried it just now on Win 7 with Excel 2016 and the issue is there, but I did not remember the steps correctly.

If you modify, close, and click “Don’t save”, the problem does not occur. BUT if you do not modify anything and only open and close Excel, the problem occurs. Try it.

I have no idea what you guys are talking about. :slight_smile: Does this have some kind of bearing on the file integrity verification idea?

Yes, because this simple operation creates a situation where I think Syncthing does not see the change, because the modification time is the same but the file and its hash are different.

Apart from this example, can the “background hashing” proposal be useful?

If you mean Excel changes some metadata in the file while retaining the timestamp, sure, that’s a thing that happens. Another typical example is music files, where editing the artist etc. often retains the modtime as well, and this can briefly confuse Syncthing.

Hence,

This is known.

For this reason, a procedure should be implemented to catch such cases. I am thinking of:

  • In the case of a notification from the watcher, the hash should always be calculated, even if the modification time is the same. Perhaps optional. (A small sketch follows after this list.)
  • “Background hashing” could be implemented. It could be activated via an option.
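
As a sketch of the first bullet: when the watcher reports a path, compare content hashes instead of trusting size and modtime alone. The `storedHash` argument stands in for the index lookup and is not a Syncthing API:

```go
package watchcheck

import (
	"bytes"
	"crypto/sha256"
	"io"
	"os"
)

// ContentChanged reports whether a file's content differs from the hash
// recorded in the index, regardless of whether size or modtime changed.
// It would be invoked for every path reported by the filesystem watcher;
// storedHash stands in for the index lookup.
func ContentChanged(path string, storedHash []byte) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return false, err
	}
	return !bytes.Equal(h.Sum(nil), storedHash), nil
}
```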