Be tolerant of moves, renames, and copies, without network transfer, based on checksums

jim-collier · October 1, 2019, 10:32pm

Currently, Syncthing cannot handle moved or renamed files, without transferring them over the network as if they never existed on one side before.

In my case (and clearly for others based on searching the forum and github page), that makes it a not-viable solution for syncing large amounts of data, when high-level folders containing GB or TB of data, can be and often are renamed. (In my case it’s part of my workflow. Also in my case, I’m only concerned with one-way sync. aka mirror, but that’s not necessarily relevant to this request.)

Some utilities, like rclone with --track-renames can do this, based on file content checksums. Also, chunk-based backup utilities do it, while a very different problem domain, by largely not caring about any notion of “files” anyway, at least beyond connecting paths to chunks in a database, and restoring correctly.

Even trusty old rsync can be convincingly tricked into being fully tolerant of moves and renames, with surprising ease, via a wrapper script that updates “shadow” directories, such as ‘hrsync’.

What they have in common is caring more about checksums than metadata (i.e. filename, directory location, timestamps, security, etc.). The former deals with whole file content checksums, the latter with chunk content.

Syncthing could be far more efficient (in this particular way at least), if it paid more attention to file content checksums.

For example, let’s say we have a one-way sync situation set up, for simplicity.

FileA.doc exists on both source and target.
A human renames source/FileA.doc to source/FileB.doc.
Syncthing:
- Source: Notices (via periodic scan or inotify) that FileB.doc newly exists, so it scans content for checksum and stores it locally.
- Target: Notices that source/FileB.doc exists but target/FileB.doc doesn’t, so it checks for existence of source/FileB.doc’s same content checksum anywhere on target.
- Target: Finds target/FileA.doc with same checksum as source/FileB.doc.
- Target: Notices that source/FileA.doc no longer exists.
- Target: Renames target/FileA.doc to target/FileB.doc, and syncs other metadata to match.

No file content needed to be transferred, only a single checksum.
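The steps above can be sketched roughly as follows. This is only an illustration of the proposal, not how Syncthing is implemented: `plan_renames` is a hypothetical helper, each side is modelled as an in-memory dict of path to content, and SHA-256 is an assumed choice of whole-file checksum.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Whole-file content checksum (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def plan_renames(source: dict, target: dict) -> list:
    """Return (old_path, new_path) renames the target could do locally.

    source and target are hypothetical in-memory stand-ins that map
    a path to that file's bytes.
    """
    # Index target files that no longer exist on source, keyed by checksum.
    orphans = {}
    for path, data in target.items():
        if path not in source:
            orphans.setdefault(checksum(data), []).append(path)

    renames = []
    for path, data in source.items():
        if path in target:
            continue  # already present on target under the same name
        candidates = orphans.get(checksum(data))
        if candidates:
            # Same content exists on target under a name source dropped.
            renames.append((candidates.pop(), path))
    return renames


source = {"FileB.doc": b"report contents"}
target = {"FileA.doc": b"report contents"}
print(plan_renames(source, target))  # [('FileA.doc', 'FileB.doc')]
```

Only the checksum has to cross the network in this model; the rename itself is a local operation on the target.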

Notice that it doesn’t matter what, where, or why target/FileA.doc originally was. All we care about is that 1) FileA.doc no longer exists on source, 2) source/FileB.doc exists on source but not target, and has the same checksum as target/FileA.doc.

In other words, imagine both source and target filesystems as buckets of not necessarily unique content and oh by the way, each with corresponding metadata.

Then, similar logic can be applied to all kinds of operations, not just 1:1 renames. For example:

Move: Same logic as rename, except the metadata being changed is the containing folder, not filename.
User makes a copy on source side, file “A.doc” to “A (copy).doc”: Since the checksum of “source/A (copy).doc” already exists on target (because “target/A.doc” had already been synced before), the copy can be performed solely on target, without transferring content over the network. (Or even better, if copy --reflink=always is supported on target, then the copy can be near-zero-cost in terms of time and storage, while still maintaining all unique metadata and ability for content to diverge later.)
File is renamed on source, but it’s ambiguous which file that was on the target: It doesn’t matter. Since we’re focusing only on content checksum, and metadata like paths are ancillary things that need to either be copied, changed, or created - either just do a copy only on source (based on checksum), then delete whatever needs to be after that, or rename the first file encountered (or a random one) that has the same checksum, but exists on source and not target, then deal with the rest either via copies only on target, and/or deletes.
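A rough sketch of how the same checksum index could cover moves, copies, and the ambiguous case. Everything here is hypothetical (`plan_operations`, the operation tuples, the dict-based file model), and it assumes a path present on both sides is already in sync, so in-place edits are ignored.

```python
import hashlib
from collections import defaultdict


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def plan_operations(source: dict, target: dict) -> list:
    """Plan target-side operations so that target ends up mirroring source.

    Returns tuples: ("rename", old, new), ("copy", existing, new),
    ("transfer", new) or ("delete", old).
    """
    # checksum -> target paths that currently hold that content
    holders = defaultdict(list)
    for path in sorted(target):
        holders[digest(target[path])].append(path)

    ops = []
    renamed_away = set()
    for path in sorted(source):
        if path in target:
            continue  # assumed in sync already (sketch ignores edits)
        h = digest(source[path])
        # Holders that source no longer has are candidates for a rename/move.
        stale = [p for p in holders[h] if p not in source]
        if stale:
            old = stale[0]  # ambiguous? just take the first one
            ops.append(("rename", old, path))
            renamed_away.add(old)
            holders[h][holders[h].index(old)] = path
        elif holders[h]:
            # Content exists on target under a name source still uses: copy.
            ops.append(("copy", holders[h][0], path))
        else:
            ops.append(("transfer", path))  # genuinely new content

    # Whatever is left on target but not on source gets deleted last.
    for path in sorted(target):
        if path not in source and path not in renamed_away:
            ops.append(("delete", path))
    return ops


source = {"A.doc": b"x", "A (copy).doc": b"x", "docs/B.doc": b"y", "new.doc": b"z"}
target = {"A.doc": b"x", "B.doc": b"y", "old.doc": b"w"}
for op in plan_operations(source, target):
    print(op)
```

In this example only `new.doc` needs its content sent; the copy and the move are satisfied from data the target already holds.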

Nummer378 · October 1, 2019, 10:48pm

You should make yourself familiar with how syncing works in Syncthing. It’s block based, meaning that files consist of blocks which are transferred. Blocks/files are hashed, so much of the things you mention are already implemented (for example, duplicate blocks are only transferred once). On the statistics page you can see that quite some data is re-used by renaming or reusing blocks from other files.
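To illustrate the general idea of block reuse (a toy model only, not Syncthing's actual code or protocol): the receiver gets a list of block hashes for the wanted file and only needs to fetch blocks it cannot find in any local file. The 4-byte block size and the helper names are assumptions made for the example; real block sizes are far larger.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose, for illustration only


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash each fixed-size block of a file."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def blocks_to_fetch(wanted_hashes: list, local_files: dict) -> list:
    """Return indexes of wanted blocks that no local file already contains."""
    have = set()
    for data in local_files.values():
        have.update(block_hashes(data))
    return [i for i, h in enumerate(wanted_hashes) if h not in have]


local = {"old/name.bin": b"AAAABBBBCCCC"}

# A renamed file: every block is already available locally.
print(blocks_to_fetch(block_hashes(b"AAAABBBBCCCC"), local))  # []

# A file with one changed block: only that block must come over the network.
print(blocks_to_fetch(block_hashes(b"AAAAXXXXCCCC"), local))  # [1]
```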

The thing is, there are some issues why moved files don’t always work as expected (they work a bit like “duplicate first, then delete old file”). For me, it usually works without a re-transfer. The specific issues that Syncthing currently has with renames/moves are best explained by someone else, not me.

AudriusButkevicius · October 1, 2019, 11:27pm

Large directory tree renames (more than 250 items) might end up as copies, but that’s mostly because people want downloads to start asap, that is before the scan is even finished.

Identifying during a scan that a new file has appeared might happen much earlier than detecting that an equivalent file is gone (implying a rename), and because the new entry is sent asap, other devices end up making a copy and a delete sometime later, as opposed to a rename, as the delete might arrive much much later.

And I agree, it was probably worth spending a few minutes investigating how Syncthing works, as now you spent quite some time writing up a comprehensive proposal for something that (in my belief) is mostly there.

jim-collier · October 2, 2019, 7:36am

I did. In fact I read every word of the documentation on https://docs.syncthing.net/, and quite a bit of github issues and forum posts here. (As part of a months-long research project involving multiple projects.)

And I tested it, exhaustively and objectively, for several days. In no case did I find that a file or folder rename or move, did not go through a transfer-then-delete cycle. I tested with a little bit of data, I tested with a lot of data. GBs and TBs. I tested it with pre-seeded data, I tested it with an empty target.

I understand how it’s supposed to work. But no matter what I tried, it never was able to avoid transferring files over the wire. Even for simple renames in the same folder on the source side.

The version of Syncthing tested:

syncthing v1.2.2 "Fermium Flea" (go1.12.9 linux-amd64) deb@build.syncthing.net 2019-08-15 13:51:09 UTC

I had my source configured as send-only, and target configured as receive-only. (It’s been a couple of weeks since I wrapped the tests up and moved on, so I may be misremembering the specific terminology.)

My use-case is pretty simple: Keep a one-way sync to a local mirror reasonably up to date. Like I said, part of my workflow involves regularly moving and/or renaming folders with GB to TB of content underneath. I’ve already moved on to a solution that handles it very well.

I just thought Syncthing was a pretty neat piece of software other than not working for my requirements, in spite of the documentation (and many post comments like these replies) suggesting it does. So I thought this might help. And granted it’s been several weeks if not a couple of months since I read through the technical documentation, and yes, now I do recall that it is block-based therefore a file-based checksumming approach doesn’t necessarily make as much sense.

Either way, I wasn’t sure whether to flag it as a feature request, or bug. Because variations on the problem I was seeing, as a bug report or question, have already been posted numerous times.

Thanks for the feedback.

AudriusButkevicius · October 2, 2019, 7:44am

You haven’t specified the exact steps of what you tried and the steps you used to verify it did not work, so sadly there is not much I can help reason about here.

Yes, there are probably edge cases where it does not work, but that should be an exception, not the norm.

If you want someone to explain this properly, I suggest you produce exact steps of reproduction, as now it’s not clear what you did and how you verified it.

Reusing blocks from other files (avoiding network transfer) has close to no distinction in the UI and the filesystem, making it look like it’s transferring stuff over the network, when it’s not. Perhaps that is what you saw and were confused by.

jim-collier · October 2, 2019, 8:21am

Definitely not. I throttled the network, watched and logged the packets on both systems, and dissected both Syncthing logs after every test. And with a file watcher I could literally see new files being slowly written, then old ones deleted, for every rename.

I think you could empathize that it’s not worth it for me to reinstall and reconfigure everything (which as you know can be quite time-consuming). Like I said, I wouldn’t change back anyway now that I have an alternative solution that works well (and took forever to find, test, and script). I only logged this because it’s a brilliant piece of software that I’d like to help in some way. (Even though this has proved to be the opposite.)

So, you can consider this either “unfair criticism” because I can’t be bothered to invest many more hours to set up my testbench again; or “unhelpful” because I’m saying it didn’t do what you’re saying it should have, but I didn’t keep the log outputs etc. to prove it.

Fair enough, so take it with a grain of salt, and don’t waste any more time on it. I’m sure I’ll find a different use for Syncthing some time in the future (including possibly rebooting the first round of research I did on bulk video transfer that I wound up having to punt on and use SFTP).

It’s an impressive piece of software, keep up the good work!

jim-collier · October 2, 2019, 7:27pm

I should also add that I believe your description of what should normally happen, and how/why. (You’re a maintainer, why would I not?) I also accept your implicit suggestion that I was doing something incorrectly - either in the test execution, the observations, and/or analyses. (The former seems much more likely.)

Maybe this might just boil down to a “wish” of making “never transfer data over the wire if at all possible” a higher, or highest priority. Possibly involving scanning for checksums on both/all sides earlier in the process, and/or more frequently, aggressively, and/or making it a prerequisite for any other operation.

I readily acknowledge I’m less informed or even ignorant of many other use-cases for Syncthing, beyond my own narrow case of maintaining a one-way mirror that doesn’t have to be real-time as long as it eventually happens on its own. I’m also aware that I’m willing to trade frequent and lengthy read cycles to generate checksums, for less data transferred over the wire. (Though presumably, files aren’t scanned for checksums again unless size or mtime changed since the last scan. But just generating a list of files, and then inspecting those properties, can be pretty lengthy and intensive. And in my view, inotify - with adjustments - might be a useful adjunct to, but not a reliable replacement for, regular scans. Again, for my use-case, that’s all a worthwhile tradeoff.)
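The presumed "only re-hash when size or mtime changed" check could look something like this. It is a sketch of the general technique, not a description of what Syncthing's scanner does; `needs_rehash` and the `cache` dict are hypothetical.

```python
import os


def needs_rehash(path: str, cache: dict) -> bool:
    """Return True if the file's size or mtime differs from the last scan.

    cache maps path -> (size, mtime_ns) as recorded by the previous scan
    and is updated in place whenever a change is seen.
    """
    st = os.stat(path)
    fingerprint = (st.st_size, st.st_mtime_ns)
    if cache.get(path) == fingerprint:
        return False  # unchanged since last scan: skip the expensive read
    cache[path] = fingerprint
    return True
```

Even with such a check, every scan still has to walk the tree and stat each file, which is the lengthy part referred to above.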

Also, thanks!

iostrym · October 6, 2022, 7:11am

Hello what is the conclusion here ? Does it work or not ?

I did a rename of file from my phone and in the ‘new change’ of the web gui, I see the file is deleted and a new renamed file is added. So it seems that the rename wasn’t detected by the tool am I correct ?

calmh · October 6, 2022, 7:22am

No, that is how a rename is represented. In most cases it will happen without transferring any data.

iostrym · October 6, 2022, 10:34am

Ok thanks !

In a previous post it was said that sometimes it transfers data if the user doesn’t wait for the complete scan and ‘forces’ the sync.

Is it really possible using the android app ? Is there a risk that android app will delete file without checking that the file has been moved in another folder ?

I don’t want to activate this by mistake, so I would like to know how to force the sync like this (so that I don’t do it).

calmh · October 6, 2022, 10:49am

I don’t think that is something that you can do.

iostrym · October 6, 2022, 11:06am

So I don’t understand this

AudriusButkevicius · October 6, 2022, 11:24am

What’s your actual question?

calmh · October 6, 2022, 11:26am

If you rename a file, it will be handled efficiently. In some cases, if you rename an entire tree of lots of files, it won’t manage to handle it as efficiently. That falls under “in most cases” above.

iostrym · October 7, 2022, 8:09pm

I don’t understand why you said that " people want downloads to start asap, that is before the scan is even finished."

As if they can do it ? Can they ? Can people force the start of synchro before the scan is finished ? And have files transmitted again instead of being renamed ?

AudriusButkevicius · October 8, 2022, 9:03am

Yes, providing data and scanning are not exclusive operations. Downloading data and scanning are exclusive. You can’t force anything.

But if someone has a large folder, where they moved/renamed all of the files, they want the downloads to start as soon as the scan finds something, i.e., we detect effectively a “new” (in terms of path) file, and announce it to other devices immediately, so that other devices start downloading it.

Detecting renames would require not announcing anything until you’ve scanned all of the files, in order to be able to identify renames.

iostrym · October 9, 2022, 6:58pm

What I don’t understand in your answer: Is the download done before the scan or not? I have the feeling that you are saying it could be possible to have a download before the scan is finished.

calmh · October 9, 2022, 7:01pm

Concurrently. If there are a thousand files to scan, information about the first 100 might be sent out while we’re still scanning the remaining 900, and those 100 start downloading while we’re still scanning – before we’ve come to the part where we notice that 1000 files were also removed and this was all a rename.
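A toy model of that timing effect, under loud assumptions: the scanner's findings are announced in fixed-size batches, new paths are discovered before the matching deletions, and a receiver can only pair an addition with a deletion when both arrive in the same batch. `receiver_actions` and the event tuples are invented for the example and do not reflect Syncthing's real batching.

```python
def receiver_actions(events: list, batch_size: int) -> list:
    """events: ("add" | "del", path, content_hash) in scanner discovery order."""
    actions = []
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        # Deletions visible in this batch, keyed by the content they held.
        deleted = {h: p for kind, p, h in batch if kind == "del"}
        for kind, path, h in batch:
            if kind != "add":
                continue
            if h in deleted:
                # Addition and deletion seen together: a cheap rename.
                actions.append(("rename", deleted.pop(h), path))
            else:
                # The deletion has not been announced yet: treat as a new file.
                actions.append(("copy", path))
        for path in deleted.values():
            actions.append(("delete", path))
    return actions


# Three files renamed: the scanner finds the new names first, the removals last.
events = [("add", f"new/{i}", f"h{i}") for i in range(3)]
events += [("del", f"old/{i}", f"h{i}") for i in range(3)]

print(receiver_actions(events, batch_size=6))  # one batch: three renames
print(receiver_actions(events, batch_size=2))  # early batches: copies, then deletes
```

With the small batch size the end state is the same, but it is reached through copies followed by deletes rather than renames.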

iostrym · October 9, 2022, 7:05pm

OK, understood. And if that happens, are there problems or not?

AudriusButkevicius · October 10, 2022, 5:37pm

It’s designed to work this way, so why would there be problems? It does however explain why not all renames are detected.