Could Syncthing produce hard-links?

A quick idea that just came to mind. Sometimes there are shared folders with overlapping sets of connected devices. For example, one shared with my family and one with my in-laws, where my wife and I are in both groups. When I share pictures from an event via both shared folders, my wife will also get them twice on her devices.

So when pulling a file from a remote, Syncthing currently notices if the blocks are all the same as those of a file in the other shared folder. It then copies the existing data to a temporary file and, when done, renames it to the final path. How about taking a shortcut here and simply hard-linking the final path to the existing file (if it's on the same filesystem, otherwise falling back to copying)? This would save half of the required disk space.
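As a rough Go sketch of that shortcut (a hypothetical helper, not Syncthing's actual pull code; `linkOrCopy` and the temp-file naming are invented for illustration):

```go
package sketch

import (
	"io"
	"os"
	"path/filepath"
)

// linkOrCopy tries to hard-link the existing duplicate to the final path and
// falls back to a plain copy when linking fails (e.g. EXDEV when the two
// paths live on different filesystems).
func linkOrCopy(existing, finalPath string) error {
	if err := os.Link(existing, finalPath); err == nil {
		return nil // shortcut worked, no data copied
	}

	src, err := os.Open(existing)
	if err != nil {
		return err
	}
	defer src.Close()

	// Fallback: copy into a temp file and rename into place, mirroring the
	// usual temp-file-then-rename pull behavior described above.
	tmp, err := os.CreateTemp(filepath.Dir(finalPath), ".synctmp-")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless once the rename has succeeded

	if _, err := io.Copy(tmp, src); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), finalPath)
}
```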

If a remote then changes only one of the two identical files, Syncthing assembles a new temporary file and renames that to the final path, replacing the previous hard link. Kind of a copy-on-write behavior.
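As a sketch of that step (hypothetical; the key point is that a rename swaps a directory entry and never writes through the shared inode, so the other linked path is unaffected):

```go
package sketch

import "os"

// replaceBreakingLink assembles the changed content at a temporary path and
// atomically renames it over one of the linked paths. The other hard link
// keeps the old inode and content; its link count simply drops by one.
func replaceBreakingLink(finalPath string, newContent []byte) error {
	tmp := finalPath + ".tmp"
	if err := os.WriteFile(tmp, newContent, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, finalPath)
}
```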

Of course this should be an opt-in setting per folder, with a clear warning that it will create local files that may be linked to others, and that in-place local modifications will affect all linked duplicates. The obvious race conditions (a file not yet completed in one folder, so the second folder won't pick up the existing duplicate) could be mitigated by applying the hard-link logic during regular scans as well, to deduplicate after the fact.
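The after-the-fact pass might look roughly like this (hypothetical helper, not Syncthing code; it assumes content equality was already established, e.g. via the block hashes from scanning):

```go
package sketch

import "os"

// dedupIfIdentical replaces `replace` with a hard link to `keep`, assuming a
// prior check has already confirmed their contents are identical.
func dedupIfIdentical(keep, replace string) error {
	ki, err := os.Stat(keep)
	if err != nil {
		return err
	}
	ri, err := os.Stat(replace)
	if err != nil {
		return err
	}
	if os.SameFile(ki, ri) {
		return nil // already the same inode, nothing to do
	}
	// Link under a temporary name first so a failure never leaves `replace`
	// missing, then rename over the original atomically.
	tmp := replace + ".lnktmp"
	if err := os.Link(keep, tmp); err != nil {
		return err // e.g. EXDEV: different filesystems, skip this pair
	}
	return os.Rename(tmp, replace)
}
```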

Of course, the existing workaround is to regularly run a duplicate file scanner with hard-linking capability over all folders. I haven't tried, though, what Syncthing does when a file is replaced with a hard link to another identical file. And it might be appealing to reuse Syncthing's scanning and block hashing to save cycles and I/O, instead of doing it independently.

Would that make sense or do you see any obvious problems?

I think the biggest risk is that you don't realize one file is a link to the other. Then you modify one file without realizing you modified both files… Or you delete the “real” file on the filesystem without realizing that another file links to it.

I suppose you're really looking for disk deduplication. It's interesting, because I proposed a deduplication-related suggestion earlier; the difference was that it was only for encrypted folders, which are unlikely to be edited on the encrypted server.

Anyway, if your proposal were implemented, I wouldn't use it, for the reasons I mentioned in the first paragraph.

In addition to the above, we'd need to handle cases where the contents are the same but the metadata isn't. Presumably some metadata would be “don't care” (timestamps), while other metadata (permissions, ownership) might be quite relevant. Whatever we decide there would also affect future operations: we can't just change permissions on an existing file that's hard-linked somewhere else; we'd then need to make a copy instead, etc.
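To illustrate, a gate that only allows linking when the inode-shared metadata matches could look like this (hypothetical Unix-only sketch; `linkableMetadata` is made up, not Syncthing code):

```go
package sketch

import (
	"os"
	"syscall"
)

// linkableMetadata reports whether two files agree on the metadata that a
// hard link would force them to share: mode bits and ownership. Timestamps
// are deliberately treated as don't-care. Unix-only due to syscall.Stat_t.
func linkableMetadata(a, b string) (bool, error) {
	sa, err := os.Stat(a)
	if err != nil {
		return false, err
	}
	sb, err := os.Stat(b)
	if err != nil {
		return false, err
	}
	if sa.Mode() != sb.Mode() {
		return false, nil // permissions differ; linking would conflate them
	}
	ua, okA := sa.Sys().(*syscall.Stat_t)
	ub, okB := sb.Sys().(*syscall.Stat_t)
	if !okA || !okB {
		return false, nil // can't inspect ownership on this platform
	}
	return ua.Uid == ub.Uid && ua.Gid == ub.Gid, nil
}
```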

(All of this feels like it could be better handled by an external deduplication tool, IMHO.)

See also our copyrange stuff, which already does more or less all of this for supported filesystems while avoiding the associated pitfalls.
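For illustration, on Linux the underlying mechanism boils down to something like this (a minimal sketch using golang.org/x/sys/unix, not our actual fs-layer code; on filesystems such as Btrfs or XFS the kernel can share or clone extents instead of copying bytes through userspace):

```go
package sketch

import (
	"io"
	"os"

	"golang.org/x/sys/unix"
)

// copyRange copies size bytes from src to dst via copy_file_range, which
// lets the kernel deduplicate the data on disk where the filesystem
// supports it, without the two files sharing an inode.
func copyRange(srcPath, dstPath string, size int64) error {
	src, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.OpenFile(dstPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	defer dst.Close()

	for size > 0 {
		// Passing nil offsets makes the kernel use and advance the
		// current file offsets of both descriptors.
		n, err := unix.CopyFileRange(int(src.Fd()), nil, int(dst.Fd()), nil, int(size), 0)
		if err != nil {
			return err
		}
		if n == 0 {
			return io.ErrUnexpectedEOF // source shorter than expected
		}
		size -= int64(n)
	}
	return nil
}
```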


Another wrinkle is that there's no guarantee that all shared folders are on the same storage volume. A hard link can only point to another file within the same filesystem, so handling the exceptions could get messy (especially in the UI).
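A hypothetical precondition check (Unix-only sketch) could compare device IDs before even attempting a link:

```go
package sketch

import (
	"os"
	"syscall"
)

// sameFilesystem reports whether two paths live on the same filesystem by
// comparing their device IDs; a hard link is only possible when they do.
func sameFilesystem(a, b string) (bool, error) {
	sa, err := os.Stat(a)
	if err != nil {
		return false, err
	}
	sb, err := os.Stat(b)
	if err != nil {
		return false, err
	}
	da, okA := sa.Sys().(*syscall.Stat_t)
	db, okB := sb.Sys().(*syscall.Stat_t)
	if !okA || !okB {
		return false, nil // not a Unix stat; can't tell this way
	}
	return da.Dev == db.Dev, nil
}
```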

Thanks for your consideration and responses. Seems like there are more pitfalls than possible gains.