There is one significant use case that I cannot yet use Syncthing for: keeping an off-site backup of my on-site backup. My backup is a simple filesystem sync made with rsync, and between snapshots, there are a lot of hard links. Using Syncthing as-is would multiply the size of the backup repository by … I don’t know … a lot. Of course, I could “just” do another rsync offsite, but … Any hope hard links could be a feature in a future release?
If it isn’t obvious: I won’t move off of rsync to be compatible with volumes that don’t support hard links.
I realize that adding a single volume that does not support hard links to the peers would break everything. I also realize I’m wishing, but I would like Syncthing to check if the local volume supports hard links and, if not, simply refuse to synchronize with a shared folder that has hard links enabled (so yeah, there would hypothetically be two kinds of folders now, with and without hard links). I’d even settle for running a hardlink-supporting-volume-only fork of Syncthing. Maybe I’ll try vibe coding that someday.
Very unlikely, given all the potential complications and the niche nature of it. A copy-on-write filesystem, with the corresponding options enabled in Syncthing, might be a valid workaround.
I’m slightly confused by this, because in theory you ought to be able to map a hard-linked file to an already existing, already hashed file path, right? Is there something preventing the use of a hard link lookup table to avoid re-hashing a known hard-linked file?
At least on Unix-like filesystems, each file has an inode number and a link count, yes? If two paths point to the same inode (i.e., hard links), you can store the hash once, keyed by inode, and re-use it for all hard links. That would let Syncthing avoid re-hashing unchanged hard-linked files, saving CPU and I/O. You can optimize database storage by only ever storing inodes that have more than one reference, so you don’t store extra inode data in the database.
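To make the idea concrete, here is a minimal Python sketch of a single scan that hashes each hard-linked inode only once. It is not how Syncthing works; `hash_file` and `hash_tree` are hypothetical names, SHA-256 is an arbitrary choice, and the lookup table lives only for the duration of one scan.

```python
import hashlib
import os
import stat


def hash_file(path, chunk_size=1 << 20):
    """Read the file once and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_tree(root):
    """Return ({path: digest}, number of files actually read).

    Assumption: the table is only trusted within this one scan.
    """
    by_inode = {}  # (st_dev, st_ino) -> digest, only for st_nlink > 1
    digests = {}
    reads = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, devices
            key = (st.st_dev, st.st_ino)
            if st.st_nlink > 1 and key in by_inode:
                digests[path] = by_inode[key]  # re-use, no second read
                continue
            digest = hash_file(path)
            reads += 1
            if st.st_nlink > 1:
                by_inode[key] = digest  # remember multi-link inodes only
            digests[path] = digest
    return digests, reads
```

Files with a link count of 1 never enter the table, which matches the storage optimization described above.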
I’m sure I’m missing implementation details, but I’m not sure why it couldn’t be handled the same way rsync solves the problem?
P.S. I also got led here after realizing Syncthing is failing me for more or less the same use case the OP pointed out.
This is far more complex than it might sound at first. Some optimizations are certainly possible, but not everything is, and it gets complicated really fast.
For starters, there’s the obvious cross-platform problem: Is this solution Linux-only, or should/can it also be supported on other platforms? If we leave that aside, we quickly realize that userspace has very little inode information in Linux.
The inode information reported by e.g. stat in Linux is passed through as reported by the filesystem. This has several implications:
Firstly, inode numbers are only guaranteed to be unique per filesystem, so when comparing inodes for equality you always have to scope the comparison to a single filesystem. Figuring out whether two paths are on the same filesystem is in itself not entirely trivial, and the answer can change at any moment as filesystems are mounted or unmounted along the path. You always open yourself up to TOCTOU races that cannot be fixed from userland.
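As a rough illustration in Python, the usual way to scope the comparison is to pair the inode number with the device id that `stat` reports. `same_filesystem` and `inode_key` are hypothetical helper names, and the result is only a snapshot: a mount or unmount right after the call can invalidate it.

```python
import os


def same_filesystem(path_a, path_b):
    """Best-effort check: do both paths report the same st_dev right now?

    The answer can be stale as soon as it is returned (TOCTOU).
    """
    return os.lstat(path_a).st_dev == os.lstat(path_b).st_dev


def inode_key(path):
    """Return (st_dev, st_ino); a bare st_ino is ambiguous across filesystems."""
    st = os.lstat(path)
    return (st.st_dev, st.st_ino)
```

Two paths that are hard links to each other yield the same `inode_key` at the moment of the call; nothing here says anything about a later point in time.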
Next, there is no requirement for filesystems to report unchanging/consistent inode numbers: A filesystem may recycle inode numbers as it pleases. For example, if a file (and its associated inode) has been deleted, the filesystem is free to re-use that inode number for a new file. So just because two files on the same filesystem have the same inode number at different points in time, that doesn’t mean they’re the same file now. The match is only valid if you’ve re-checked that the inode numbers are still the same at this point in time, which, again, is subject to TOCTOU races.
Likewise, a filesystem may choose to be stateless, and report different inodes every time for the same file (some network filesystems do this). The Linux kernel itself is protected against inode confusion by using internal inode ids (assigned by the kernel) for which the filesystem is required to provide consistency as long as the kernel has caches for that file. However, these “internal inode ids” are not exposed to userland, and are in-memory only: They change on every reboot.
You can use inode ids for short-term sanity checks: If you stat two files within a very short timeframe, and both report the same metadata (inode id, file size, timestamp) you can assume that they’re the same file with a reasonable probability (still possible TOCTOU, but small). This is what’s typically done. What you cannot do reliably is use the inode id for any kind of “this is still the same file” long-term: That doesn’t fly from userland, too many things can change without the software knowing about it.
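A minimal Python sketch of such a short-term check, assuming the metadata to compare is device, inode, size and modification time. `probably_same_file` is a hypothetical name, and a `True` result is a probability, not a guarantee.

```python
import os


def probably_same_file(path_a, path_b):
    """Stat both paths back to back and compare their metadata.

    os.path.samestat compares st_dev and st_ino; size and mtime are
    extra sanity checks. A race between the two lstat calls is still
    possible, just unlikely.
    """
    st_a = os.lstat(path_a)
    st_b = os.lstat(path_b)
    return (
        os.path.samestat(st_a, st_b)
        and st_a.st_size == st_b.st_size
        and st_a.st_mtime_ns == st_b.st_mtime_ns
    )
```

The two `lstat` calls are deliberately adjacent so that the window for a change in between is as small as userland can make it.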
What you may be able to do is scan a file, remember its inode metadata, and then if you see the exact same inode metadata shortly thereafter there’s a certain probability that you’re looking at the same file (though again, never guaranteed). The longer the time between the comparisons, the higher the uncertainty though.
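One way to picture this is an in-memory cache whose entries expire quickly. The Python sketch below is hypothetical throughout (`Fingerprint`, `ScanCache`, and the `max_age_seconds` cut-off are all made up for illustration), and a hit still only means "probably the same file".

```python
import os
import time
from typing import NamedTuple, Optional


class Fingerprint(NamedTuple):
    dev: int
    ino: int
    size: int
    mtime_ns: int


def fingerprint(path):
    """Collect the inode metadata that is compared between sightings."""
    st = os.lstat(path)
    return Fingerprint(st.st_dev, st.st_ino, st.st_size, st.st_mtime_ns)


class ScanCache:
    """Remember digests by fingerprint, but distrust them once they age."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}  # Fingerprint -> (digest, time stored)

    def remember(self, path, digest):
        self._entries[fingerprint(path)] = (digest, self._clock())

    def lookup(self, path) -> Optional[str]:
        fp = fingerprint(path)
        entry = self._entries.get(fp)
        if entry is None:
            return None  # never seen, or metadata changed
        digest, stored_at = entry
        if self._clock() - stored_at > self._max_age:
            del self._entries[fp]  # too old to trust, force a re-hash
            return None
        return digest
```

The shorter the cut-off, the lower the chance that an inode number was recycled in between, at the cost of more re-hashing.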