Content Defined Chunking

Very interesting approach:


I did some experiments with CDC a few years back, and my conclusion is that it’s of limited utility for Syncthing…

Yep, it's already been in use by borg and restic for ages. Our weak hashing when pulling does something similar to find re-usable bits in the old file; it's just not used to define the blocks themselves. Content-defined blocks would likely be a small optimisation, as shifted blocks between files might be re-used on sync (no weak hashing happens there). The benefit is probably pretty small: shifted data within the same file saves about 0.9% of transferred data according to data.syncthing.net, and I'd expect the gain between files to be much smaller (likely what Jakob found in his experiments). At the same time file hashing would become more expensive overall, as an additional rolling hash needs to run (it might be a minuscule change, though, as rolling hashes are generally a lot faster than SHA256 - maybe less so with hardware acceleration, I don't know).
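
To illustrate what that extra rolling hash would be doing, here's a minimal sketch of gear-style content-defined chunking in Go. This is not Syncthing code; the gear table, mask and size limits are invented for the example. The hash is updated once per byte, and a chunk boundary is declared whenever its low bits hit a fixed pattern, within minimum/maximum chunk sizes.

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	minChunk = 2 << 10       // don't cut before 2 KiB
	maxChunk = 64 << 10      // force a cut at 64 KiB
	mask     = (1 << 13) - 1 // low 13 bits -> roughly 8 KiB average chunk size
)

// gear maps every byte value to a pseudo-random 64-bit constant. Both devices
// must use the identical table (and the same mask/size limits), or they will
// pick different cut points for the same content.
var gear = func() (g [256]uint64) {
	r := rand.New(rand.NewSource(42)) // fixed seed keeps the table deterministic
	for i := range g {
		g[i] = r.Uint64()
	}
	return
}()

// cutPoints returns the offsets at which data would be split into chunks.
func cutPoints(data []byte) []int {
	var cuts []int
	start := 0
	var h uint64
	for i, b := range data {
		h = (h << 1) + gear[b] // rolling hash: old bytes age out as they shift off the top
		size := i - start + 1
		// Cut when the low bits of the hash are all zero, but respect the
		// minimum and maximum chunk sizes.
		if (size >= minChunk && h&mask == 0) || size >= maxChunk {
			cuts = append(cuts, i+1)
			start, h = i+1, 0
		}
	}
	if start < len(data) {
		cuts = append(cuts, len(data)) // trailing remainder becomes the last chunk
	}
	return cuts
}

func main() {
	data := make([]byte, 256<<10)
	rand.New(rand.NewSource(1)).Read(data)
	fmt.Println(cutPoints(data))
}
```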

Also, we're very dependent on the chunking being consistent between two devices, or no block reuse will happen at all. Some CDC algorithms are more deterministic than others, but any gain can be offset by the loss from a file suddenly being chunked differently than it was before.
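
To make that concern concrete, here's a small standalone experiment (again with hypothetical parameters, not Syncthing code): chunk the same data with an 8-byte prefix inserted, once with the same chunker parameters and once with a different gear table, and count how many chunk hashes are reused. With matching parameters nearly everything realigns after the first boundary; with mismatched parameters essentially nothing does.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

const (
	minChunk = 2 << 10
	maxChunk = 64 << 10
	mask     = (1 << 13) - 1
)

// newGear builds the per-byte table from a seed; the seed stands in for the
// "chunker parameters" that have to match between devices.
func newGear(seed int64) (g [256]uint64) {
	r := rand.New(rand.NewSource(seed))
	for i := range g {
		g[i] = r.Uint64()
	}
	return
}

// chunkHashes splits data with a gear rolling hash and returns the set of
// SHA-256 hashes of the resulting chunks.
func chunkHashes(data []byte, g [256]uint64) map[[32]byte]bool {
	out := map[[32]byte]bool{}
	start := 0
	var h uint64
	for i, b := range data {
		h = (h << 1) + g[b]
		size := i - start + 1
		if (size >= minChunk && h&mask == 0) || size >= maxChunk {
			out[sha256.Sum256(data[start:i+1])] = true
			start, h = i+1, 0
		}
	}
	if start < len(data) {
		out[sha256.Sum256(data[start:])] = true
	}
	return out
}

// reused counts how many chunk hashes of a also appear in b.
func reused(a, b map[[32]byte]bool) int {
	n := 0
	for k := range a {
		if b[k] {
			n++
		}
	}
	return n
}

func main() {
	data := make([]byte, 1<<20)
	rand.New(rand.NewSource(7)).Read(data)
	shifted := append(make([]byte, 8), data...) // same content, shifted by 8 bytes

	orig := chunkHashes(data, newGear(42))
	fmt.Println("same parameters:     ", reused(orig, chunkHashes(shifted, newGear(42))), "of", len(orig), "chunks reused")
	fmt.Println("different gear table:", reused(orig, chunkHashes(shifted, newGear(43))), "of", len(orig), "chunks reused")
}
```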
