Content Defined Chunking

Very interesting approach:


I did some experiments with CDC a few years back, and my conclusion is that it’s of limited utility for Syncthing…

Yep, it's already been in use by borg and restic for ages. Our weak hashing when pulling does something similar to find re-usable bits in the old file; it's just not used to define the blocks themselves. Content-defined blocks would likely be a small optimisation, as shifted blocks between files might be re-used on sync (no weak hashing happens there). The benefit is probably pretty small: shifted data within the same file saves about 0.9% of transferred data according to data.syncthing.net, and I'd expect the gain between files to be much smaller (likely what Jakob found in his experiments). At the same time file hashing would become more expensive overall, as an additional rolling hash needs to run (it might be a minuscule change, though, as rolling hashes are generally a lot faster than SHA256 - maybe less so with hardware acceleration, I don't know).
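
To illustrate what that extra rolling hash would be doing, here's a minimal sketch of gear-style content-defined chunking in Go. This is not Syncthing code; the gear table, mask and size limits are invented for the example. The hash is updated once per byte, and a chunk boundary is declared whenever its low bits hit a fixed pattern, within minimum/maximum chunk sizes.

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	minChunk = 2 << 10       // don't cut before 2 KiB
	maxChunk = 64 << 10      // force a cut at 64 KiB
	mask     = (1 << 13) - 1 // low 13 bits -> roughly 8 KiB average chunk size
)

// gear maps every byte value to a pseudo-random 64-bit constant. Both devices
// must use the identical table (and the same mask/size limits), or they will
// pick different cut points for the same content.
var gear = func() (g [256]uint64) {
	r := rand.New(rand.NewSource(42)) // fixed seed keeps the table deterministic
	for i := range g {
		g[i] = r.Uint64()
	}
	return
}()

// cutPoints returns the offsets at which data would be split into chunks.
func cutPoints(data []byte) []int {
	var cuts []int
	start := 0
	var h uint64
	for i, b := range data {
		h = (h << 1) + gear[b] // rolling hash: old bytes age out as they shift off the top
		size := i - start + 1
		// Cut when the low bits of the hash are all zero, but respect the
		// minimum and maximum chunk sizes.
		if (size >= minChunk && h&mask == 0) || size >= maxChunk {
			cuts = append(cuts, i+1)
			start, h = i+1, 0
		}
	}
	if start < len(data) {
		cuts = append(cuts, len(data)) // trailing remainder becomes the last chunk
	}
	return cuts
}

func main() {
	data := make([]byte, 256<<10)
	rand.New(rand.NewSource(1)).Read(data)
	fmt.Println(cutPoints(data))
}
```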

Also, we're very dependent on the chunking being consistent between two devices, or no block reuse will happen at all. Some CDC algorithms are more deterministic than others, but any gain can be offset by the loss from a file suddenly being chunked differently than it was before.
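
To make that concern concrete, here's a small standalone experiment (again with hypothetical parameters, not Syncthing code): chunk the same data with an 8-byte prefix inserted, once with the same chunker parameters and once with a different gear table, and count how many chunk hashes are reused. With matching parameters nearly everything realigns after the first boundary; with mismatched parameters essentially nothing does.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

const (
	minChunk = 2 << 10
	maxChunk = 64 << 10
	mask     = (1 << 13) - 1
)

// newGear builds the per-byte table from a seed; the seed stands in for the
// "chunker parameters" that have to match between devices.
func newGear(seed int64) (g [256]uint64) {
	r := rand.New(rand.NewSource(seed))
	for i := range g {
		g[i] = r.Uint64()
	}
	return
}

// chunkHashes splits data with a gear rolling hash and returns the set of
// SHA-256 hashes of the resulting chunks.
func chunkHashes(data []byte, g [256]uint64) map[[32]byte]bool {
	out := map[[32]byte]bool{}
	start := 0
	var h uint64
	for i, b := range data {
		h = (h << 1) + g[b]
		size := i - start + 1
		if (size >= minChunk && h&mask == 0) || size >= maxChunk {
			out[sha256.Sum256(data[start:i+1])] = true
			start, h = i+1, 0
		}
	}
	if start < len(data) {
		out[sha256.Sum256(data[start:])] = true
	}
	return out
}

// reused counts how many chunk hashes of a also appear in b.
func reused(a, b map[[32]byte]bool) int {
	n := 0
	for k := range a {
		if b[k] {
			n++
		}
	}
	return n
}

func main() {
	data := make([]byte, 1<<20)
	rand.New(rand.NewSource(7)).Read(data)
	shifted := append(make([]byte, 8), data...) // same content, shifted by 8 bytes

	orig := chunkHashes(data, newGear(42))
	fmt.Println("same parameters:     ", reused(orig, chunkHashes(shifted, newGear(42))), "of", len(orig), "chunks reused")
	fmt.Println("different gear table:", reused(orig, chunkHashes(shifted, newGear(43))), "of", len(orig), "chunks reused")
}
```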
