Better delta sync algorithm (like rsync)

calmh · October 20, 2014, 7:11pm

So I’ve spent some time thinking about this. Does anyone have a real use case where this matters today? As far as I can see, there are the following cases:

A file is appended to as it’s being worked with. Handled efficiently today. Examples: log files, journal files, …
A file has random blocks changed within it. Handled efficiently today. Examples: VM disk images, database files (together with append as above), …
A file is completely re-encoded by any change. Examples: photos and movies, compressed or encrypted files. Nothing we can do here.
A file has information inserted at some point and the following content shifted forward. Not handled efficiently today. Also not efficient for the writing application to begin with, since everything after the insertion point has to be rewritten. Examples: Word documents and similar.
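
To make the difference between these cases concrete, here is a minimal sketch of fixed-size block comparison by position. The 4-byte block size, the function names and the sample data are all made up for illustration; this is not the actual implementation, just the general idea of hashing blocks at fixed offsets.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny, illustration only (assumption)


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash each fixed-size block of data, in order."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Indexes of blocks in `new` that differ from the block at the same
    position in `old` (or that lie past the end of `old`)."""
    old_h = block_hashes(old, block_size)
    new_h = block_hashes(new, block_size)
    return [
        i for i, h in enumerate(new_h)
        if i >= len(old_h) or old_h[i] != h
    ]


original = b"AAAABBBBCCCCDDDD"
appended = original + b"EEEE"      # case 1: append
in_place = b"AAAAXXXXCCCCDDDD"     # case 2: block changed in place
inserted = b"Z" + original         # case 4: insert, rest shifted forward

print(changed_blocks(original, appended))  # [4]
print(changed_blocks(original, in_place))  # [1]
print(changed_blocks(original, inserted))  # [0, 1, 2, 3, 4]
```

One inserted byte at the front shifts every block boundary, so every block hash changes and the whole file gets resent. A rolling checksum, as rsync uses, is what avoids that.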

As far as I can tell, the last case only happens for small files where you’re rewriting the whole file anyway, because the change would otherwise be prohibitively expensive to make at all. For rsync it’s relevant because rsync dates from 1996, and back then transmitting a whole 100 KB file just because someone added a few bytes to the beginning was a big deal. Today? Meh. Just send it.

Counterexamples?

(If anything, I think using a Merkle tree with smaller leaf blocks (4 KB?) might be a bigger win. It would also make the full index exchange more efficient, since only the top-level hashes need to be exchanged initially.)
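
A rough sketch of what the Merkle tree idea could look like, assuming a binary tree over SHA-256 hashes of 4 KB leaf blocks. The names `build_tree` and `differing_leaves` are hypothetical, and the comparison assumes both sides have the same number of leaves. In a real exchange each level would be requested from the peer on demand; here both trees are simply held in memory.

```python
import hashlib

LEAF_SIZE = 4096  # the "4 KB?" leaf size floated above (assumption)


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(data: bytes, leaf_size: int = LEAF_SIZE) -> list:
    """Return the tree as a list of levels: levels[0] holds the leaf
    hashes, levels[-1] holds the single root hash."""
    leaves = [
        _h(data[i:i + leaf_size]) for i in range(0, len(data), leaf_size)
    ] or [_h(b"")]
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Pair up nodes; an odd node at the end is hashed on its own.
        levels.append([
            _h(prev[i] + (prev[i + 1] if i + 1 < len(prev) else b""))
            for i in range(0, len(prev), 2)
        ])
    return levels


def differing_leaves(a: list, b: list) -> list:
    """Walk down from the root, following only subtrees whose hashes
    differ, and return the indexes of the leaves that differ."""
    if len(a[0]) != len(b[0]):
        raise ValueError("sketch only handles trees with equal leaf counts")
    top = len(a) - 1
    if a[top][0] == b[top][0]:
        return []  # roots match: nothing below needs to be exchanged
    suspects = [0]
    for level in range(top, 0, -1):
        below_a, below_b = a[level - 1], b[level - 1]
        suspects = [
            child
            for idx in suspects
            for child in (2 * idx, 2 * idx + 1)
            if child < len(below_a) and below_a[child] != below_b[child]
        ]
    return suspects


old = bytes(8 * LEAF_SIZE)          # 8 leaves of zeros
new = bytearray(old)
new[5 * LEAF_SIZE + 10] = 1         # flip one byte inside leaf 5
old_tree, new_tree = build_tree(old), build_tree(bytes(new))
print(differing_leaves(old_tree, new_tree))  # [5]
```

For an unchanged file only the root hash has to be compared, and for a single changed block the number of hashes looked at grows with the depth of the tree rather than with the number of blocks.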