Better delta sync algorithm (like rsync)

Hello!

If a file is modified, only the changed blocks are transferred, but it seems that if we add some data at the beginning of the file, the whole file is uploaded.

Suppose this:

# I generate a file (about 40 MB) in my sync directory.
dd if=/dev/urandom of=rand.bin bs=4096 count=10000

I wait until synchronization has finished on my other computer…

# Then I add 2 bytes at the beginning of the file.
sed -i -b '1ia' rand.bin

Now the synchronization takes a long time. I think the whole file is uploaded in this example, not just the 2 bytes added to rand.bin.
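To illustrate what I suspect is happening, here is a rough sketch in Go (not Syncthing’s actual code, just my guess at the mechanism): if the file is split into fixed-size blocks and each block is identified by its hash, then prepending even 2 bytes shifts every block boundary, so every block hash changes and every block has to be re-sent.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// Hypothetical fixed block size, matching what I understand Syncthing uses.
const blockSize = 128 * 1024

// blockHashes hashes each fixed-size block of data.
func blockHashes(data []byte) [][32]byte {
	var hashes [][32]byte
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		hashes = append(hashes, sha256.Sum256(data[off:end]))
	}
	return hashes
}

func main() {
	// Half a megabyte of random data, standing in for rand.bin.
	oldData := make([]byte, 4*blockSize)
	rand.Read(oldData)

	// Prepend two bytes, like the sed command above.
	newData := append([]byte("a\n"), oldData...)

	oldHashes := blockHashes(oldData)
	newHashes := blockHashes(newData)

	// Count how many of the new blocks we already have: none, because
	// every block boundary has shifted by two bytes.
	changed := 0
	for i, h := range newHashes {
		if i >= len(oldHashes) || !bytes.Equal(h[:], oldHashes[i][:]) {
			changed++
		}
	}
	fmt.Printf("%d of %d blocks must be re-sent\n", changed, len(newHashes))
}
```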

Maybe this should be filed as a bug report or a feature improvement request? I added a reference here:

Thanks!

I think rsync would suffer from the same issue, as it uses the same approach of splitting files into small blocks, hashing the blocks and then transferring only the blocks which are different.

Syncthing/Pulse works the same way; it’s just that the block size is fixed, whereas rsync chooses the block size based on the file size, which probably makes it better optimized. It’s also much faster, as it uses MD4 rather than SHA256.

I am saying this after reading the following: https://github.com/librsync/librsync/issues/6

Perhaps it’s a bit more clever about how it computes the blocks, and given a small block size and a fast hashing algorithm it can afford to hash every mismatched block N times by advancing 1 bit at a time, where N is the block size (or perhaps it does it backwards and forwards to detect offsets), but with SHA256 I guess your CPU would kill itself before it rehashed the same 128 KB block 1<<20 times.

A Python implementation of rdiff, for reference:

Edit: it uses a rolling checksum to detect diffs in between blocks; you can’t do that with SHA256.
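For reference, the weak rolling checksum is just a pair of running sums that can be slid forward one byte in O(1), roughly like this (a sketch of the rsync-style checksum, not librsync’s actual code):

```go
package rolling

// Weak rsync-style rolling checksum: s1 is the sum of the bytes in the
// window, s2 is the sum of the running s1 values. Both can be updated in
// O(1) when the window slides one byte, which SHA256 cannot do.
type Sum struct {
	s1, s2 uint32
	n      uint32 // window (block) size in bytes
}

// New computes the checksum of the initial window.
func New(window []byte) *Sum {
	s := &Sum{n: uint32(len(window))}
	for i, b := range window {
		s.s1 += uint32(b)
		s.s2 += uint32(len(window)-i) * uint32(b)
	}
	return s
}

// Roll slides the window one byte forward: out leaves, in enters.
func (s *Sum) Roll(out, in byte) {
	s.s1 += uint32(in) - uint32(out)
	s.s2 += s.s1 - s.n*uint32(out)
}

// Value combines the two sums into a 32-bit checksum.
func (s *Sum) Value() uint32 {
	return s.s2<<16 | s.s1&0xffff
}
```

That is what lets rsync check a candidate block at every byte offset cheaply, falling back to the strong hash only when the weak checksum matches.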

But then again, Syncthing/Pulse has other advantages, such as being able to reuse blocks from other files. So if you create rand.bin, copy rand.bin to rand.bin2 and append (not prepend) a byte to rand.bin2, only a single 128 KB block would get transferred, whereas with rsync it would have to transfer the whole file unless you explicitly specified one file as a copy of the other.
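A quick sketch of why that works (illustrative Go, not the real implementation): if you index the blocks you already have by their hash, regardless of which file they came from, then all but the final block of rand.bin2 are already available locally.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

const blockSize = 128 * 1024

// hashBlocks returns the SHA256 of every fixed-size block of data.
func hashBlocks(data []byte) [][32]byte {
	var out [][32]byte
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		out = append(out, sha256.Sum256(data[off:end]))
	}
	return out
}

func main() {
	// rand.bin and a copy with one byte appended (not prepended).
	randBin := make([]byte, 8*blockSize)
	rand.Read(randBin)
	randBin2 := append(append([]byte{}, randBin...), 'x')

	// Index all blocks we already have locally, from any file.
	have := map[[32]byte]bool{}
	for _, h := range hashBlocks(randBin) {
		have[h] = true
	}

	// Only blocks of rand.bin2 that are not in the index need to be
	// fetched over the network: here, just the final short block.
	missing := 0
	for _, h := range hashBlocks(randBin2) {
		if !have[h] {
			missing++
		}
	}
	fmt.Println("blocks to transfer:", missing)
}
```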

My guess is that being able to send an incremental update of a changed file is going to be more useful than being able to copy the same block from a different file.

Consider the files that people are syncing rather than just transferring once: they are often files that are being edited, and people want them updated because they might edit them somewhere else. And most of these files aren’t going to be edited just at the end.

In what use case are there going to be files which are identical except for their end blocks? Seems rare, but maybe I just have no imagination! :confused:

Hope syncthing becomes a great synchronizer! It is promising.

It does send an incremental diff… as long as you are appending or replacing data, which covers most of the cases I ever hit. Even for VM images I usually replace some bytes in the middle of the file, which syncthing handles well.

I don’t think there is anything we can do: since there isn’t a cryptographically secure hashing algorithm which supports rolling, it would be quite hard to implement something that gives the best of both worlds.

It’s not impossible, but it’s not what BEP (the Block Exchange Protocol) was designed for, as there would no longer be blocks, just rdiff deltas…

So I’ve spent some time thinking about this. Does anyone have any real use cases where this matters today? Basically I can see the following cases:

  • A file is appended to as it’s being worked with. Handled efficiently today. Examples: log files, journal files, …

  • A file has random blocks changed within it. Handled efficiently today. Examples: VM disk images, database files (together with append as above), …

  • A file is completely re-encoded by any change. Examples: photos and movies, compressed or encrypted files. Nothing we can do here.

  • A file has information inserted at some point and the following content moved forward. Not handled efficiently today. Also, not efficient to start with. Examples: Word documents and similar.

Basically, as far as I can tell, the last case only happens for small files where you’re rewriting the whole file anyway, because it would be prohibitively expensive to make the change any other way. For rsync that was relevant because it’s something like 30 years old, and back then transmitting a whole 100 KB file just because someone added a few bytes to the beginning was a big deal. Today? Meh. Just send it.

Counterexamples?

(If anything, I think using a Merkle tree and having the leaf blocks be smaller (4 KB?) might be a larger win. It would also make the full index exchange more efficient since only the top level hashes need to be exchanged initially.)
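To make that concrete, here is a rough sketch of the kind of two-level scheme I mean (the 4 KB leaves and 32-leaves-per-group numbers are arbitrary, and none of this exists in Syncthing today): only the group hashes would go in the initial index exchange, and a peer would only request the leaf hashes for groups whose top-level hash differs.

```go
package merkle

import "crypto/sha256"

const (
	leafSize  = 4 * 1024 // hypothetical 4 KB leaf blocks
	groupSize = 32       // hypothetical: 32 leaves (128 KB of data) per top-level hash
)

// leafHashes hashes every leaf block of a file.
func leafHashes(data []byte) [][32]byte {
	var leaves [][32]byte
	for off := 0; off < len(data); off += leafSize {
		end := off + leafSize
		if end > len(data) {
			end = len(data)
		}
		leaves = append(leaves, sha256.Sum256(data[off:end]))
	}
	return leaves
}

// topHashes hashes each group of leaf hashes. Only these top-level hashes
// would be sent in the initial index; a peer drills down into a group's
// leaf hashes only when its top-level hash doesn't match.
func topHashes(leaves [][32]byte) [][32]byte {
	var tops [][32]byte
	for i := 0; i < len(leaves); i += groupSize {
		end := i + groupSize
		if end > len(leaves) {
			end = len(leaves)
		}
		h := sha256.New()
		for _, leaf := range leaves[i:end] {
			h.Write(leaf[:])
		}
		var top [32]byte
		copy(top[:], h.Sum(nil))
		tops = append(tops, top)
	}
	return tops
}
```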

That doesn’t solve this specific issue though, and it still makes index exchanges expensive when something does change, as you’ll be sending multiple levels of hashes, waiting for acks to see whether they match, processing more requests for the next level, and so on, in order to work out which branch of the tree has changed.

I guess we could build a protocol based on rdiff deltas with what we already have… but I’d prefer the current implementation, as you can do much cooler stuff this way (such as encryption).

Yeah, I think the protocol is mostly fine as it is. I was more thinking that if the goal is to minimize the data transmitted, 128 KB blocks may be a bit on the large side. But I’m still not convinced this issue (the one requiring a rolling checksum) is worth caring about today.

I was planning to post a topic proposing Merkle trees, but I guess you’re already aware :smile:.

The Cryptography Stack Exchange topic is way beyond my understanding, but Tarsnap has variable-sized blocks and is hopefully secure. My understanding is that it simply uses a deterministic algorithm to determine where a block ends based on the data itself, and then hashes each block with whatever function is desired. Maybe I don’t understand where the security issue actually lies.

I always hated that Dropbox doesn’t have rolling checksums, and my specific reason is Photoshop files. Say you have one that is 100 MB in size, which I don’t think is uncommon. Simply renaming a layer shifts the bytes, requiring me to re-upload 100 MB over a DSL3000 connection. On the other hand, rewriting the file like that is not unreasonable for Photoshop to do on the client, especially in the age of SSDs.

If it can be done securely, I don’t see a good reason not to support it.

There are several issues here that are being conflated:

  1. Dividing data into blocks.
  2. Uniquely identifying blocks.
  3. Indexing blocks by hash.

I think the original suggestion was to use an rsync-like rolling checksum to accomplish block splitting (item 1). Once you have the blocking you want, you can use any algorithm you like for checksumming, e.g. SHA256 (item 2). You use a cheap (and very, very fast) rolling checksum for splitting while still identifying blocks by a more expensive cryptographic hash. Merkle trees are also somewhat orthogonal to block splitting and block identity. These three things can all be used together, or not at all.
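To sketch what item 1 could look like (purely illustrative; the Gear-style hash, the table seed and the size limits are arbitrary choices here, not anything Syncthing or librsync actually uses): a boundary is declared wherever a cheap rolling hash of the recent bytes hits a fixed pattern, so the boundaries depend only on the content near them, and inserting data at the front of a file doesn’t move the later boundaries.

```go
package cdc

import "math/rand"

// gear is a table of pseudo-random values, one per possible byte value,
// used by the rolling hash below (Gear-style, as in FastCDC).
var gear = func() [256]uint64 {
	var t [256]uint64
	r := rand.New(rand.NewSource(1)) // fixed seed: both sides must agree on the table
	for i := range t {
		t[i] = r.Uint64()
	}
	return t
}()

const (
	minBlock = 16 * 1024       // never cut before this many bytes
	maxBlock = 256 * 1024      // always cut by this many bytes
	mask     = (1 << 17) - 1   // cut when the low 17 bits are zero (~128 KB average)
)

// Split returns content-defined block boundaries (end offsets) for data.
// Because a boundary depends only on the bytes just before it, inserting
// data near the start of the file does not move the later boundaries.
func Split(data []byte) []int {
	var cuts []int
	var h uint64
	start := 0
	for i, b := range data {
		h = (h << 1) + gear[b]
		size := i - start + 1
		if (size >= minBlock && h&mask == 0) || size >= maxBlock {
			cuts = append(cuts, i+1)
			start = i + 1
			h = 0
		}
	}
	if start < len(data) {
		cuts = append(cuts, len(data))
	}
	return cuts
}
```

Each resulting block would still be identified by SHA256 (item 2), so the security-relevant hashing stays exactly as it is; the rolling hash only decides where blocks start and end.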

I myself believe that using a rolling checksum for block splitting would be immensely useful and would work very well for syncthing.

That would actually work, but it would require a change in the protocol. The file index would have to contain two sets of block hashes: one set of SHA256 hashes and one set of rolling hashes. Then the receiver could assemble the file by doing rolling-hash checks for the blocks it doesn’t yet have.
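Roughly this shape for the index is what I mean (a sketch only; the type and field names are made up, not the actual BEP message definitions):

```go
// Hypothetical index structures, not the real BEP messages.

// IndexBlock describes one block of a file as announced in the index.
type IndexBlock struct {
	Offset  int64  // position of the block within the file
	Size    int32  // block length in bytes
	SHA256  []byte // strong hash: identifies the block's content
	Rolling uint32 // weak rolling hash: lets the receiver find the block at a new offset
}

// IndexFile is one file entry in the index.
type IndexFile struct {
	Name   string
	Size   int64
	Blocks []IndexBlock
}
```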

Actually, I think it would be simpler than that. You wouldn’t even need to keep the rolling sums. Once you have the new variable-size blocking, everything else can stay the same: just compare the SHA hashes of all blocks from sender and receiver, and send over the missing blocks, just like you do now. I’m actually tinkering with such an implementation. Other than a protocol bump, are there other areas of the code base that wouldn’t play well with variable-sized blocks?

Well, even rdiff uses a fixed block size; it’s just that the block size is proportional to the file size.

I guess you would work out which new blocks have been inserted on the sender side by keeping SHA256 + rolling hashes indexed locally, and then just inserting the (potentially smaller) blocks in the missing places that were not found by a rolling hash?
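Something like this, roughly (a sketch of the rsync-style scan with made-up names, not anything that exists in the code base): slide the weak hash over the changed file, and wherever it matches a locally indexed block, confirm with SHA256 and reference that block; whatever falls in between is the newly inserted data that has to be sent.

```go
package delta

import "crypto/sha256"

const blockSize = 128 * 1024

// weak is a stand-in for a real rolling checksum; a real implementation
// would update it in O(1) per byte instead of recomputing the window.
func weak(window []byte) uint32 {
	var s uint32
	for _, b := range window {
		s += uint32(b)
	}
	return s
}

// Known is a block that is already present in the local index.
type Known struct {
	Weak   uint32
	Strong [32]byte
}

// Op is either a reference to a known block (Block >= 0) or literal data.
type Op struct {
	Block   int
	Literal []byte
}

// Delta scans newData for blocks that are already known (weak hash first,
// then confirmed by SHA256) and emits literals for the gaps in between.
func Delta(newData []byte, known []Known) []Op {
	byWeak := map[uint32][]int{}
	for i, k := range known {
		byWeak[k.Weak] = append(byWeak[k.Weak], i)
	}

	var ops []Op
	lit := 0 // start of the current unmatched (literal) region
	i := 0
	for i+blockSize <= len(newData) {
		window := newData[i : i+blockSize]
		match := -1
		for _, idx := range byWeak[weak(window)] {
			if sha256.Sum256(window) == known[idx].Strong {
				match = idx
				break
			}
		}
		if match < 0 {
			i++ // no known block starts here; keep sliding byte by byte
			continue
		}
		if lit < i {
			ops = append(ops, Op{Block: -1, Literal: newData[lit:i]})
		}
		ops = append(ops, Op{Block: match})
		i += blockSize
		lit = i
	}
	if lit < len(newData) {
		ops = append(ops, Op{Block: -1, Literal: newData[lit:]})
	}
	return ops
}
```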

I know there are a few places that rely on the block size, but there shouldn’t be too many. You can grep for them with protocol.BlockSize.

I think it’s mostly stuff like file size approximation.