Noob question - incremental sync?

Couldn’t find any info, so I’ll ask it straight: is synchronization incremental? I.e. if my 2 GB Outlook file receives a new e-mail, does the whole 2 GB + 1 KB get synced? Thanks, iM

Depends on how the Outlook file format works. Hopefully not.

Thanks for the quick reply. My question doesn’t concern Outlook as such, it’s more general. Does Syncthing work along the same lines as rsync, i.e. does it only transfer the changes to any given file? iM

Yes, but it depends on how those changes are done. If data is appended to the file or changed in the middle of the file, it’s fully incremental. If stuff is added to the beginning of the file and everything is shuffled backwards, it’s not. But then that requires a full rewrite of the file to do the change as well, so it’s a pretty stupid file format if that’s the case.

OK, good to know. Thanks! iM

Have you given any thought to using something rsync-like for synchronizing large files in the future?

No, not really. This was obvious and simple and I haven’t heard a compelling use case or seen patches for anything else. :wink:

Is there any update on this topic?

I’m desperately looking for incremental sync.

The answer is as it was: take a look at your file, and split it into 128 KiB blocks. Now change the file. Only the blocks that have been altered will be synced.

This means that appends and in-file modifications will be synced in an incremental way. Changes which move data around across the whole file will not be synced in an incremental way.
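For the curious, here is roughly what that looks like in code. This is a minimal sketch in Go with made-up names (blockHashes, changedBlocks, the example file names), not Syncthing’s actual implementation: hash the file in fixed 128 KiB blocks, then diff the two hash lists to find the blocks that need transferring.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

const blockSize = 128 << 10 // 128 KiB, the fixed block size discussed above

// blockHashes returns one SHA-256 hash per 128 KiB block of the file.
func blockHashes(path string) ([][sha256.Size]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var hashes [][sha256.Size]byte
	buf := make([]byte, blockSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			hashes = append(hashes, sha256.Sum256(buf[:n]))
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break // last (possibly partial) block read
		}
		if err != nil {
			return nil, err
		}
	}
	return hashes, nil
}

// changedBlocks reports which block indexes differ between two scans.
// Data that merely shifted by a few bytes lands in different blocks,
// hashes differently, and so counts as changed -- exactly the
// limitation described above.
func changedBlocks(prev, cur [][sha256.Size]byte) []int {
	var changed []int
	for i, h := range cur {
		if i >= len(prev) || prev[i] != h {
			changed = append(changed, i)
		}
	}
	return changed
}

func main() {
	prev, _ := blockHashes("file.before") // hypothetical example files
	cur, _ := blockHashes("file.after")
	fmt.Println("blocks to transfer:", changedBlocks(prev, cur))
}
```

Note how a one-byte insertion at the front of the file shifts every block boundary, changes every hash, and forces a full transfer, while an append or in-place edit touches only the affected blocks.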

That’s quite disappointing.

Rsync and Dropbox can detect where in a file a change occurs and can insert it at the right position when syncing. That’s much more effective. Think of a TrueCrypt container that needs to be synced: the current algorithm will have to sync the whole file in almost every case :frowning:

IIRC TrueCrypt containers completely change on any save? That’s kind of the point of a secure container. So not even Dropbox/rsync would be able to cope…

That’s not how Truecrypt behaves. It’s a disk image like any other, except encrypted. If you write a 1 MB file to it, you’ll get 1 MB of changed data (plus some blocks of filesystem metadata etc) which we’ll transfer as efficiently as anything else.

I still see this as a theoretical thing, with only Photoshop files so far being mentioned as an example of something real world that would benefit from rolling checksums, and then only in corner cases.

And rolling checksums are by their nature something peer-to-peer, not really easily implemented for a cluster such as syncthing.

@GeeGee What’s the actual use case that you are desperately looking to solve?

As I moved from TrueCrypt to EncFS due to the problem mentioned here, it now only applies to PST files from Outlook. The rsync algorithm is publicly available. Why isn’t Syncthing adopting something that works very well?

Because rsync uses rolling hashes, which are not cryptographically secure, and cryptographic strength is one of the properties Syncthing requires in order to prevent spoofing/DoS. We’d need a cryptographically secure rolling hash, which most likely means inventing our own crypto, which is never a good thing.

Plus, rsync relies on a variable block size, which would make it very hard for Syncthing to maintain an index. There is already a large discussion about this in another thread; I am sure if you search for rolling hash you’ll be able to find it.
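To illustrate the trade-off being discussed: below is a sketch, in Go and with hypothetical names, of an rsync-style weak rolling checksum (the Adler-32-inspired one from the rsync paper, simplified). The O(1) roll step is what lets rsync test for a block match at every byte offset; the flip side is that the sum is 32 bits and linear, so colliding blocks are easy to construct, hence not cryptographically secure.

```go
package main

import "fmt"

// rollsum is a simplified rsync-style weak rolling checksum.
// A sketch for illustration, not rsync's exact code.
type rollsum struct {
	a, b uint16
	n    int // window length in bytes
}

// reset computes the checksum of an initial window from scratch.
func (r *rollsum) reset(window []byte) {
	r.a, r.b, r.n = 0, 0, len(window)
	for i, c := range window {
		r.a += uint16(c)
		r.b += uint16(len(window)-i) * uint16(c)
	}
}

// roll slides the window one byte: drop out, take in. This O(1)
// update is the whole point: a match can be tested at every byte
// offset without rehashing the window.
func (r *rollsum) roll(out, in byte) {
	r.a += uint16(in) - uint16(out)
	r.b += r.a - uint16(r.n)*uint16(out)
}

// sum returns the 32-bit checksum. It is small and linear, so
// colliding inputs are easy to construct -- "not cryptographically
// secure" in the sense discussed above.
func (r *rollsum) sum() uint32 {
	return uint32(r.b)<<16 | uint32(r.a)
}

func main() {
	data := []byte("hello rolling world")
	const win = 8

	var r rollsum
	r.reset(data[:win])
	for i := win; i < len(data); i++ {
		r.roll(data[i-win], data[i])
		// The rolled sum matches a from-scratch computation of the window.
		var check rollsum
		check.reset(data[i-win+1 : i+1])
		fmt.Printf("offset %2d: %08x (fresh: %08x)\n", i-win+1, r.sum(), check.sum())
	}
}
```

rsync copes with the weak sum by double-checking every candidate match with a stronger per-block hash, an interactive two-party step that doesn’t map well onto Syncthing’s precomputed block index.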

According to what I can see, PST files use a block-based database structure that should be very well suited to our sync algorithm. Again, do you see an actual problem here, in practice?

What do you mean by

rsync uses rolling hashes, which are not cryptographically secure

When is a (rolling) hash function cryptographically secure?

When it’s designed to be.

I am not aware of a cryptographically secure rolling hash. Though you could implement rsync-like comparing, it would just be a large amount of work.

Well, ok. What I don’t get though, is why that is important for spoofing/DoS. Aren’t two connected syncthing devices supposed to trust each other?

It’s (I guess) more to do with the likelihood of collisions, given we trust a 32-byte (SHA-256) hash for each 128 KiB worth of data, for potentially terabytes of it in total.
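To put rough numbers on that (a back-of-the-envelope estimate, not from the thread): by the birthday bound, the chance of any two of n blocks accidentally colliding under a b-bit hash is about n^2 / 2^(b+1). One terabyte is roughly 2^23 blocks of 128 KiB, so with a 256-bit hash that is 2^46 / 2^257 = 2^-211, effectively never. A 32-bit rolling checksum over the same data gives about 2^13 expected colliding pairs, so collisions are certain, which is why rsync pairs its rolling checksum with a second, stronger per-block hash.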

The point is more that it prevents working from a known index and distributing requests among peers. Syncing a file using a rolling checksum is an operation performed by two participants who both read through the file at the same time and report their findings. Also, it’s not necessary, as the current setup works perfectly fine for all the use cases mentioned so far.