Is it okay to "seed" a sync?

Hey, gang, I’d love some input on whether it’s okay to manually “seed” a couple of folders for Syncthing before the initial sync.

I’m attempting to speed up an initial sync of a ~700GB folder and save some bandwidth.

Is it okay to manually rsync the files from the source machine to an external drive and then rsync those files to the destination machine?

I’m hoping this will save me some bandwidth during the sync process and I’ll only have to worry about incremental changes.

Background

Computer 1: Source machine. Duplicati backup computer for my Dad’s house. Send only Syncthing folder with ~700GB of files.

Computer 2: Destination machine. Offsite destination at my house. Receive only Syncthing folder.

Both machines are running Syncthing v1.1.1 on Linux Mint 18.04.

Yep

Just don’t be alarmed that it will still do a huge “sync” at the beginning: a priori everything is considered different, and Syncthing has to compare all the files to detect that they are actually the same.


Indeed. And, a one millisecond timestamp difference between two otherwise identical 1 GiB files will show up as 1 GiB of data to sync. If the data is in fact the same nothing will be transferred, though.


Thanks for the input! I’m hoping an rsync will help preserve timestamps.

In situations where you’re not sure timestamps will be preserved, when dealing with possible differences in hashing block size (large blocks vs. small blocks), or when databases need to be recreated, I have found it can occasionally be helpful to set up one device as the ‘master’ with everything as it should be, and then to copy all the files to be synced into a ‘local’ share on the other devices. By a ‘local’ share I mean a shared folder set up on each device which is not shared with any other device.

These local copies are still hashed, so if the contents are the same in both the ‘master’ copy and the ‘local’ share, hashing will reveal this and prevent the file from being transferred over the network. Syncthing will still copy the files locally (on the filesystem), meaning that the appropriate metadata such as permissions and timestamps will be set as determined by the master copy.

I occasionally have to do this, and unfortunately with very large shares you do lose some time performing the local filesystem copy, but it is still faster than a network transfer. Just a technique to keep in mind in case the need arises in the future.

(Side note: local filesystem copying may be avoided in the future, potentially; a feature request is listed on GitHub for cp --reflink support for filesystems such as ZFS and BTRFS which would avoid physically having to make a copy and allow this step to be nearly instantaneous, also avoiding wasting disk space)

Another option is to set the known good side to send-only. Let the devices connect and hash things out. At the end there’ll be a bunch (possibly) of metadata only changes that are refused on the send-only side. Hit override and let it bake. Things are now good. Change modes as desired.

This would help in the case of syncing a change to an existing file in a folder, if we reflinked the file instead of copying it outright and then truncated/overwrote blocks/etc.

It can’t help in the scenario you proposed (temporary local folder) as we never recycle files from other folders, but copy individual blocks from where we happen to find a hash match. (ZFS style block level dedup would help but that feature is ugh and disk is cheaper than RAM mostly…)

The temporary local folder can still help with certain scenarios where nothing else quite fits. Take for instance having a massive file (say a 150 GB one). Now say that it’s in a ‘legacy’ shared folder, with most devices set to use small blocks. Now let’s say you’re expecting each device to change the file just a little bit at some point in the near future. Because of the issues with block sizes, it may be reasonable to reflink the share (or this file) into a temporary local share set to use large blocks. Syncthing will then hash it, and effectively keep both the small block hashes and the large block hashes on hand. So when another device changes the file, the file need not be transferred in full, regardless of whether the device that changed it used small blocks or large ones.
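To make the small-vs-large block issue concrete, here is a small sketch (throwaway /tmp paths; 128 KiB is Syncthing’s classic small block size, and 512 KiB stands in for “large blocks” for brevity) showing that the very same bytes hashed at two block sizes produce completely disjoint hash sets, which is why a block-size mismatch looks like a full transfer:

```shell
# Make a 1 MiB test file, then hash it at two different block sizes.
dd if=/dev/zero of=/tmp/blockdemo.bin bs=1024 count=1024 status=none
split -b 131072 /tmp/blockdemo.bin /tmp/small_   # 8 "small" 128 KiB blocks
split -b 524288 /tmp/blockdemo.bin /tmp/large_   # 2 "large" 512 KiB blocks
# Same underlying data, but the two hash sets share nothing:
sha256sum /tmp/small_aa /tmp/large_aa
```

Keeping both hash sets on hand (via the local share) is what lets either kind of peer find its matches.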

Another scenario is when all sides have the contents of the files, but some have inaccurate metadata, and uptime is important. If, say, five devices are all up-to-date but two new ones are to join the share, simply adding them would cause lots of metadata changes to take place that you may not want. By the same token, you might not be able to set the others to send-only, because it is important that they continue to work and synchronize correctly while the new devices are set up. If a device ends up with a broken database and needs to have it recreated while the share must continue operating for the other devices, such a solution may also be appropriate given the right conditions.

Scenarios like these come up for me occasionally (particularly this past year), and the temporary local share has been useful, so I just wanted to put it out there as an option. Syncing afresh takes my devices over a month to complete, so any shortcuts that preserve the integrity of data and metadata while speeding up synchronizing and not sacrificing uptime are important to me.

I’m sad to hear that the enhancement for reflink support won’t help with this scenario. I have situations like this pop up occasionally and I have had devices spend weeks of uptime simply copying files between folders so I really hoped this solution might obviate the need for all that useless physical copying. If the opportunity should arise in the future to detect duplicate files in such a way that they can be reflinked, I would really appreciate it!

Cheers

I think reflinking means something else in this case, i.e. not hard links. Under ZFS, Btrfs and other copy-on-write filesystems, it would be able to copy parts from other files without taking up the space and without spending any time doing so, by recording that this file contains a block that happens to live in another file. Yeah, under ZFS it’s probably extra RAM; under Btrfs it’s just a few extra entries in the on-disk index.

There is a prototype PR open implementing this, though in practice I haven’t managed to test whether it actually works on one of these exotic filesystems.

A reflink, as far as I can tell, is a copy-on-write hardlink on a file. So when we modify a file this would be great - we reflink the original file, creating a zero space & time copy, and overwrite the new blocks. Done & done.

But it is on the file level, so we can’t reflink ourselves to random blocks from ten different files when reusing data from other files. That’s all I’m saying.

COPY_FR_REFLINK asks for the destination file to refer to the existing copy of the data without actually copying it. Some filesystems (Btrfs, for example) are able to share references to file blocks in this way.

cp --reflink uses copy_file_range underneath.


OK cool, that would be usable, I wasn’t aware of that.

This would be super useful. I limited myself to mentioning reflinking for whole files simply because that is the more pressing matter, but if it could be implemented into the ordinary operation of Syncthing on blocks, that would save a tremendous amount of space in certain cases as well (e.g. shares with lots of shared blocks between different files). I currently have software running in the background that performs maintenance on the filesystem by squashing duplicate blocks: it hashes the blocks of all files on the filesystem, detects duplicates and deletes them, leaving only a single copy of each such block/extent referenced by every file that contains it. Integrating this functionality directly into Syncthing would save a lot of time and resources. Some widely available tools that perform similar tasks are listed here: https://btrfs.wiki.kernel.org/index.php/Deduplication

You can’t integrate it into Syncthing, as you could construct two valid files where the duplicate data is deduplicated and stored in only one of them.

This has to be done at the filesystem level.

Correct me where I’m wrong, but isn’t the following possible? For incoming files from other devices:

  1. Notice that files contain the same block hashes in various places (either in different files or within the same file).
  2. If for a given duplicate hash, the corresponding block exists on the local device, when writing the new files to disk simply point to the existing block on disk instead of copying into the new file traditionally.
  3. If the blocks with the same hashes are not yet available locally, proceed normally and add the downloaded blocks to the block database. Later, when writing a duplicate block (for the second time), notice it is in the block database and locally available, and proceed as in step 2.

For scanning for changes on the local device:

  1. If a new file contains blocks whose hashes match hashes that already exist in the block database, pass the location (i.e. the position and length in the file) to the kernel along with the existing block/extent. The Linux kernel already has support for deduplication IIRC, so all that is needed is a call to the kernel with the block/extent size and file positions to deduplicate.

Following these, all new files added to a share in Syncthing would always take up the minimum amount of space necessary with respect to avoiding copying the same data over. And even adding in files with redundant data blocks would be cleaned up simply by Syncthing noticing the match when performing a scan/hashing (which it does anyways) so this work could get the additional benefit of helping to reduce the size of the filesystem.
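Step 1 of the incoming-files list above, noticing that two files share a block purely by hashing, can be sketched with standard tools (throwaway /tmp paths, 128 KiB blocks; this only shows the detection, not the kernel-side deduplication call):

```shell
# Build two files that share one 128 KiB block.
dd if=/dev/urandom of=/tmp/shared.blk bs=131072 count=1 status=none
dd if=/dev/urandom of=/tmp/unique.blk bs=131072 count=1 status=none
cat /tmp/shared.blk > /tmp/dupA.bin
cat /tmp/unique.blk /tmp/shared.blk > /tmp/dupB.bin
# Cut both files into 128 KiB blocks and hash them.
split -b 131072 /tmp/dupA.bin /tmp/dupA_
split -b 131072 /tmp/dupB.bin /tmp/dupB_
# Any hash printed here belongs to a block stored more than once on disk.
sha256sum /tmp/dupA_* /tmp/dupB_* | awk '{print $1}' | sort | uniq -d
```

This is essentially the information Syncthing’s scanner already produces as a side effect of hashing.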

What does this mean?

This needs to be supported by the filesystem. There is no way to do this if the filesystem does not support it.

It also needs to make sense as a use case. For me, it wouldn’t.

$ ./bin/stindex -mode dup
Block 4584e1efe83ef545ed9e75ec09a730f7dcdd539ebc860ae9a7979d26c4eae317 (size 131072) duplicated 2 times
Block 36b0c868f13ff07127141a04817a6291b3a31053e3e88d661a95d275ba3752b9 (size 131072) duplicated 2 times
...
Block fa43239bcee7b97ca62f007cc68487560a39e19f74f3dde7486db3f98df8e471 (size 131072) duplicated 353 times
Block 9cd7247ea3a931bbec5e4c6537de808a850acaf75b7afe2d0ed29a08fda46eb8 (size 822) duplicated 609 times
Total data is 32878 MiB
Duplicated data is 68 MiB (0.2%)

(lightly hacked stindex to count duplicates in the index)

Oh, right, yes, this would only actually result in shared extents at the filesystem level if the filesystem supports it. But the call to perform the copy could still be uniform across all Linux machines, as I believe the new reflink-supporting copy functions have the option to revert to traditional copying if they detect they are on a filesystem that cannot reflink.
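That fallback behaviour can be seen in GNU cp itself: `--reflink=auto` attempts a copy-on-write clone and silently degrades to an ordinary copy on filesystems that can’t reflink, so the same invocation is safe everywhere (small sketch with throwaway /tmp paths):

```shell
# --reflink=auto: clone if the filesystem supports it, otherwise fall
# back to a plain copy; either way the destination is byte-identical.
echo "some payload" > /tmp/reflink-src.txt
cp --reflink=auto /tmp/reflink-src.txt /tmp/reflink-dst.txt
cmp -s /tmp/reflink-src.txt /tmp/reflink-dst.txt && echo "copies match"
```

(`--reflink=always` instead fails outright when a clone isn’t possible, which is useful when you want to know whether the filesystem actually shared the extents.)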

As for your comment, Borg, it’s certainly true that some people would benefit far more than others. However, the key is that if it were enabled only on Linux devices, the changes required would add very little running/processing time. I am not familiar with the actual Syncthing code involved, but given that the Linux kernel supports it, the code to deal with the actual reflinking/filesystem changes might be as simple as using a different function to perform the copying (one which falls back to traditional copying on the wrong filesystem). I suggested it mostly because Syncthing already hashes blocks and detects duplicates, which is the large majority of the work needed for this sort of functionality. The rest might be as simple as telling the kernel to attempt a reflink copy before falling back to a normal copy, by using the corresponding copying function.

I am not suggesting refactoring existing block databases or performing deduplication squashing on existing data. Just use a reflink-supporting copying function on Linux when new incoming blocks match existing ones on disk, let the kernel sort it out via the right flexible copy function, and limit support to this ‘best effort’ reduction. This could have a huge impact for those with lots of duplicate data, while reducing the coding effort to support it as much as possible.