Syncing large Virtual Machine Backups possible?

Hello,

I am currently searching for a solution to sync approximately 10-15 TB of data (backups of virtual machines/containers in the “ploop” format) from one data center to another on a daily basis. The daily change delta within the ploop images is expected to be around 200-400 GB. Since network bandwidth is a significant concern at these data volumes, I came across Syncthing and liked its changed-block (delta) synchronization feature.

Now I’m wondering whether I can use Syncthing for this use case with such large amounts of data. Are there any considerations? Any tips & tricks?

I appreciate your feedback.

Thank you, best regards from Tyrol,

Andy

Network bandwidth is one thing, but you will also need extremely beefy hardware to scan such large files in a reasonable amount of time in order to detect changes before the actual synchronisation. On the sending side, both the CPU and the I/O need to be very fast, and on the receiving side, at least the I/O will matter a lot.

Do you actually need bidirectional continuous sync? Or do you just need to back the images up every now and then?

Syncthing also uses SHA-256 for hashing, which I suspect will be quite stressful for the CPU at such file sizes.

If you just need to copy files from A to B a few times a day, rsync is the way to go. It has the same “only sync what has changed” capability, does not use expensive hashing, and probably avoids the overhead Syncthing carries to stay continuous and to peer with multiple other devices.
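As a minimal sketch (the paths and the dc2 host are placeholders), a one-way push where rsync’s delta-transfer kicks in automatically for files that already exist on the remote side:

    # daily one-way copy; for existing remote files only the
    # changed portions are sent over the network
    rsync -av --partial /backups/ dc2:/backups/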

I don’t know too much about this, but you’re saying the ploop files are large files with only small changes in them, and you want Syncthing to send only the changes to the remote location without transferring the whole ploop file, right?

And you’re trying to minimize transit costs between the two data centers. Is that right?

I presume the VMs are always running and the ploop files are always being written to.

Probably best to just try it and see.

I’d suggest using the “Send Only” and “Receive Only” folder types as appropriate, so you don’t end up with some weird case where Syncthing tries to modify a file belonging to a currently running VM.
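If it helps, here is a sketch of setting those folder types through Syncthing’s REST config API rather than the GUI (the folder ID “backups” and the API key variable are placeholders):

    # on the sender: mark the folder send-only
    curl -X PATCH -H "X-API-Key: $API_KEY" \
      -H "Content-Type: application/json" \
      http://localhost:8384/rest/config/folders/backups \
      -d '{"type": "sendonly"}'

    # on the receiver: mark it receive-only
    curl -X PATCH -H "X-API-Key: $API_KEY" \
      -H "Content-Type: application/json" \
      http://localhost:8384/rest/config/folders/backups \
      -d '{"type": "receiveonly"}'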

To the maintainers: there’s no concept of a “shadow copy”, right? Where a file is basically snapshotted and the snapshot is what gets transferred?

I’m just thinking: if the VM file is always being modified, how likely is it that the remote site will actually get a complete, consistent file if the file changes before the transfer completes?

I’m working with virtual disk image formats other than ploop, but most require the same considerations when transferring between servers:

  • Size of the largest backup file?
  • Are new backups overwriting existing backups with the same filenames, or are the filenames all unique?

Temp files…

With Syncthing, the server at the other data center will always need enough free storage space to hold at least the largest backup file, since Syncthing assembles the incoming data in a temp file before renaming it into place.

With rsync, the --inplace option can be used to skip creating a temp file and directly patch an existing file, reducing storage requirements and potentially a lot of disk I/O.
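For instance (paths and host are placeholders):

    # patch the existing remote file directly instead of assembling
    # a temp copy and renaming it over the original
    rsync -av --inplace /backups/vm101.ploop dc2:/backups/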

Compression…

Rsync currently offers more flexibility, including setting the compression level, and it supports zstd (a.k.a. “Zstandard”).

Zstd will use multi-threading if available, while gzip is single-threaded, so if the backup server has suitable hardware, there could be significant speed differences.

Zstd also offers higher compression levels than gzip.

Since ploop files are a variation of disk images, unless a VM/container is really full, it should compress well.
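With rsync 3.2 or newer, choosing the algorithm and level might look like this (level 3 is just a starting point, not a recommendation):

    # compress traffic with zstd at a moderate level
    rsync -av --compress --compress-choice=zstd --compress-level=3 \
        /backups/ dc2:/backups/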

Filenames…

If backup filenames are always the same, then rsync has the advantage. Otherwise you’ll need some creative scripting to conserve network resources.
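One hypothetical workaround (every path and name here is made up for illustration): always push the newest timestamped image to a fixed remote filename, so rsync can delta against the previous day’s copy:

    # find the newest backup for this VM and overwrite a stable remote
    # name, letting rsync use yesterday's data as the delta basis
    latest=$(ls -t /backups/vm101-*.ploop | head -n 1)
    rsync -av --inplace "$latest" dc2:/backups/vm101-current.ploop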

Syncthing breaks files into blocks, and blocks can be reused across files during syncing, so it doesn’t matter if a backup filename changes (e.g., includes a timestamp).

First of all let me say THANK YOU ALL for your feedback!! :)

On the sending side we have HPE ProLiant DL360 Gen10 servers with dual Intel Xeon 4214R CPUs, 192-256 GB of RAM, and the ploop backup data residing on Samsung SSDs (I’d estimate 400-500 MB/s read/write for those SSDs). The receiving side hasn’t been bought yet, so we can still take care of the needed configuration there.

Actually it’s one-way only. We want to sync our VM/CT images from our main data center to our disaster-recovery data center so that all needed data is geographically separated. We’ve already tested with rsync, but unfortunately it’s not usable with files from several hundred GB up to 2-3 TB because of this issue (not really a bug, it’s rsync’s internal logic for determining the delta changes): rsync hangs with some specific files, using 100% of 1 CPU core but without disk nor network activity · Issue #217 · WayneD/rsync · GitHub

Exactly: we are talking about 10-15 TB of data, of which “only” a few hundred GB are deleted/changed/added per day. The main intention is to be able to sync both DCs in a reasonable amount of time. It would take far too long to sync the full 10-15 TB every day, which is why we need to sync only the changes.

The VM images are always changing; that’s why we snapshot them on a daily basis and back up the then-static image. This image is what should be synced to DC2, so there’s no need for shadow copies or anything like that: we already have the static backups needed for syncing.

We are already in the process of doing so, but I guess we are not the first to try such a setup, and of course it would be interesting to get feedback from people who have already tried this! ;) At the moment it’s just a lab setup with Hyper-V VMs. I will post some findings from our tests in my next answer, as this post would grow too large otherwise! ;)

The largest image is about 3.5 TB and growing, so pretty large already; new backups currently overwrite the old ones. Unfortunately, our tests showed that rsync slows down to a near stall when syncing such large files because of rsync hangs with some specific files, using 100% of 1 CPU core but without disk nor network activity · Issue #217 · WayneD/rsync · GitHub. And I fear there is no quick fix in sight, as the issue was created over a year ago. (By the way, good to know about the multi-threading of zstd; I will look into it! :))

Hey there,

Unfortunately, it appears that Syncthing is not suitable for this use case, which is quite disappointing. Over the past week, I’ve conducted several tests, and the results consistently indicate that its performance falls short when dealing with large files.

I conducted most of these tests using a comparatively small 50 GB virtual machine root file. Both the source and target storage devices are high-speed SSDs, I tested locally over a 10 Gbit connection, and the servers involved are equipped with dual Xeon CPUs and ample RAM.

Here are the timings I observed for the 50GB file:

  • Approximately 9 minutes for “Scanning.”
  • About 47 minutes for “Preparing Sync.”
  • Approximately 1 hour and 44 minutes for the actual “Syncing” process.

In total, it took roughly 2 hours and 40 minutes to synchronize a 50GB file using Syncthing.

In comparison, the same file only took about 12 minutes when synchronized using “rsync,” as shown in the output below:

sent 3,235,311 bytes  received 2,404,797,940 bytes  3,305,467.74 bytes/sec
total size is 53,007,613,952  speedup is 22.01

real    12m8,381s
user    0m40,895s
sys     1m59,266s

I had high hopes that Syncthing would perform at least as well as rsync, if not better, so it’s genuinely disappointing to find that it falls significantly short of our expectations.

Really a pity! :-/

Andy

Did you take a look at the Configuration Tuning page in the Syncthing documentation?

Especially weakHashThresholdPct may be quite critical, and a non-default copyRangeMethod should be quite a large win if supported, but the other settings are relevant too.
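As a rough sketch of tuning those via the REST config API (the API key, the folder ID “backups”, and the chosen values are assumptions on my part; check the docs page for the exact semantics):

    # weak hash: 101 disables it entirely (which can save scan time for
    # images that only change in place), -1 forces it on; default is 25
    curl -X PATCH -H "X-API-Key: $API_KEY" \
      -H "Content-Type: application/json" \
      http://localhost:8384/rest/config/options \
      -d '{"weakHashThresholdPct": 101}'

    # let Syncthing reuse local blocks with copy_file_range, if the
    # kernel and filesystem support it
    curl -X PATCH -H "X-API-Key: $API_KEY" \
      -H "Content-Type: application/json" \
      http://localhost:8384/rest/config/folders/backups \
      -d '{"copyRangeMethod": "copy_file_range"}'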

Also, things like block ordering and sharing of partial downloads should probably be disabled, assuming only two devices are involved.
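Assuming those correspond to the blockPullOrder and disableTempIndexes folder settings, something like this (folder ID again a placeholder):

    # pull blocks sequentially and stop advertising partially
    # downloaded blocks; with only two devices neither feature helps
    curl -X PATCH -H "X-API-Key: $API_KEY" \
      -H "Content-Type: application/json" \
      http://localhost:8384/rest/config/folders/backups \
      -d '{"blockPullOrder": "inOrder", "disableTempIndexes": true}'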

You should also figure out where the bottleneck is: CPU, I/O, etc.
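For example, with standard Linux tools while a sync is running (iostat comes with the sysstat package):

    iostat -x 5    # disk saturation: watch %util on the backup volumes
    top            # CPU saturation: watch whether syncthing pins cores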

we are talking about 10-15 TB of data, of which “only” a few hundred GB are deleted/changed/added per day. The main intention is to be able to sync both DCs in a reasonable amount of time. It would take far too long to sync the full 10-15 TB every day, which is why we need to sync only the changes.

What about an intermediate step? Let Borg/Kopia/restic/zpaq do incremental backups of the images and sift out the changes, then transfer only the updated data with rsync/Syncthing. Restoring the images would take additional time, though.
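A rough sketch of that idea with Borg (repository path, archive naming, and source directory are all made up for illustration); Borg writes deduplicated segment files into its repository, and only the new segments need to travel to the other DC:

    # one-time repository setup
    borg init --encryption=none /srv/borg-repo

    # nightly: deduplicate the static ploop backups into the repository
    borg create --stats /srv/borg-repo::vm-{now:%Y-%m-%d} /backups/ploop

    # replicate only new/changed segment files to the other data center
    rsync -av --delete /srv/borg-repo/ dc2:/srv/borg-repo/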