Low Sync efficiency

I have two LANs with a 1mbit link between them. The source LAN has one syncthing instance, the destination LAN has two.

If I add a new, 10GB file for transfer on the source side (this quantity of data takes just about 24 hours) the source network instance pushes a full copy of the data to each of the instances on the destination network. As far as I can tell, the two instances on the destination network never share any significant amount of data. This observation is backed up by direct transfer via rsync or similar taking approximately 24 hours over the slow link while syncthing requires 48.

If I limit the connections so that the source instance is only allowed to talk to one of the two destination nodes, I observe that the redundant transfer stops, but the second destination node reports that “no connected device has the required version of this file” until the first destination node has completed its copy and transfers nothing in the interim…

Unfortunately I’m not terribly familiar with Go, so I’m not confident in my ability to correctly analyze this myself from the source. But it seems like maybe chunks aren’t being advertised as available for transfer until the whole file is completed?

For a big collection of small files set to random pull order this is fine, but for anything of significant size, or if all nodes pull files in the same order it seems to waste a lot of potential redundant transfer links since the even split of transfer bandwidth results in all nodes getting finished with their transfer from the first node at about the same time.

Everyone seems to think that this isn’t how it’s supposed to work, so maybe there’s something messed up in my config? Or maybe the status readout is lying when it says it can’t find a copy to transfer from? But then why the roughly twofold increase in transfer time?

It seems like maybe this needs some testing. Is there some way I can extract more information about what’s going on with a particular file? Or should I just write this up as a potential bug?

We do optimise this, and there are even some (questionable quality) tests to verify this should still work.

However, the default implementation tries hard to avoid disk thrashing, so the file is split into N contiguous parts, where N is the number of devices that the current device sees sharing the same folder, and each device downloads its own part before attempting to download the next part.

If N is, let’s say, 3 and the file is 100MB, device 1 will download 0-33MB first, device 2 will download 33-66MB, etc. After that, device 1 will move on to 33-66MB, which in theory it can then get partly from device 2, etc. But even then, it tries to spread the load across all the devices, sending each request to the device that has the fewest outstanding requests (from the perspective of what we sent).

So speed-up, if any, will only happen towards the end.
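To make the split described above concrete, here is a minimal Go sketch. It is not Syncthing's actual code: `span`, `splitParts` and `pullOrder` are made-up names, it works on byte offsets rather than blocks, and the wrap-around order after the first part is an assumption taken from the 33-66 example.

```go
package main

import "fmt"

// span is a half-open byte range [Start, End) within a file.
type span struct{ Start, End int64 }

// splitParts divides a file of the given size into n contiguous parts,
// one per device sharing the folder. Illustrative only: the real thing
// works on blocks rather than raw byte offsets.
func splitParts(size int64, n int) []span {
	parts := make([]span, 0, n)
	for i := 0; i < n; i++ {
		start := size * int64(i) / int64(n)
		end := size * int64(i+1) / int64(n)
		parts = append(parts, span{start, end})
	}
	return parts
}

// pullOrder returns the parts in the order the device with index self
// would fetch them: its own part first, then the following ones,
// wrapping around (assumed order, see the note above).
func pullOrder(parts []span, self int) []span {
	order := make([]span, 0, len(parts))
	for i := range parts {
		order = append(order, parts[(self+i)%len(parts)])
	}
	return order
}

func main() {
	const mb = 1 << 20
	parts := splitParts(100*mb, 3)
	for dev := range parts {
		fmt.Printf("device %d:", dev+1)
		for _, p := range pullOrder(parts, dev) {
			fmt.Printf(" %d-%dMB", p.Start/mb, p.End/mb)
		}
		fmt.Println()
	}
}
```

With three devices every receiver starts on a different third, which is why they have nothing to offer each other until the first third is done.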

You can, in advanced settings, configure Syncthing to download the parts of a file in random order. That might make this speed-up happen sooner, but at the cost of thrashing the disks on both the sole source and the receivers, and I’m not sure you’d win that much.

Comparing with rsync etc. is not really fair, as rsync just shoves bytes down the socket as fast as it can, whereas Syncthing has to advertise stuff to others, maintain database state, perform cryptographic hashing for verification, etc.

Regarding the non-direct sharing: as per the explanation above, the device goes to fetch its own part from the device on the LAN, but that device doesn’t have it yet, so it just backs off.

You can probably enable model debug logging on each of the receiving devices, and you should see them fetching different contiguous regions from upstream, getting periodic advertisement messages stating what they have, and eventually starting to get data from each other.

Adding to the above excellent explanations, if I remember correctly we don’t start syncing a file if it’s not 100% available from somewhere at that time. So having a peer with 75% of the file available is not good enough, and you get the “nobody has this file” error until at least one connected peer is fully synced.

When I cut it down to only one pair of devices getting to talk across the slow link it seems fairly comparable to rsync. Sure, there’s a small amount of overhead, but it’s really not that much. Syncthing is reporting 900-950kbps transfer speed on a 1mbit link, and that matches with the file growth rate so really doing pretty well. Rsync only does a few kbps better.

So from your description, I’m guessing what’s happening here is that the two receivers are each pulling a third of the file across the slow link, and then one of them pulls across the fast link while the other goes for a second copy across the slow link, and then they both jump on the slow link again for the final piece, resulting in 5/6 of two copies having to go across the slow link…

It seems like that’s only going to be the optimal transfer pattern when the network bandwidth is as fast as or faster than disk IO.

Is there a manual somewhere that gives what advanced settings are available and what they do? They seem to be all “enter the right magic string or number” fields with no description, and I don’t want to break anything. Given that even my slow, spinning disks are an order of magnitude faster than the slow network link I suspect this is one case where thrashing will have less impact than not doing any more redundant uploads than necessary.

And is there an option to set a maximum part size? Avoiding disk thrashing probably helps quite a bit on rotating media, but it has diminishing returns as the file gets larger since the odds of the OS seeking over to some other file or metadata structure anyway scale to virtual certainty relatively quickly unless the shared folder is the only thing of interest on the disk. So being able to tell it (using your example setup of three machines and 100MB file from above) to split into 3 pieces, or into 10MB chunks, whichever results in more parts, would give a way to tune the tradeoff between disk thrashing and single-source transfer bandwidth limitations. Especially as people are moving to solid-state storage for non-bulk data and thrashing becomes a non-issue in many cases.
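As a rough sketch of the rule being asked for (purely hypothetical, not an existing Syncthing option), the part count would be the larger of "one part per device" and "enough parts that none exceeds the maximum size". The function name `partCount` and its parameters are made up for illustration.

```go
package main

import "fmt"

// partCount returns how many parts a file would be split into under the
// proposed rule: one part per device, or chunks of at most maxPart
// bytes, whichever yields more parts.
func partCount(size int64, devices int, maxPart int64) int {
	if size <= 0 || devices <= 0 || maxPart <= 0 {
		return 0
	}
	bySize := int((size + maxPart - 1) / maxPart) // ceiling division
	if bySize > devices {
		return bySize
	}
	return devices
}

func main() {
	const mb = 1 << 20
	// 100MB file, three machines, 10MB cap: the size cap wins.
	fmt.Println(partCount(100*mb, 3, 10*mb))
	// 20MB file, same setup: one part per device wins.
	fmt.Println(partCount(20*mb, 3, 10*mb))
}
```

Small files keep the current one-part-per-device behaviour, so the cap only changes anything once the file is large relative to the number of devices.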

The block size is variable between 256k and 16mb depending on the file size. You can’t change that.

The settings are either in the docs website, or you gotta go dig through the code sadly :frowning:

The variable block size is fine. I wasn’t worried about that. I’m looking specifically at whatever’s making it ignore blocks that are outside of that N-way scheduling (“parts”) we were discussing.

Because, in the setup I currently have, I stopped communications between one of the two receivers and the originator so as to avoid redundant uploads across the slow link. The result is that, while the destination node which is still talking directly to the source has transferred 5 out of 9.2GB of the file (starting at about the one-third point, I checked with a hex editor) the second destination which is no longer talking directly to the source has received none of that data whatsoever. (It’s still only got the first 3.1GB that it downloaded while directly connected, which starts at the beginning of the file. So there’s plenty of data the two receiving machines could share, they’re just deciding not to try.)

So it seems like splitting the block distribution into one section per node precludes a lot of bandwidth sharing when N is small and the file size is large. There’s a lot of data that is available to be shared forward between downstream nodes, but that process doesn’t even start until they’ve each received their respective pieces from the original.

While this is likely the best strategy for lots of small files to avoid thrashing the disks, it seems like the performance will degrade substantially as the filesize/node ratio increases. Asymmetric link speeds will exacerbate it further as completion of whole files stalls out waiting on the last, large segment over a slow link, despite the same data being available on a fast connection from a different host.

A certain amount of this is unavoidable without making the algorithm cumbersome, but – assuming I’m not misunderstanding the description of how it works and the behaviour I’m seeing – increasing the size of N beyond the number of participating nodes in order to keep the transfer scheduling blocks to a reasonable size might let the system avoid bogging down when one node has a slow link while still avoiding thrashing the disks.

For example: The current sync I’ve got running three nodes means three pieces and the time required to transfer a third over the slow link before that third gets advertised as available for others results in 5/3rds of the data having to be transferred over the slow link. (Slower than if it sent all of the data to one recipient first and then moved onto the next which, given the relative link speeds, would only send 4/3rds of the data over the 1mbit and the rest over the 1gbit between recipients.)
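To put rough numbers on those fractions: the 5/3 figure is 1/3 + 1/3 for the first round (each receiver pulls a different third), 1/3 for the second round (one receiver re-pulls over the slow link while the other uses the fast one), and 2/3 for the last round (both pull the final third). The Go sketch below turns each factor into hours on the slow link. It assumes decimal gigabytes, a flat 1,000,000 bits per second and no protocol overhead, so treat the results as ballpark only; `slowLinkHours` is a made-up helper.

```go
package main

import "fmt"

// slowLinkHours estimates how long it takes to move size bytes across a
// link of bitsPerSec when the link ends up carrying factor times the
// file (factor 1.0 means a single copy).
func slowLinkHours(size, bitsPerSec, factor float64) float64 {
	return size * 8 * factor / bitsPerSec / 3600
}

func main() {
	const size = 9.2e9 // 9.2GB file, decimal gigabytes assumed
	const link = 1e6   // 1mbit link, overhead ignored
	cases := []struct {
		name   string
		factor float64
	}{
		{"single copy", 1},
		{"one receiver first, then the other (4/3)", 4.0 / 3},
		{"three parts as described (5/3)", 5.0 / 3},
		{"two full copies", 2},
	}
	for _, c := range cases {
		fmt.Printf("%-42s %.1f h\n", c.name, slowLinkHours(size, link, c.factor))
	}
}
```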

If, however, parts were divided up as “one part per node, but no more than one gigabyte per part” then this file would be 9 parts, and it is unlikely that more than 11/9ths of the data would have to go over the slow link.

Obviously there will be some crossover point where the disk thrashing causes a bigger bottleneck than the slow link being in the mix, but given that the majority of nodes are both sending and receiving simultaneously, the probability of multiple extents on disk, and that the OS presumably has other things to do with the disk that will be interrupting anyway I would be very surprised if parts larger than 1GB showed any significant reduction in thrashing. A lot of disk defragmenters don’t bother with moving fragments larger than 300-400MB for that matter because it just isn’t worth the headache.

I’ll see if I can make sense of the options and/or code well enough to play with it a bit. Worst-case for my particular purposes I can probably make it behave itself better by whacking the big files into 100Mbyte pieces myself and reassembling as-needed.

Thanks for the explanation of what’s going on behind-the-scenes.

The code for most of it is here:

N is not something you can configure, it’s just the number of devices that share that folder.

From what you describe, I’d expect pulling the “unique range” (total size / total devices) and then random ranges of a lower size (1-10MB? < total size / total devices) would be best - better distributes load after the initial distribution. However maybe there’s other use-cases where that would be worse.

@tlhonmey did you already try this? It should be configured in Actions → Advanced → your folder → Block Pull Order → replace “standard” with “random” (possible options are in syncthing/blockpullorder.go at main · syncthing/syncthing · GitHub, did not find anything in docs about it)

I guess for this slow link it could still lead to some speedup even if it is bad in most cases?

I will give random pull order a try along with everything else. Thanks for the link to the options, trying to find the list of allowable values for the advanced options is like going on a scavenger hunt. :smiley:

Ok, so I figured I’d try adding my own block pull order calculator to make it easy to compare a few hybrid algorithms that won’t choke on asymmetric links as much as standard and hopefully won’t thrash the disk as much as random.

But I ran into this as I was trying to dissect the structure:

// Code generated by protoc-gen-gogo. DO NOT EDIT.

// source: lib/config/blockpullorder.proto

There is no lib/config/blockpullorder.proto…

Furthermore, blockpullorder.pb.go very obviously contains generated code that I have no wish to try to decode and replicate by hand and blockpullorder.go obviously doesn’t have sufficient data to generate much of anything.

I’m guessing the comment needs updating. If I can get a hint I’ll keep trying to do this the tidy way. Otherwise I’ll end up doing it the ugly way and whatever I come up with will probably have to remain a local patch.

I think protos are all in their own proto directory now. So the file you’re searching for is here:

Thanks, I’ll see if I can wrap my head around what talks to what.
