Help Optimizing for 10 Gbps LAN?

Good afternoon! First, thanks to all the Syncthing devs for working on such an awesome project! In the past, I’ve used Unison as my file synchronization solution, but I’m finally switching to Syncthing because it looks better and more mature in just about every respect.

My primary use case is a large (but not massive) dataset of about 11 TB (and growing) of video data and associated project files/assets:

  • I have two main nodes, each running on FreeBSD with ZFS. These machines are identical, with 64 GB of ECC RAM, 8 x86_64 cores, and 10 Gbps Ethernet.
  • Most of the time they will be running at separate sites, with a (slow) VPN connecting them, in which case we want to prioritize the real-time synchronization of small files during the day (probably artificially throttling the pipe) while allowing large files to saturate the pipe overnight.
  • Occasionally, at the beginning of a new project, one of these machines will be physically transported to the other and plugged into the same 10 GbE switch in order to receive some very large changes which would be impractical to transfer over the VPN.
  • In both cases, we want to be able to efficiently fill the available network bandwidth as necessary, unless the underlying storage is the bottleneck, in which case we want to saturate the disk throughput until the transfer is done.

What I’m observing:

  • Bandwidth between the two machines is very erratic and rarely peaks at what I think should be the maximum throughput.
  • Transfers seem to pause for seconds at a time.
  • I believe the receiver’s underlying storage write throughput should be the bottleneck at around 400 megabytes per second. However, in practice the receiver’s disks are underutilized.

Things I’ve already checked:

  • CPU is underutilized. Max CPU usage is 50%, and the system load average is less than 3.
  • I’m running Syncthing inside of jails on both sides, with VNET bridged networking. It’s not great, but iperf3 shows me about 7 Gbps of practical throughput over the bridges between the jails, so the network should not be a bottleneck.
  • I’ve tested the performance of the underlying zpool pretty extensively. (I really think it should be the primary bottleneck.) However as described above Syncthing is unable to saturate the disk throughput.
  • For comparison, a simple remote tar/NFS job between the systems seems to be able to come much closer to hitting the maximum throughput of the disks.

Things I’ve attempted to optimize with Syncthing:

  • Tried QUIC vs. TCP. (Should go without saying, but relaying is disabled and all connections are direct connections.)
  • I changed setLowPriority to false.
  • I have tried setting various “max” values higher. In some cases it was difficult to find clear documentation on what they meant, so I’m not sure if I’m helping or hurting myself.
  • I think I managed to improve the situation when I adjusted pullerMaxPendingKiB to 524288 and maxConcurrentIncomingRequestKiB to 524288 on both sides. (See just after 12:50 on the graph below.) However, while the receiver’s disk utilization is now peaking (occasionally) at 100%, it is hardly sustained, and still seems to quit writing altogether for multiple seconds at a time.

Given all of the above, could somebody who understands Syncthing’s internals give me some pointers on how to optimize this configuration? Where might there be bottlenecks I’m not considering?

Again, I’m really impressed by Syncthing in many respects, and I am excited to put it to some good use here.

Thanks very much for taking a look!

–Alex Markley

Giving Syncthing larger network buffers might help, but that would have to be done on both sides.

Also, it’s not clear if you looked at all of these metrics on both machines or just one side.

Another thing that comes to mind is that Syncthing runs fsync after every file, so if you are syncing many small files, your throughput will most likely be bottlenecked by how long it takes for fsync to flush things out. There is an advanced config flag to disable fsync, but obviously that can lead to data loss and corruption, so it should only be used to validate whether that is a bottleneck or not.
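
To make the per-file cost concrete, here is a minimal Go sketch (not Syncthing code) that writes a batch of small files once without and once with a per-file Sync() call; the file count and size are arbitrary assumptions, but the timing difference illustrates why lots of small files tend to bottleneck on fsync.

```go
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "time"
)

// writeFiles writes n small files into dir; if syncEach is true it calls
// File.Sync (fsync) before closing each file, mimicking a per-file flush.
func writeFiles(dir string, n int, syncEach bool) (time.Duration, error) {
    buf := make([]byte, 64*1024) // 64 KiB per file, arbitrary
    start := time.Now()
    for i := 0; i < n; i++ {
        f, err := os.Create(filepath.Join(dir, fmt.Sprintf("f%04d.bin", i)))
        if err != nil {
            return 0, err
        }
        if _, err := f.Write(buf); err != nil {
            f.Close()
            return 0, err
        }
        if syncEach {
            if err := f.Sync(); err != nil { // the per-file fsync
                f.Close()
                return 0, err
            }
        }
        if err := f.Close(); err != nil {
            return 0, err
        }
    }
    return time.Since(start), nil
}

func main() {
    for _, syncEach := range []bool{false, true} {
        dir, err := os.MkdirTemp(".", "fsync-test-")
        if err != nil {
            panic(err)
        }
        d, err := writeFiles(dir, 500, syncEach)
        if err != nil {
            panic(err)
        }
        fmt.Printf("syncEach=%v: 500 files in %v\n", syncEach, d)
        os.RemoveAll(dir)
    }
}
```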

I believe that fsync will not show up in I/O stats or CPU stats.

Also, I don’t think you will ever get close to 10 Gbps. Syncthing has to hash the data to verify it before every write, so the actual rate you can expect will be closer to the hashing rate that is printed on startup.
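
As a rough way to see that ceiling, the small Go sketch below measures single-thread SHA-256 throughput (Syncthing hashes blocks with SHA-256); the buffer size and round count are arbitrary assumptions for illustration.

```go
package main

import (
    "crypto/sha256"
    "fmt"
    "time"
)

func main() {
    block := make([]byte, 1<<20) // 1 MiB buffer of zeroes; real data hashes at a similar rate
    const rounds = 512           // hash 512 MiB in total

    h := sha256.New()
    start := time.Now()
    for i := 0; i < rounds; i++ {
        h.Write(block) // hash.Hash.Write never returns an error
    }
    h.Sum(nil)
    elapsed := time.Since(start)

    mib := float64(rounds) // MiB hashed
    fmt.Printf("single-thread SHA-256: %.0f MiB in %v (%.0f MiB/s)\n",
        mib, elapsed, mib/elapsed.Seconds())
}
```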

Another thing is TLS. I don’t think you can expect your CPU to sustain TLS encryption/decryption at 10 Gbps. There is nothing you can do about that (in terms of disabling etc), but you will see that manifest as pinned CPU usage.

@AudriusButkevicius thanks for the reply! I would love to adjust the network buffer settings – could you perhaps clarify which specific settings you’re thinking of?

Also, I should clarify, I’m not expecting the network to be saturated on the 10 GbE LAN. I expect the disks to be saturated in this situation. The emphasis on the 10 GbE LAN is to point out that I don’t believe the network is the bottleneck.

Regarding fsync, I don’t think that should be disabled, nor do I think it should necessarily represent a bottleneck as long as Syncthing has a lot of different files it needs to pull. (Unless you’re implying that Syncthing only synchronizes one file at a time? In which case that will definitely be the bottleneck in this scenario.)

As for TLS, I don’t think that’s the bottleneck, because the CPUs are primarily idle. I’ve also disabled compression completely, so I am kind of scratching my head over here as to what else might be causing it to slow down so frequently.

Since you asked about more metrics, here are some more graphs:

[CPU graphs: sender and receiver]

[Network graphs: sender and receiver]

Here is a sampling of disk utilization graphs as well.

The disk graphs are trickier because there is no one graph that tells the whole story. In both cases I expect the total read and write capacity to be higher (in megabytes per second) than a single disk, because the zpool has multiple vdevs, and each vdev is a mirror of two identical disks.

I’m happy to share more details about my zpool geometry on both sides if it would be helpful, but suffice it to say I believe the write capacity of the receiver is not fully utilized.

[Disk utilization graphs: sender and receiver]

Syncthing syncs 2 files at a time by default, I believe, but even if it did 1000s of files, it wouldn’t help: fsync, even though it is nominally per file, usually flushes the whole drive, at which point all I/O operations stop, so having 1000s of files blocked behind each other on fsync would not improve anything.

I suggest you either try disabling that, or as a test, transfer one single large file and see if the utilisation characteristics are different.

As for what settings to look at:

On the folder:

copiers
pullerMaxPendingKiB
hashers
maxConcurrentWrites
disableFsync

On options:

maxConcurrentIncomingRequestKiB

Obviously, I am not suggesting any values, as it all depends on the hardware/parallelism available. Increasing some might be detrimental.

Is that controlled by the maxConcurrentWrites parameter? Or is there a different parameter for tuning that?

I’ve gone ahead and disabled fsync as a test. I’ll let it run for another 10-15 minutes and see how it affects the performance.

That makes sense. I’m just struggling a bit because it seems like the documentation is sparse on what these parameters actually do / how they are related to one another. Even a little bit of guidance on how to think about the parameters would be very helpful here.

Regarding pullerMaxPendingKiB and maxConcurrentIncomingRequestKiB, as I mentioned above I adjusted these all the way up to 524288 on both sides. Given my situation and the graphs I’ve posted, could it possibly make sense to adjust these up any further? My sense is no, but it’s hard to guess.

500 MB (524288 KiB ≈ 512 MiB) of data in flight should be more than enough.

copiers decides how many files we transfer in parallel. maxConcurrentWrites is how many concurrent I/O operations we do per file.

hashers is actually relevant for scanning, so not relevant in your case.

pullerMaxPendingKiB and maxConcurrentIncomingRequestKiB are related to how much data we agree to have in flight. Namely, pullerMaxPendingKiB controls when we stop sending requests to other devices: once we have this much data worth of unanswered requests outstanding, we pause until responses arrive.

maxConcurrentIncomingRequestKiB covers the incoming side: we stop serving (queue up) incoming requests once we have this much data worth of requests we haven’t answered yet.
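
As an illustration of the in-flight limit idea (this is not Syncthing’s actual implementation), here is a Go sketch of a byte-counting budget: new requests are only issued while the outstanding-bytes budget has room, which is roughly what pullerMaxPendingKiB does on the requesting side. The block size, budget, and simulated latency are made-up values.

```go
package main

import (
    "fmt"
    "sync"
    "time"
)

// byteBudget limits how many bytes of requests may be outstanding at once,
// roughly the idea behind pullerMaxPendingKiB: once the budget is used up,
// the puller stops issuing new requests until responses come back.
type byteBudget struct {
    mu    sync.Mutex
    cond  *sync.Cond
    avail int64
}

func newByteBudget(limit int64) *byteBudget {
    b := &byteBudget{avail: limit}
    b.cond = sync.NewCond(&b.mu)
    return b
}

func (b *byteBudget) acquire(n int64) {
    b.mu.Lock()
    for b.avail < n {
        b.cond.Wait() // block: too much data already in flight
    }
    b.avail -= n
    b.mu.Unlock()
}

func (b *byteBudget) release(n int64) {
    b.mu.Lock()
    b.avail += n
    b.cond.Broadcast()
    b.mu.Unlock()
}

func main() {
    const blockSize = 128 << 10        // 128 KiB request blocks (illustrative)
    budget := newByteBudget(512 << 20) // 512 MiB in flight, like 524288 KiB
    var wg sync.WaitGroup

    for i := 0; i < 8192; i++ { // pretend to request 1 GiB worth of blocks
        budget.acquire(blockSize)
        wg.Add(1)
        go func() {
            defer wg.Done()
            defer budget.release(blockSize)
            time.Sleep(time.Millisecond) // stand-in for network + disk latency
        }()
    }
    wg.Wait()
    fmt.Println("all simulated requests completed")
}
```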

@AudriusButkevicius ah thank you, this is really helpful.

I think I might be starting to narrow in on the issue. Disabling fsync helped, but only to a point.

One factor about my dataset which I didn’t mention initially: I have a LOT of very small files in addition to the huge video files.

Switching to “smallest first” caused the overall throughput to crater to ~1 MB/s, so I’m going to try cranking up copiers and see if that helps the small file use case.

Yeah, I’m struggling with small files. About 60% of the files in my dataset are less than 1 MB, although the vast majority of the data is contained in a small percentage of (very large) ProRes video files.

Through this exercise I’m coming to the conclusion that my initial data load should probably be done with a more batch-friendly mechanism (like streaming tar) and then Syncthing should just be used to keep the replicas in sync.

I’m guessing it would be really nice for Syncthing to be able to bundle lots of small file requests into a smaller number of transactions whenever possible; that way we could really fill the I/O buffers and enjoy a better maximum throughput.

I’m also wondering if it would be helpful to have an “in between” fsync option that could trigger fsync dynamically rather than just “after every file” or “disabled”. For example, if the number of bytes written since the last fsync is less than X, skip the fsync. (But if the synchronization is finishing, always perform a final fsync.)
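
For what it’s worth, here is a rough Go sketch of that proposal (purely hypothetical, not an existing Syncthing option): files are written and kept open, and fsync is only issued once a byte threshold has been crossed, plus a final flush at the end of the batch. As the reply below points out, deferring the calls does not make each individual fsync any cheaper.

```go
package main

import (
    "fmt"
    "os"
)

// batchedSyncWriter sketches the "in between" idea: files are written
// normally, but fsync is only issued once at least flushEvery bytes have
// been written since the last flush, plus one final flush at the end.
// Files written between flush points are NOT guaranteed durable until the
// next flush, which is exactly the risk being traded for speed.
type batchedSyncWriter struct {
    flushEvery int64
    pending    []*os.File // files written but not yet fsynced
    dirtyBytes int64
}

func (w *batchedSyncWriter) writeFile(path string, data []byte) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    if _, err := f.Write(data); err != nil {
        f.Close()
        return err
    }
    w.pending = append(w.pending, f) // keep the handle open until we flush
    w.dirtyBytes += int64(len(data))
    if w.dirtyBytes >= w.flushEvery {
        return w.flush()
    }
    return nil
}

// flush fsyncs and closes every pending file and resets the byte counter.
func (w *batchedSyncWriter) flush() error {
    for _, f := range w.pending {
        if err := f.Sync(); err != nil {
            return err
        }
        if err := f.Close(); err != nil {
            return err
        }
    }
    w.pending = nil
    w.dirtyBytes = 0
    return nil
}

func main() {
    dir, err := os.MkdirTemp("", "batched-fsync-")
    if err != nil {
        panic(err)
    }
    w := &batchedSyncWriter{flushEvery: 32 << 20} // flush every 32 MiB, arbitrary
    data := make([]byte, 512<<10)                 // 512 KiB per file
    for i := 0; i < 100; i++ {
        if err := w.writeFile(fmt.Sprintf("%s/f%03d.bin", dir, i), data); err != nil {
            panic(err)
        }
    }
    if err := w.flush(); err != nil { // final flush when the batch ends
        panic(err)
    }
    fmt.Println("wrote and flushed files in", dir)
}
```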

On the large file side I am realizing I screwed myself over by starting and stopping the synchronization many times. (Lots of settings adjustments caused the sync to stop and start automatically.)

Apparently for each large (100 GB+) file, if it’s been partially transferred, Syncthing does a lot of work before resuming the transfer. This appears to include lots of read-heavy I/O as well as a lot of very CPU-heavy single-threaded work. (Single core pinned at 100% for many minutes.)

I assume this is for safety, and that is great, but it means resuming large file transfers is expensive.

I don’t think it’s the requests that cause the slowness.

We track each file’s version etc., so every file, no matter how small, results in I/O for database management. You can batch the data requests as much as you want, but the database operations still need to happen. Those are batched too, but it’s still code that is executed per file.

I am not sure what you mean by an intermediate fsync, as fsync operates on a single file, so you either call it for each file or you don’t.

If you think that doing this at some point later for a batch of files, as opposed to after every file, somehow makes it faster, it doesn’t; it still takes hundreds of milliseconds (on Windows, seconds) for each call, irrespective of whether fsync was called just before.

Yes, resuming transfers is expensive, but most of the time cheaper than re-transferring the data.

We have no guarantees that the temporary file has not been tampered with, so we have to re-read it and re-hash it to verify.
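
Roughly speaking (a simplified sketch, not Syncthing’s actual resume code), that verification amounts to re-reading the temp file block by block and hashing each block to see which ones can be reused, which explains both the read-heavy I/O and the single pinned core:

```go
package main

import (
    "bytes"
    "crypto/sha256"
    "fmt"
    "io"
    "os"
)

const blockSize = 128 << 10 // 128 KiB here; Syncthing's block size varies with file size

// reusableBlocks re-reads a partial temp file block by block, hashes each
// block, and reports which blocks already match the expected hashes and can
// therefore be reused instead of being transferred again. It is
// single-threaded read-and-hash work.
func reusableBlocks(tempPath string, expected [][]byte) ([]int, error) {
    f, err := os.Open(tempPath)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    var reusable []int
    buf := make([]byte, blockSize)
    for i := 0; i < len(expected); i++ {
        n, err := io.ReadFull(f, buf)
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            if n == 0 {
                break // nothing more on disk; remaining blocks must be fetched
            }
        } else if err != nil {
            return nil, err
        }
        sum := sha256.Sum256(buf[:n])
        if bytes.Equal(sum[:], expected[i]) {
            reusable = append(reusable, i)
        }
        if n < blockSize {
            break
        }
    }
    return reusable, nil
}

func main() {
    // Hypothetical usage: "video.mov.tmp" and the expected hashes would come
    // from the partially written temp file and the file's block list.
    expected := make([][]byte, 4)
    for i := range expected {
        expected[i] = make([]byte, sha256.Size)
    }
    blocks, err := reusableBlocks("video.mov.tmp", expected)
    if err != nil {
        fmt.Println("could not verify temp file:", err)
        return
    }
    fmt.Println("blocks that can be reused:", blocks)
}
```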

I wasn’t thinking of the implementation, but I guess I would think about maybe fflush() each individual file (although this shouldn’t be necessary if you’re closing the file handle when you’re done writing the file) and then intermittently sync()…?

Obviously this would be less safe than the explicit fsync per file, but it would be safer than no sync at all.

Hence thinking of it as a potential “in between” option for advanced users to choose a little bit of risk/speed.

fsync and flushing are completely different things.

On the TLS side, keep in mind that a single connection (which is what Syncthing uses) doesn’t get any help from more cores. So the aggregate CPU usage may not tell the entire story (you’ll see at most one core busy with TLS) and, assuming you’re using Xeons, those usually have quite lackluster single-core performance. Also make sure the database is on SSD; database latency will hurt Syncthing a lot.
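
One way to estimate that single-connection ceiling is to benchmark single-goroutine AES-GCM sealing, since the record cipher is the bulk of TLS’s steady-state CPU cost; the Go sketch below ignores handshake and framing overhead, and the record size and byte count are arbitrary assumptions.

```go
package main

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "fmt"
    "time"
)

func main() {
    key := make([]byte, 32) // AES-256 key, as commonly negotiated for TLS 1.3
    if _, err := rand.Read(key); err != nil {
        panic(err)
    }
    block, err := aes.NewCipher(key)
    if err != nil {
        panic(err)
    }
    aead, err := cipher.NewGCM(block)
    if err != nil {
        panic(err)
    }

    // Reusing a nonce is only acceptable here because this is a throughput
    // test, never in real encryption.
    nonce := make([]byte, aead.NonceSize())
    record := make([]byte, 16*1024) // roughly one TLS record of payload
    out := make([]byte, 0, len(record)+aead.Overhead())

    const rounds = 64 * 1024 // seal ~1 GiB on a single goroutine
    start := time.Now()
    for i := 0; i < rounds; i++ {
        out = aead.Seal(out[:0], nonce, record, nil)
    }
    elapsed := time.Since(start)

    gib := float64(rounds) * float64(len(record)) / float64(1<<30)
    fmt.Printf("single-core AES-256-GCM: %.2f GiB in %v (%.0f MiB/s)\n",
        gib, elapsed, gib*1024/elapsed.Seconds())
}
```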

@calmh thanks for responding!

That’s a great insight, and not something I would have necessarily gleaned from the documentation. That said, I haven’t seen any pegged cores on my system. But I will keep a closer eye on it next time I’m attempting a large sync.

These are actually both TrueNAS Mini XL+ boxes configured with 64 GB of RAM and 8-core Atom C3758 CPUs. I realize Atom doesn’t have a great reputation for any kind of performance, but supposedly the new Denverton architecture is pretty reasonable.

Anyway, if I could observe a pegged core during synchronization, that would be the smoking gun I would need to feel like I’ve identified the primary bottleneck.

Also, if TLS encryption (or even TCP overhead) is a bottleneck, it might be nice to have the option to establish multiple connections between the same two devices.

Also a really great insight. The database is on an internal NVMe drive, so that shouldn’t be an issue.

One of the issues I may be running into here is a general lack of experience with FreeBSD as opposed to Linux. I set up Syncthing in a custom jail because the “official” TrueNAS Syncthing plugin was several versions behind.

However, I’m not sure how jails work, and how much abstraction/virtualization I’m dealing with here. I’m pretty comfortable with Docker on Linux, so my mental model for BSD jails has been “roughly Docker with a persistent root”, but I could be wrong, and the jail could be introducing more of a performance hit than I am expecting.

All that said, with careful tuning I’ve been able to hit at least 70% of my expected throughput using Syncthing to perform an initial transfer of the entire dataset. However, “careful tuning” includes using very different parameters for lots of small files vs. huge files.

It makes sense to me that Syncthing is going to be “slower” than a naive tar copy, since it’s doing a nontrivial amount of additional work to guarantee data integrity, so I’m comfortable with the ~30% overhead there.

That said, it bothers me that there’s no single configuration that is optimized for my entire dataset. I have 170k files under 1 MB and 300 files over 5 GB. (And a growing number of those files are over 100 GB!) Cranking up the number of copiers for the small files is great, but it’s a horrible idea for the big files.

Considering all this, it seems like there might be an opportunity for the algorithm to automatically adjust the parameters while the sync is running, but I couldn’t even begin to guess what that would look like.

Anyway, thanks again to all of you for helping me understand these parameters better and a huge thanks for making such an incredibly polished tool. I can already tell this is going to be massively better than what I was working with before.
