Debugging performance issues

Hi!

I’m running Syncthing 0.11.26 (Windows 64-bit) and I’m seeing some really poor performance syncing files over the internet. So I’m wondering if there’s any way to profile/debug/analyse what Syncthing is doing to perhaps find the bottleneck.

One end of the connection is a Gbit LAN connecting to a 100/100 Mbit enterprise-grade fibre internet connection via some enterprise-grade hardware. The other is a Gbit LAN with consumer-grade hardware and a 112/12 mbit cable modem internet connection.

Syncing within either LAN works fairly well. I’d expect more than the roughly 30 MB/s I’m getting, but it’s not terrible for Gbit ethernet. Also, all devices on both LANs connect to the internet at their respective full speeds. I can create a VPN connection between the two LANs and move data through it in various ways at about 100 mbit/s going one way, and about 12 mbit/s the other way, showing both internet connections to be running just fine.

Syncthing is different, however. :frowning: Whether connecting through a VPN or not, as soon as the connection is via the internet, Syncthing won’t climb over 1 MB/s, and most times it just hovers around 100 kB/s or less. Multiple pairs of Syncthing clients can total much more than 1 MB/s as further evidence that the bottleneck isn’t a lack of bandwidth at any point along the way, and it’s not either ISP deliberately throttling the connection because they don’t like Syncthing traffic.

But no single instance of Syncthing can receive or send more than about 1 MB/s over the internet, and even that is pretty rare. I’m also running it on FreeNAS now, but it behaves exactly the same as the Windows instances.

Any idea how I might proceed from here? I’m currently syncing a rather large folder for the first time (26.3 GB remaining, ugh…) and the speed seems to have settled around 50 kB/s. I know sync is supposed to be a background task, but it shouldn’t be 200 times slower than a straight-up file transfer, should it?

So yeah. Help. :open_mouth:

Think about what’s different between the two tests. Is the number of peers different, etc.?

Opening up the UI will definitely affect the performance. I suggest you use external tools to check the speed, and not trust the UI, as usually when you open the UI a lot of calculations start happening.

Also, when you say you can move data at 100mbit/s in various ways, what ways?

There’s no difference in the two situations as far as I can tell, apart from the connection. Machines are hooked up like this, with no other peers in the test cases:

  1. PC A - Gbit switch - PC B
  2. PC A - Gbit switch - Gbit router - ISP - internet - ISP - Gbit router - Gbit switch - PC B

In case 1 I get speeds of about 30 MB/s. Case 2 gives typically 100 kB/s, but up to 1 MB/s at times. Theoretically I should be getting up to 13 MB/s, at least when the uploader is the 100/100 mbit end.

The ways that I can move data at full speed are things like hosting an FTP server on one end, or copying across Windows shares (with a VPN setup.) And by that I mean it does something like 75% of the theoretical max, or 9-10 MB/s. For reference, a straight up Windows share copy on the LAN (either end) reaches 130 MB/s.

Also remarkable is that if I duplicate case 2, i.e. have two PCs on either end sharing the connection (but not talking to each other on the same end), i.e. just two parallel Syncthing networks, they don’t seem to impact one another. Each instance runs at the same (sluggish) speed, regardless of what the other is doing.

Windows Task Manager and FreeNAS both agree that the network usage when Syncthing is performing like this is around 1 mbit/s. Of course that may not count some forms of overhead, but there seems to be plenty of available bandwidth. E.g. I can run a speed test on the 100/100 mbit end and get about a 100/100 result while Syncthing is working.

I’ve tried not running the web GUI on either end and there’s no noticeable difference in speed (haven’t measured it precisely yet.) Although it’s worth noting that on the straight LAN connection (case 1) I still get 30 MB/s with the GUI running on both ends. And it doesn’t seem like the syncthing process is using a whole lot of CPU time, anyway. 0-1% mostly, much more with the GUI running but rarely more than 10%.

Of course, I’ve also tried restarting, rebooting, deleting and recreating the shares, and so on.

Is it a high latency network? Because we work in a request-response manner, and by default allow only 16 in-flight requests, rather than just shoving data down the pipe like FTP does, etc…

It’s not especially high latency, no. Nodes are close enough to each other geographically (can try tracert later, but sadly not now.)

Ping time is about 14 ms across the VPN (I’ll have to mess with the routers to test latency outside of the VPN, which must wait a few hours.) LAN ping time is 0.1 - 0.2 ms. So about a 100x difference. That corresponds roughly to the LAN/WAN transfer speed ratio, so I guess that could mean something.

Still, I’m not sure how the protocol works, but even if it has to do one request-response thing per 64k block, say, then it should still do up to 71 blocks per second. So that should be more than 4 MB/s.

In my own experience with TCP sockets on Windows there’s a lot that can go wrong, though. Rookie mistakes like following Microsoft’s recommended guidelines… you know? :slight_smile: Connections choke up so easily and you get very little indication as to why. If only Syncthing weren’t written in Go. But whatever.

One thing I haven’t tried is running the FreeBSD version on both ends. That might be worth a try. See if I can’t find some way to make that happen.

Are there any verbose logging options I could try, maybe?

14ms is not a latency we should care about.

You can run STTRACE=model on both sides, which will show how often requests are fired and how fast they arrive, but your BSD-to-BSD test is the best comparison, as I feel it will be some magic Windows bullcrap.

You can also try increasing number of pullers for the folder in advanced config, to try and fill the pipe better.

Now I tried pinging from router to router, and the latency is indeed around 13-14 ms. No surprise there but as you say it shouldn’t be an issue anyway.

I tried increasing the number of pullers to 256, but that didn’t change anything.

Then I tried the latest beta because apparently there are changes to the protocol (not compatible with 0.11.26) but it behaves the same way. If the protocol has changed drastically in 0.12 (has it?) then that’s useful info, I guess, pointing at a lower-level problem. Right?

This is a snippet from the log with STTRACE=model on the receiving side. And this is the same on the sending side. Both are running the latest beta at the moment. Receiving end is Win7 Ultimate x64, the sender is Windows Server 2008 R2 x64. If that matters. Haven’t gotten around to setting up FreeBSD at the other location yet.

Not really sure what to make of it. There’s a varying gap between “REQ(out)” and “pullNeeded done”, up to about a second, which I’m guessing shouldn’t be much longer than the ping time, given that the available bandwidth is 100 mbps and the data sent across is barely over 128 kB.

Maybe that narrows it down? Could be Nagle’s algorithm backfiring? Cause I’ve had that problem before, specifically that sending small packets can halt the transmission for hundreds of milliseconds depending on various sub-ms timings that you wouldn’t think to worry about. And one thing in particular you really want to avoid is alternating between small and large packets. Not sure if that’s what Syncthing is doing. Still wrapping my head around the source code :wink:

Ugh. Reckon it would be hard to set up an environment to build Syncthing and then stick some console output into the low-level network functions? You know, timestamps at interesting points. Never touched Go in my life, mind you, but I’m told it’s not that esoteric… (?)

After changing the pullers, did you restart syncthing, and does the STTRACE=model print:

[WCAFV] 2015/10/14 22:06:21.302967 logger.go:296: DEBUG: rwFolder/TESTFOLDER@0xc082064780 c 1 p 256

256 being the crucial part?

0.12 has protocol incompatible changes, but it’s the same protocol underneath.

As for Nagle’s, it’s very possible, as we explicitly set the option with conn.SetNoDelay(false) (@calmh why?) in lib/osutil/osutil.go:SetTCPOptions:228, and the pauses between Req(OUT) occur in a very consistent fashion, indicating a 200-500 ms delay (a 200 ms delay twice, I guess?).

I think you should test the FreeBSD/Linux version, or build a custom version without SetNoDelay.

As for building syncthing, it’s fairly easy. The only thing you need is to install Go. After you’ve done that, you just run go run build.go in the checkout, and use the binary from the bin/ directory.

Actually in those logs it’s set to 16. But I did try 256 pullers with 0.11.26, as well as with 0.12.0b3 (after doing the logs), restarting several times in both cases. And the speed is the same. What does that setting do, anyway? Is it the number of concurrent TCP connections or just the queue length? Neither really seems to make sense.

I’m going to test the FreeBSD version but currently I have no more hardware for it to run on. Would have to run in a VM or maybe find a laptop somewhere. But it probably won’t be for a few days.

I’ll try to see if I can build Syncthing on Windows for now. Enabling TCP_NODELAY might very well do the trick. I’m not sure how Syncthing could really benefit from not enabling it, anyway, as opposed to manually bundling packets together before delivering to the socket. What do I know. :smiley:

Not sure, hence I asked @calmh. A VM might not work, as that will still go through the Windows TCP stack.

So I managed to build Syncthing. Wasn’t hard at all. Sadly, enabling TCP_NODELAY had no effect.

Sticking a bunch of console output in protocol.go, it seems the only place there are any noticeable delays in the request-response cycle (sender side at least) is in this bit of code in send():

select {
case c.outbox <- hdrMsg{hdr, msg, done}:
  return true
case <-c.closed:
  return false
}

That’s about as far as I can get without actually having a clue. :smile:

Not entirely sure what’s going on here, but as I understand Go, the select statement does something funky like partially evaluating all expressions, then randomly executing only one of them. Here the first case seems to be executed exclusively. Guess the second case is just error handling?

But why is the statement blocking at all? outbox is a channel, so shouldn’t it just append the message to a queue and return immediately?

It’s a queue of 0 length, so the receiver part of the queue is busy doing something else, not reading, hence we block. You should find what’s consuming stuff from the outbox, and understand why it blocks for so long between the reads.

Ok. The only other function using outbox is writerLoop. And it spends all its time in the Write function, specifically this line:

n, err := c.Writer.Write(bs)

All the rest of the writerLoop, including waiting for the next packet to appear on the queue, takes less than the Windows time resolution, which is something like 10 ms.

Right now at least, sending 131088 bytes to io.Writer takes about 220 ms. Which smells of Nagle’s bullshit, but that should be disabled. Also shorter messages transmit faster, which sort of rules it out. In fact the speed is more or less consistent at about 1.5 ms/kB or around 650 kB/s, regardless of message length and consistent with the speed reported by the GUI.

I guess I learned something, but looks like it’s back to troubleshooting the internet connection. :confused:

It could be a misconfigured NIC with a small offload buffer, or something similar. It’s quite a common thing to do if you are running a SQL server which talks to clients over LAN or something.

I wonder if reads are just as blocky as writes. Perhaps the reader on the other end is just unable to consume the data fast enough, causing the send buffer on the sender to fill up.

That might be. I should do some similar tests on the receiving end maybe.

But for now I’m playing around with the internet connection, and I discovered something interesting at last. According to iperf each of my TCP connections seems to be capped at something like 5 mbps. Running multiple parallel connections, the bandwidth sums to about 50 mbps, possibly more (best I’ve seen so far, but it varies.)

So now I’m thinking the real culprit is my ISP. If I could get iperf to connect through the VPN I might learn more, but it won’t. And if it is any indication, then there’s no way Syncthing could perform better without using multiple TCP connections, anyway.

And I can’t be the only person to run into this (in fact it seems like a common problem judging by various forums.) Has anyone ever looked into adding this feature to Syncthing? I imagine it would cause a whole bunch of cascading concurrency issues, but maybe some dev experimented enough at some point to have a working version I could try? Doesn’t matter if it’s unstable.

No, nothing was done regarding this, though it has been suggested.

It’s an open code base, hence feel free to contribute.

Well I looked into it and it seems like a major rewrite of the protocol would be required. Although ST really should run multiple parallel connections, I doubt that’s around any of the next few corners. And I certainly am not familiar enough with Go to bother trying.

But I did keep trying other stuff and eventually something wonderful happened. I’m almost saturating a 100 mbit connection right now! Yay! It seems there’s a problem with the heuristics that Windows uses to restrict automatic TCP window scaling. Exactly what the problem is or whether Syncthing is specifically triggering it with some offensive pattern of network traffic, who knows, but disabling those heuristics worked wonders for me.

The magic console command is (run CMD as administrator):

netsh interface tcp set heuristics disabled

And for good measure:

netsh interface tcp set global autotuninglevel=normal

netsh then responds with a passive-aggressive “Ok.” and the next time Syncthing creates a socket (so a restart of ST is required, I assume) that socket will have unrestricted window scaling which won’t inexplicably start to restrict itself. It works for me on Windows 7 Ultimate SP1 and Windows Server 2008 R2 Enterprise.

Here is a relevant and very handy guide to optimising Windows TCP connections if anyone’s interested. Perhaps worth linking to in the ST documentation, if it isn’t already.

The docs are on github, feel free to contribute.

I don’t think using multiple connections is such a big rewrite, as we use a io.Writer interface in most places, so we can plug any transport we want, relay, tcp, udp, multiconnection tcp etc, as long as it’s reliable and ordered.

I am in the process of refactoring it even more, in preparation for other transports.

But I think it requires more than just plugging in another transport layer. However data blocks move across the network, Syncthing still only requests one at a time, waiting for each block to finish before requesting another. To benefit from multiple connections it would have to queue up a number of requests, distribute them to several threads and then be ready to receive incoming blocks out of sequence.

The quick-and-dirty alternative of just splitting each block into smaller subblocks at the transport layer is not very efficient (because of the resulting small block size) and would make each of the connections a potential bottleneck for the whole transfer.

Or, if I’m missing something obvious, all the more to suggest I’m not qualified to implement the feature. :wink:

Negative. We request up to the number of pullers (per-folder config value) at a time. By default 16 requests per folder are in flight at a given time. These are distributed among the available peer devices on a least-busy basis.
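A toy model of that least-busy distribution, just to illustrate the idea (pickLeastBusy is invented for this example, not Syncthing’s actual code):

```go
package main

import "fmt"

// pickLeastBusy returns the index of the device with the fewest
// outstanding requests, i.e. the next request goes to it.
func pickLeastBusy(outstanding []int) int {
	best := 0
	for i, n := range outstanding {
		if n < outstanding[best] {
			best = i
		}
	}
	return best
}

func main() {
	outstanding := []int{3, 1, 2} // in-flight requests per peer device
	fmt.Println("next request goes to device", pickLeastBusy(outstanding))
	// prints: next request goes to device 1
}
```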