Bursty, slow transfer

skandragon · May 28, 2015, 7:53pm

Greetings.

I have two hosts involved in a sync, where data is moving in both directions. The size of the collection is about 2 TB in files that range from a few bytes to 6 GB in size. These are many ISO images, and some meta information about them. The total number of files is around 1500.

I am seeing some strange behavior. The transfer (either side) goes from full line speed to zero, and seems to be in bursts. It will be < 100 bytes (yes, bytes!) for several seconds, then send a burst of around 5 megabits/sec (my slow line speed) , then fall back to 0 for many seconds.

I have verified I can copy a file using scp at near line speed, using a large enough file that the copy takes about a minute. I have verified this on the network graph that this is a sustained load around 5 megabits/sec, and that syncthing uses at most about 145 kbits/sec.

Any idea what I should look for? The obvious thing of restarting each end did not solve this. I don’t think it’s a system configuration issue with TCP, as other tools can sustain the transfer speed I expect. Also, this host can transfer locally at nearly 80 megabyte/sec to another host over its gig-e interface, which should rule out CPU or network faults on the host end.

Any idea where to look next?

AudriusButkevicius · May 28, 2015, 7:59pm

Can you check IO/CPU/Memory usage on both sides at the point it performs the transfer, and especially at the point the performance drops?

skandragon · May 28, 2015, 8:19pm

Slower box:

CPU: dual-core atom, which the OS presents as 4 cores. The overall CPU is sitting around 100% idle. During the burst, the “syncthing” process bursts to about 210% max. It is mostly idle, though, and is highly variable, bouncing from 0 to that upper level, but the average across any three second sample is less than 10%.

IO: Hard to measure (I’m used to using iostat!) but at peak, i/o rates are smaller by a factor of 100 than when I copy a file over to it from a LAN host. It is also mostly idle.

Faster box: 4 core slightly older Xeon. CPU never reaches more than 15% for syncthing, and 25% overall. I/O is idle, with bursts of a few transactions/sec. This raid array can do about 10,000 x in terms of raw speed what it is seeing now.

It could be the characteristics of how the process writes to the disk, but I don’t think the disk are the limiting factor. CPU speed could be an issue on the smaller machine, but even then I would expect it to be some sustained rate rather than mostly 0 with a few points where it’s using CPU.

SyncDuck · May 28, 2015, 8:28pm

There is an issue with larger files that has been fixed. Please see Slow LAN performance transferring a single 300GB file

Maybe this can fix your issue too. In short, too much memory and CPU is used the larger the file is.

skandragon · May 28, 2015, 8:38pm

Is it safe to use the build referenced in that issue, or is there a stable build I should use which has that fix?

AudriusButkevicius · May 28, 2015, 8:38pm

Not yet. There will be on sunday.

SyncDuck · May 28, 2015, 8:44pm

Works stable on several of my systems. More important, transfers go fast and don’t try to max out the CPUs. Also memory consumption is somewhat smaller. Maybe you can give it a testdrive without production data.

skandragon · May 28, 2015, 8:48pm

I’m already using a copy of production data. I’m not totally insane!

Do I need to update both ends to see the benefit?

SyncDuck · May 28, 2015, 8:52pm

I updated all clients, since transfer in my case goes both ways. In short, yes you should.

skandragon · May 28, 2015, 9:30pm

I’m using the FreeNAS (freebsd) jail version of this. It seems there is Deep Magic happening. Just replacing the binary isn’t so simple, but I’m running it on the console for testing.

Results so far: while the overall speed is slightly faster, the data coming from the slow machine is about the same, and the data going to to the slow machine is slightly faster, but still shows the same bursty behavior. It’s either full-on for 5-10 seconds, or full off for upwards of 30 seconds.

AudriusButkevicius · May 28, 2015, 10:00pm

It could just be the weak atom cpu, but as I said, it would be useful to know which stat increases as the pause happens.

skandragon · May 28, 2015, 11:12pm

I would still think the CPU would be pegged constantly, or nearly so. In this case, it’s idle for 45 seconds out of each minute (< 5% load) and bouncing up to 225% or so for the remaining 15. It’s not all together, but that’s what my traffic graph looks like.

calmh · May 29, 2015, 5:08am

The bug referenced above was on the sending side, basically responding slowly to requests for data. So make sure that fix is applied (it’s not dangerous), although I think you did that?

Running with STTRACE=model on the receiving side should reveal what it’s doing while “idle”, that may give some clues.

Also, checking disk busy percentage with iostat -xn 1 on the same box to make sure you’re not blocked on IO.

skandragon · May 29, 2015, 7:02am

When it is “idle” th only log messages that appear are:

[TV4IO] 2015/05/29 06:47:31.254308 progressemitter.go:58: DEBUG: progress emitter: timer - looking after 1 [TV4IO] 2015/05/29 06:47:31.254465 progressemitter.go:74: DEBUG: progress emitter: nothing new

I suspect these are related to the UI? This speed issue happens with or without the web UI running.

What I am seeing is groupings of REQ(out) calls for various offsets of the filename currently being downloaded, “idle” periods where only the emitter is logging, then back to REQ(out).

I’ve pasted a redacted log (filenames changed, machine keys changed) to http://pastebin.com/ah5n8nQc – this log is from the node that is having slow reception, and RRRRRR is actually the remove ID redacted.

skandragon · May 29, 2015, 7:14am

This feels like a queue problem. Since both sides are transmitting, and the link is actually a bit slow (6 Mbps or so when optimal, usually less) is it possible the requests for more data are falling behind the data being sent?

That is, it looks like this pattern repeats:

Request a bunch of offsets. Receive requests for one folder. Receive requests for another folder. wait for what may be a long time. Repeat.

Note that the “slow to receive” box is sending two folders to the remote, but only receiving one. This just feels like the requests are being delayed going out because the data dominates the time to send, and outgoing requests pile up behind the data.

calmh · May 29, 2015, 7:17am

If your upstream bandwidth is at its limit, this probably exactly what’s happening.

skandragon · May 29, 2015, 7:34am

No limits are hit other than for brief moments. My outgoing average is 1.5 mbps, which is 4.5 below my cap. My incoming averages to about 200kbits out of around 4.6 that is available.

While this transfer is going on, I can run a speed test (which is where I got these numbers) so the line is not being stressed.

Once again, I can also scp files which do hit my limits both directions, so the hosts in question do not have connectivity issues. and scp/rsync both encrypt data as well, so this doesn’t feel like CPU.

I just don’t understand what’s going on here since no limit I can find should really be causing this.

xamindar · May 29, 2015, 7:42am

Is your isp throttling/blocking syncthing traffic?

calmh · May 29, 2015, 8:12am

Hard to say. I don’t benchmark two devices both syncing large files to each other at the same time over a slow link (with high latency too?). It would be interesting to add. I suspect that the requests being queued behind the responses may cause everything to be latency limited for some reason, although it’s not super obvious to me why.

I could see the effective rate becoming about half of what the connection can handle, as in effect each side fires of a bunch of requests and the other side needs to send the responses to those before it’s requests can get out, and then vice versa. Sudden half duplex…

The point of sending multiple requests is to keep a lot of data in flight, but that fails when we can’t get the requests out there. Perhaps we could reorder the messages on the sending side to give requests priority, thus always keeping the expected number of requests in flight in both directions…

You could test this by unsharing one of the folders or so to make the transfer single-directional at a time. If that doesn’t go ridiculously slowly, then this seems to be the issue.

skandragon · May 29, 2015, 5:29pm

Even with data flowing in only one direction, I get at most 200k bytes/sec peak, averages out to about 170 kbytes/sec.

The use case for this bi-directional sync is initially to back up my home NAS to one I have racked up elsewhere. Eventually, it will be done, and from then on only rarely will large files flow both directions. It’s more likely i’ll download a set of files on the remote end, and have syncthing pull them locally.

I’ll set up some experiments when I can, but for now, I’m going to use rsync to do the initial sync from one side to bootstrap. If I then start up syncthing, will it detect that the files have been updated out of band and not destroy or change them?

Thanks!