Initial v0.9 benchmarks

So the point of the v0.9 release is supposed to be improved scalability and performance. I’ve started doing some light benchmarking to check that I’m on the right track, and I think it looks good. In this test I have two nodes and a single repo. The repo contains 11 GiB in 175000 files. Node 1 has all the files and is indexed and ready, node 2 is completely blank. I measure CPU and memory usage for node 2 from startup to “in sync”.

Here’s v0.8.20 (the x scale is in seconds since start):

There’s an immediate spike up to 150 MiB of RAM when it receives the index, then it … crunches? that index at 100% CPU for almost five minutes. I honestly don’t know what’s happening there, but it’s reproducible. This is measurement I should have done earlier, because that’s creepy. Then it starts syncing at about the 280 second mark, finishes at about 450 seconds and then continues crunching at 100% CPU until I interrupt it. Memory is allocated along the way as files are synced and added to the local index, ending up just over 250 MiB.

Here’s v0.9.0 beta (the scales are the same as above for easy comparison):

Here we can see an initial increase in memory over the first 60 or so seconds as indexes are received and the pulling starts. There’s no real delay before it starts syncing, unlike with v0.8. It’s done syncing at about 400 seconds, CPU usage drops to almost nothing, and memory usage stays constant at just under 100 megs. Some things to note here:

  • v0.9 actually takes longer to sync and uses more CPU doing it, counted from when it starts. This is a reasonable tradeoff because we’re doing a lot of database transactions in the meantime; in v0.8 we skip all that since the “database” is just a map in RAM. Even so, v0.9 finished the race a little earlier overall. In an actual setup with non-infinite bandwidth between nodes the difference should be smaller.

  • Memory usage here differs by a factor of about 2.5. But given more data, v0.8 would balloon further while v0.9 should stay constant.

  • There’s for sure more to do to reduce the memory footprint of v0.9, but just getting it to a constant level instead of scaling with the amount of synced data is a good step.


I think these benchmarks are awesome! Not necessarily from a performance standpoint (though that too), but more from a project maturity standpoint. Would you mind sharing the procedure behind these benchmarks so they can be reproduced?


Sure. It requires building syncthing yourself with ./build.sh "" -tags perfstats (yes, the empty quotes are necessary). When built this way, it will create a CSV file called perfstats-<pid>.csv when run, containing four columns: timestamp in milliseconds since startup, CPU usage in percent, current heap allocation in bytes, and current system RAM allocation in bytes. A new row is written every 250 milliseconds. It’ll look something like this:

$ cat perfstats-66356.csv 
250     0.000000      2174816       6424824
500     114.701147    18884120      27744504
751     151.408038    26774696      40274168
1000    126.631446    28547768      40798456
1250    207.712356    26541856      40929528
1500    171.936890    26914416      42174712
...

The difference between the two memory allocation numbers is the garbage collector; the first one is what syncthing actually has in use internally, the second is how much memory syncthing has allocated from the operating system. From an external point of view, the second value is the more meaningful one (and is what is shown in the GUI).
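The CSV is also easy to post-process directly if you just want the peak numbers rather than a plot. As a minimal sketch in Go (assuming whitespace-separated columns as in the sample above; the program name and file name are just examples), something like this reports the peak of the two memory columns:

// peakmem.go: a small sketch that scans a perfstats CSV and prints the peak
// memory columns. Usage: go run peakmem.go perfstats-<pid>.csv
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: peakmem <perfstats.csv>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	var peakHeap, peakSys int64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Columns: time (ms), CPU (%), heap in use (bytes), allocated from OS (bytes)
		fields := strings.Fields(sc.Text())
		if len(fields) < 4 {
			continue
		}
		heap, _ := strconv.ParseInt(fields[2], 10, 64)
		sys, _ := strconv.ParseInt(fields[3], 10, 64)
		if heap > peakHeap {
			peakHeap = heap
		}
		if sys > peakSys {
			peakSys = sys
		}
	}
	fmt.Printf("peak in use: %.1f MiB, peak allocated: %.1f MiB\n",
		float64(peakHeap)/(1<<20), float64(peakSys)/(1<<20))
}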

Then I create the plots using gnuplot; the script is very simple and quite ugly:

# plot.gnuplot -- output size, grid and axis setup
set term png medium size 600,400

set grid x y
unset autoscale y
set autoscale ymax
set ytics nomirror
set xrange [0 : 80]

# CPU plot: column 1 is milliseconds since start, column 2 is CPU percent
set output "cpu.png"
unset key
set title "CPU Usage"
set format y "%.0s %%"
plot filename using (($1)/1000):2 title "CPU" with lines

# Memory plot: columns 3 and 4 are heap in use and allocated from the OS, in bytes
set output "mem.png"
set key left
set title "Memory Usage"
set format y "%.1s %cB"
plot filename using (($1)/1000):3 title "In Use" with lines, filename using (($1)/1000):4 title "Allocated" with lines

$ gnuplot -e 'filename="perfstats-66356.csv"' plot.gnuplot

Note also that STPROFILER=:9090 syncthing will give you a profiler on port 9090 (even with the standard build). This needs the Go toolchain to use. To inspect currently allocated memory, for example:

$ go tool pprof http://localhost:9090/debug/pprof/heap
Read http://localhost:9090/debug/pprof/symbol
Fetching /pprof/heap profile from localhost:9090 to
  /var/folders/9y/vgzklz8d48nbkxphhl3bb7yr0000gp/T/yVw0RvFafS
Wrote profile to /var/folders/9y/vgzklz8d48nbkxphhl3bb7yr0000gp/T/yVw0RvFafS
Adjusting heap profiles for 1-in-524288 sampling rate
Welcome to pprof!  For help, type 'help'.
(pprof) top
Total: 25.4 MB
    12.0  47.2%  47.2%     12.0  47.2% code.google.com/p/snappy-go/snappy.Decode
     4.0  15.7%  63.0%      4.0  15.7% github.com/syndtr/goleveldb/leveldb/memdb.New
     1.5   5.8%  68.8%      2.5   9.7% github.com/calmh/syncthing/protocol.(*rawConnection).readMessage
     1.5   5.8%  74.5%      2.9  11.6% github.com/calmh/syncthing/protocol.(*rawConnection).writerLoop
     1.5   5.8%  80.3%      1.5   5.8% github.com/calmh/xdr.(*AppendWriter).Write
     1.0   3.9%  84.3%      1.0   3.9% crypto/tls.(*block).reserve
     1.0   3.9%  88.2%     13.0  51.2% github.com/syndtr/goleveldb/leveldb/table.(*Reader).readBlock
     0.5   2.0%  90.2%      0.5   2.0% allocg
     0.5   2.0%  92.1%      0.5   2.0% bytes.makeSlice
     0.5   2.0%  94.1%      0.5   2.0% github.com/calmh/syncthing/beacon.(*Beacon).reader
(pprof) 

Apparently the snappy decoder has allocated the majority of the memory at this point (although only 12 megs in absolute terms), and we can look at what caused that:

(pprof) web snappy.Decode
Total: 25.4 MB
After focusing on 'snappy.Decode': 12.0 MB of 25.4 (47.2%)
Dropping nodes with <= 0.1 MB; edges with <= 0.0 abs(MB)
Loading web page file:////var/folders/9y/vgzklz8d48nbkxphhl3bb7yr0000gp/T/UNk5wxsExc.0.svg
(pprof)

Here we can trace the “guilty” parties to the file puller (5.5 MB) and the scanner (4.5 MB) and then of course dig deeper. Obviously this requires some staring at the code to make sense of, but it’s still kind of cool to have the ability in a production build.

Thank you very much, that’s extremely helpful. Do you think that this is true?

Allocated Memory = In Use Memory + Uncollected Garbage Memory

I want to look into the memory issues some more, and I wonder how much garbage collection tweaks could save. Or do you have another good idea for a starting point for those tweaks? Thanks!

Yep. There are some things to do; I have some hacks going in the “heapprof” branch. Specifically, iterating over items in the on-disk database generates a lot more garbage than necessary. I only started hacking on this yesterday evening, but I added the file https://github.com/syncthing/syncthing/blob/heapprof/cmd/syncthing/heapprof.go which (when compiled with the heapprof flag) writes a heap profile whenever the heap usage has increased. This can be used to inspect the heap profile at peak usage with the profiler, to pinpoint which areas are allocating the most.
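For reference, the idea is roughly the following; this is just a sketch of the approach with hypothetical file names, not the actual code from the branch. A background goroutine polls the heap size and dumps a new profile whenever it reaches a new peak:

// A rough sketch of the approach (not the actual heapprof.go from the branch):
// periodically read the heap size and write a new profile whenever it grows
// past the previous maximum.
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func saveHeapProfiles() {
	var maxHeap uint64
	var m runtime.MemStats
	for {
		runtime.ReadMemStats(&m)
		if m.HeapAlloc > maxHeap {
			maxHeap = m.HeapAlloc
			// Hypothetical file name pattern; the real code may differ.
			f, err := os.Create(fmt.Sprintf("heap-%d.pprof", time.Now().UnixNano()))
			if err == nil {
				pprof.WriteHeapProfile(f)
				f.Close()
			}
		}
		time.Sleep(250 * time.Millisecond)
	}
}

func main() {
	go saveHeapProfiles()
	// ... the rest of the program does its normal work ...
	select {}
}

The resulting files can then be inspected with go tool pprof as above, to see where the peak allocations come from.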


Just wondering if Jakob or anyone has grabbed networking performance numbers as yet. For instance, if I have 100 x 100MB files and copy them using: SyncThing, FTP, SFTP, CIFS/Windows Share, rsync, TeraCopy, what is the delta?

I wouldn’t count on syncthing being a particularly fast way to just send a bunch of files from A to B; that’s not the primary functionality, and there are a bunch of other highly optimized projects where that is the focus.

That said, as long as you have some CPU and disk performance, the bottleneck should be the network in most of those cases.

Thanks, was just wondering what the general throughput was. Seems pretty good. I’ll try to get some numbers when I get my linux box online.

Is that available on Android as well?