Weirdly heavy resource usage (is my repo too big? :P)

shish · July 31, 2017, 10:30pm

Repo:

~1TB data
~4,000,000 files, organised by md5sum with two layers of directories, eg “ab/cd/abcdef1234567890”
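
For anyone trying to reproduce this layout, here is a minimal sketch of the path scheme described above. The helper names `shard_path` and `md5_of` are made up for illustration, and it assumes the full hash is kept as the file name, as in the example path.

```python
import hashlib


def shard_path(digest: str) -> str:
    """Return 'ab/cd/<digest>' for a hex digest (hypothetical helper)."""
    digest = digest.lower()
    # First two hex chars pick the top directory, the next two the second layer.
    return f"{digest[:2]}/{digest[2:4]}/{digest}"


def md5_of(data: bytes) -> str:
    """Hex MD5 of a byte string, used here only to name the file."""
    return hashlib.md5(data).hexdigest()
```

Two hex layers give 256 x 256 = 65,536 leaf directories, so ~4,000,000 files works out to roughly 61 files per directory.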

Network:

First node: Master server, 100Mbps datacenter network
Second node: Mirror, 100Mbps datacenter network
Third node: Mirror, 250Mbps home fiber connection

Syncing from the first node to the second seemed to be taking forever, so I rsync’ed the data first, then added each folder to syncthing to hopefully keep them in sync from this point on. I then added the third node, and am waiting for that to sync naturally. Except, resource usage seems crazy high:

The master node is maxing out a CPU core and doing 10MB/s of disk reads pretty much constantly, even though the only change is a couple of new files per minute (maybe 1-5MB added per minute).
The second node is doing no IO at all, but is somehow using 2.5 cores of CPU.
The third node is downloading at 300KB/s, and is maxing out my disks (15MB/s with lots of seeking) and my CPU (both cores). (Is syncthing doing something like flushing the entire 2.2GB metadata index to disk every time it downloads a single file???)

How do I figure out why resource usage is so high and make it less high?

AudriusButkevicius · July 31, 2017, 10:51pm

The UI usually shows what each device is doing. If it’s scanning, or if the files are constantly changing, you’ll just keep wasting CPU cycles, rescanning the changing files over and over.

shish · July 31, 2017, 11:27pm

Files are never modified - new files are created, and occasionally some are deleted.

According to the UI, the first two nodes are mostly “Scanning”. The second node has ~50GB free RAM, so maybe it is keeping the filesystem metadata cached which explains how it can be scanning and yet not doing any I/O… but even with everything in RAM, it still takes several minutes of processing to detect that there has been one file added. (Does syncthing stat() every file even when the directory mtime hasn’t changed?)
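
One way to see how much of those minutes is just the filesystem walk is to time a plain "stat everything" pass outside Syncthing. This is only a rough sketch: it is not Syncthing's scanner, and `stat_walk` is a made-up name.

```python
import os
import time


def stat_walk(root: str) -> tuple:
    """Stat every regular file under root; return (file_count, seconds)."""
    start = time.perf_counter()
    count = 0
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    # The per-file metadata lookup whose cost we want to measure.
                    entry.stat(follow_symlinks=False)
                    count += 1
    return count, time.perf_counter() - start
```

Calling `stat_walk("/path/to/repo")` on the node with everything cached gives a baseline to compare against the scan time. Note that a directory's mtime only changes when entries are added, removed, or renamed, so a scanner that skipped unchanged directories would miss files modified in place.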

Even when the scanning is finished, and the first two nodes are just sitting there in sync, idly being “Up to date”, they’re still using several cores :S

The third node is “Syncing”, which makes sense because it’s the only node that doesn’t have all the data - but 15MB/s of disk IO to 300KB/s of network IO seems out of balance…

I’ve now mounted .config/syncthing on a ramdisk. That seems to have brought the I/O down to 2-3MB/s, but I’m still only getting 300KB/s from the network, with all CPU cores maxed out, so I guess CPU was the bottleneck all along. Are index updates O(n) in the size of the repo? With 4 million files, if it takes 300ms to update the state of each block to say that it has been downloaded, I could only download ~3 x 128KB blocks per second.
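
The back-of-envelope numbers behind that guess, as a small sketch. The 300ms per block is purely the guess from the post, not a measured figure, and the function names are made up.

```python
BLOCK_SIZE_KB = 128      # block size quoted in the post
UPDATE_COST_S = 0.300    # guessed cost of one per-block index update


def max_throughput_kb_s(block_kb: float, per_block_s: float) -> float:
    """Upper bound on download rate if every block costs per_block_s of work."""
    return block_kb / per_block_s


def implied_cost_s(block_kb: float, observed_kb_s: float) -> float:
    """Per-block cost that would explain an observed download rate."""
    return block_kb / observed_kb_s


blocks_per_s = 1 / UPDATE_COST_S
throughput = max_throughput_kb_s(BLOCK_SIZE_KB, UPDATE_COST_S)
print(f"{blocks_per_s:.1f} blocks/s, about {throughput:.0f} KB/s")
# prints "3.3 blocks/s, about 427 KB/s"
```

Going the other way, the observed 300KB/s would correspond to about 0.43s of work per 128KB block, which is the same order of magnitude as the guess.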

I’mma try setting the rescan interval to 0 (IIRC that disables scanning?) to see if things settle down if I just leave it alone for a while…

Edit: Maybe if disabling scanning helps, switching to inotify or something would be a good idea?

AudriusButkevicius · July 31, 2017, 11:40pm

If you have tons of files, you might want to set scanProgressIntervalS for the folder to a negative value in the advanced config. That disables scan progress estimation, which is probably the cause here.
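
The advanced config in the GUI is the route meant above. If you would rather script it, the sketch below shows the idea on a sample string. It assumes each folder in config.xml is a `<folder>` element with an `id` attribute and a `scanProgressIntervalS` child element; check your own config.xml before relying on that. The sample config, the folder id "repo", and the function name are all made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed config; a real config.xml has many more fields.
SAMPLE = """<configuration>
  <folder id="repo" path="/data/repo" rescanIntervalS="60">
    <scanProgressIntervalS>0</scanProgressIntervalS>
  </folder>
</configuration>"""


def disable_scan_progress(config_xml: str, folder_id: str) -> str:
    """Set scanProgressIntervalS to -1 for one folder (assumed layout)."""
    root = ET.fromstring(config_xml)
    for folder in root.iter("folder"):
        if folder.get("id") == folder_id:
            node = folder.find("scanProgressIntervalS")
            if node is None:
                node = ET.SubElement(folder, "scanProgressIntervalS")
            node.text = "-1"  # negative value: no scan progress estimation
    return ET.tostring(root, encoding="unicode")
```

`disable_scan_progress(SAMPLE, "repo")` returns the same document with the value changed to -1 and everything else left alone.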

AudriusButkevicius · August 2, 2017, 5:49pm

Did this help?

shish · August 3, 2017, 11:34am

Sorry for the slow reply, work got hectic and I’ve been doing 16 hour days this week ^^

With periodic scanning and scan estimations disabled, the first two nodes are marked as “Up To Date”, so as I understand it they should be doing nothing but sending data to the third node, and “Global Changes” confirms no changes have been made to the repo in the past 48 hours… but they’re still pretty active. Half the time they are properly idle, but half the time they’re still using 2 cores and ~5MB/s IO on the first server (+ 2 cores and zero IO on the server with enough free RAM to keep all the filesystem metadata in kernel cache).

Also the second server seems to have odd moments, like for half a second every 30 seconds or so, where it changes status to “Syncing (100%)” and then back to “Up To Date”

Third server is still doing ~5MB/s of IO in order to write ~300KB/s of actual data

AudriusButkevicius · August 3, 2017, 12:01pm

If you run with the STTRACE=model env var on the receiving machine, you should get an idea of what it’s doing.
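
A minimal sketch of launching with that variable set from a script. It assumes a `syncthing` binary on PATH; the function names are made up, and nothing is started until you call `run_syncthing_traced()` yourself.

```python
import os
import subprocess


def trace_env(facilities: str = "model") -> dict:
    """Copy the current environment and add STTRACE=<facilities>."""
    env = dict(os.environ)
    env["STTRACE"] = facilities
    return env


def run_syncthing_traced(binary: str = "syncthing") -> subprocess.Popen:
    """Start Syncthing with tracing enabled (assumes `binary` is on PATH)."""
    # Trace output goes wherever the process normally logs, e.g. the terminal.
    return subprocess.Popen([binary], env=trace_env())
```

Expect a lot of output on a repo this size, so it is probably worth capturing it to a file and only running it for a few minutes.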

system · September 2, 2017, 12:08pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.