I’m trying to find the source of a (minor) problem. I’m using the newest Syncthing, 1.27.2, on Linux. I’m a fairly experienced Syncthing user, having used it on a few machines for the last 4 years.
A new machine, a 64-bit ARM Linux box, is downloading one directory from the other Syncthing servers very slowly. The directory is a big maildir containing around 380k small files (a few dozen KB on average). What do I mean by slow? 8 days. It will take around 8 days to complete, running 24h/day, which works out to roughly 35 KB/s.
CRUCIAL: Other folders (multiple GB, but with a normal number of files) were downloaded by this machine at normal speed, reaching up to 50 MB/s.
The filesystem is btrfs, and my examination suggests it should not be the problem (see CRUCIAL above), although the directory is of course big: ls -1 | wc -l on it takes around 7 seconds.
My suspicion is that Syncthing performs some (required? unnecessary?) operation after each downloaded file, which slows down the whole process.
QUESTION:
Is it a bug in Syncthing? Is it not a bug, but the performance I should expect? Can I speed it up somehow? (I cannot change the structure of the folder; it must stay as one directory with 380k files.)
Any recommendations are welcome, and I can also do any required debugging on my side.
First of all: Lots of small files is the worst case scenario for performance.
The problem is that we need to make sure that the files are written to disk properly, so that we don’t lose data in the event of an unexpected crash or system shutdown. This involves a lot of fsync calls, which can put a strain on your storage.
If the files are stored on spinning rust, this can get quite nasty.
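To get a feel for how much the per-file fsync itself costs on this storage stack, here is a minimal Go sketch (not Syncthing’s actual write path; file count and size are made up) that writes a batch of small files with and without syncing each one. Running it inside the affected btrfs volume and comparing the two timings shows the overhead directly.

// Micro-benchmark: write many small files, optionally fsync each one.
// File size (~32 KiB) and count (2000) are arbitrary stand-ins for mail files.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func writeFiles(dir string, n int, syncEach bool) time.Duration {
	buf := make([]byte, 32*1024)
	start := time.Now()
	for i := 0; i < n; i++ {
		f, err := os.Create(filepath.Join(dir, fmt.Sprintf("msg-%d", i)))
		if err != nil {
			panic(err)
		}
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		if syncEach {
			if err := f.Sync(); err != nil { // fsync(2) on this one file
				panic(err)
			}
		}
		f.Close()
	}
	return time.Since(start)
}

func main() {
	for _, syncEach := range []bool{false, true} {
		dir, err := os.MkdirTemp(".", "fsync-test-")
		if err != nil {
			panic(err)
		}
		d := writeFiles(dir, 2000, syncEach)
		fmt.Printf("syncEach=%v: %v (%.0f files/s)\n", syncEach, d, 2000/d.Seconds())
		os.RemoveAll(dir)
	}
}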
We should offer something in between “100% safety” and “YOLO, let’s have data loss”.
e.g. Redis default fsync policy is to call fsync once a second:
you can have different fsync policies: no fsync at all, fsync every second, fsync at every query. With the default policy of fsync every second, write performance is still great. fsync is performed using a background thread and the main thread will try hard to perform writes when no fsync is in progress, so you can only lose one second worth of writes.
I wouldn’t want to change our default, but it sucks that users only have the choice between poor performance and data loss.
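For illustration only, here is a rough Go sketch of such an in-between policy in the spirit of the Redis quote: writes return immediately, and a background goroutine fsyncs whatever became dirty roughly once per second, so at most about a second of writes is at risk. This is not how Syncthing behaves today, just the pattern.

// Sketch of a "fsync every second" policy: callers mark files dirty and move
// on; a background goroutine syncs the accumulated set on a one-second tick.
package main

import (
	"os"
	"sync"
	"time"
)

type lazySyncer struct {
	mu    sync.Mutex
	dirty map[*os.File]struct{}
}

func newLazySyncer(interval time.Duration) *lazySyncer {
	s := &lazySyncer{dirty: make(map[*os.File]struct{})}
	go func() {
		for range time.Tick(interval) {
			s.flush()
		}
	}()
	return s
}

// markDirty records unsynced writes on f; the caller does not wait for fsync.
func (s *lazySyncer) markDirty(f *os.File) {
	s.mu.Lock()
	s.dirty[f] = struct{}{}
	s.mu.Unlock()
}

// flush fsyncs everything written since the last tick.
func (s *lazySyncer) flush() {
	s.mu.Lock()
	files := make([]*os.File, 0, len(s.dirty))
	for f := range s.dirty {
		files = append(files, f)
	}
	s.dirty = make(map[*os.File]struct{})
	s.mu.Unlock()
	for _, f := range files {
		f.Sync()
	}
}

func main() {
	s := newLazySyncer(time.Second)
	f, _ := os.Create("example.txt")
	f.WriteString("hello")
	s.markDirty(f) // returns immediately; the sync happens on the next tick
	time.Sleep(2 * time.Second)
	f.Close()
}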
I’ll try disabling fsync and observe the results (for others: yes, I understand the consequences; e.g. I already use a non-standard commit mount option on this filesystem).
Two questions:
what do you think about first trying to tinker with some of these:
maxConcurrentWrites, copiers, hashers, pullerMaxPendingKiB? (See the sketch after this list for one way to change them via the REST API.)
This is a 4-core machine.
any other reason/solution? As I said, on normal folders this very instance of Syncthing can sustain dozens of MB per second. Maybe some kind of new feature to handle this case? (I can submit a GitHub issue if that helps.)
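(In case it helps anyone trying the same thing: here is one way to change those advanced folder options without hand-editing config.xml, by PATCHing Syncthing’s REST config endpoint from a small Go program. The API key, listen address, folder ID and the chosen values are placeholders, and the JSON field names are assumed to match the advanced folder options, so verify them against GET /rest/config/folders on your own version first.)

// Patch a few advanced options of one folder via the REST config API.
// Only the fields present in the JSON body are changed.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	const (
		apiKey   = "XXXXXXXX"              // Actions -> Settings -> API Key (placeholder)
		baseURL  = "http://127.0.0.1:8384" // GUI/REST listen address (placeholder)
		folderID = "maildir-folder-id"     // ID of the big maildir folder (placeholder)
	)

	body := []byte(`{"copiers": 4, "hashers": 4, "maxConcurrentWrites": 8, "pullerMaxPendingKiB": 65536}`)

	req, err := http.NewRequest(http.MethodPatch,
		fmt.Sprintf("%s/rest/config/folders/%s", baseURL, folderID),
		bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-API-Key", apiKey)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}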
After reading the docs, I started doing more tests in stages. As an initial scan takes a few hours, this will take some time; I plan to write back with the results:
phase 0: (almost) defaults
phase 1: currently scanning; I started this just after the initial post.
Folder Settings:
receiveonly,
FSWatcherEnabled=false
phase 2:
Global Settings:
databaseTuning = large
setLowPriority = false
Folder settings
hashers=4
copiers=4
maxConcurrentWrites=8
copyRangeMethod=copy_file_range
phase 3: Folder Settings:
disableFsync=true
Currently this machine shows Local State:
431,952 files, 18,038 directories, ~365 GiB
of which 330k files (70%) are in this one directory, holding 20 GB (5%) of the data.
On startup, all 345 GB and 100k files in 18k filesystem directories under 12 Syncthing folders are scanned in under 1 minute (55 seconds to be exact), while this last folder takes 1-2 hours to finish its Scanning phase.
That’s why I’m looking at Syncthing as the culprit in the slow handling of directories with very many files, suspecting some kind of superfluous file-list refresh per operation/sync. I may be wrong, and this might simply be the limit of my setup, but in that case I wonder whether a new Syncthing feature to handle such a scenario would be possible.
As to my tests, phase 1 did NOT show any substantial change. I’m currently waiting for the phase 2 Scanning to complete (1-2 hrs) to see whether the syncing speed changes.
p.s. filesystem: btrfs on LUKS on mdraid on NVMe on PCIe v3, capable of sequential read/write of around 500 MB/s. RAM: 4 GB. No other services run on this host.
p.p.s.
INFO: Single thread SHA256 performance is 1006 MB/s using minio/sha256-simd (706 MB/s using crypto/sha256).
INFO: Hashing performance is 265.23 MB/s
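(For reference, a single-thread SHA256 throughput number like the one above can be roughly reproduced with plain crypto/sha256; this is not Syncthing’s own startup benchmark, just a sanity check that raw hashing speed is not the limiting factor here. The 512 MiB workload is arbitrary.)

// Hash a fixed amount of data on one goroutine and report throughput.
package main

import (
	"crypto/sha256"
	"fmt"
	"time"
)

func main() {
	buf := make([]byte, 1<<20) // 1 MiB block
	const totalMiB = 512       // total amount of data to hash
	h := sha256.New()
	start := time.Now()
	for i := 0; i < totalMiB; i++ {
		h.Write(buf)
	}
	h.Sum(nil)
	elapsed := time.Since(start)
	mbps := float64(totalMiB) * (1 << 20) / 1e6 / elapsed.Seconds()
	fmt.Printf("Single thread SHA256: %.0f MB/s\n", mbps)
}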
thank you for the link, it has a lot of useful information.
I’ll do tests with --perf-no_read_workqueue and --perf-no_write_workqueue as phase 5 tests, and we will see.
What also bugs me is that, at least in the Scanning phase, evidently only 1 core is used, as the machine load stays around 1.1 the whole time.
I’m not setting the GOMAXPROCS env variable, but maybe I should set it manually to 4? AFAIR there is no way to check it on a running Syncthing (which I don’t want to interrupt during the 2-hour scan test).
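(Side note on GOMAXPROCS: since Go 1.5 it defaults to the number of logical CPUs, so on a 4-core machine the runtime should already be allowed to use all 4 cores unless the environment overrides it. A few lines of Go show what the defaults resolve to outside of Syncthing:)

// Print what the Go runtime will use by default on this machine.
package main

import (
	"fmt"
	"os"
	"runtime"
)

func main() {
	fmt.Println("NumCPU:      ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:  ", runtime.GOMAXPROCS(0)) // 0 = query without changing
	fmt.Println("env override:", os.Getenv("GOMAXPROCS"))
}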
Here are the preliminary test results (the last phase is still running).
TL;DR: even the most aggressive configuration changes did not help, and syncing speed stays in the range of low KB/s.
Results:
phase 0: Scanning time: 160 minutes, Syncing: slow, dozens of KB/s
default settings (mostly)
phase 1: Scanning time: 160 minutes, Syncing: slow, dozens of KB/s
Folder Settings:
receiveonly,
FSWatcherEnabled=false
phase 2: Scanning time: 103 minutes (faster), Syncing: slow, dozens of KB/s
Global Settings:
databaseTuning = large
setLowPriority = false
Folder settings
hashers=4
copiers=4
maxConcurrentWrites=8
copyRangeMethod=copy_file_range
phase 2b: similar to phase 2, but a short test (partial Scanning only)
Global Settings:
GOMAXPROCS=4
Folder settings
hashers=8
copiers=8,
disableTempIndexes=true
phase 3: Scanning time: 105 minutes (comparable to phase 2), Syncing: slow, dozens of KB/s
Global Settings:
GOMAXPROCS=default
maxFolderConcurrency=4
Folder settings
hashers=8
copiers=8
maxConcurrentWrites=8
disableFsync=true # (sic!) this did NOT help
phase 4: UNDER TESTING
System:
cryptsetup --perf-no_read_workqueue --perf-no_write_workqueue
copy-on-write disabled for syncthing db folder
main guess: a CPU constraint in usermode code (Syncthing logic) combined with a slow CPU.
Two observations:
Syncthing uses ONLY ONE core per (Syncthing) folder, for both the scanning and the syncing phase. So if a single CPU core is slow while the folder needs a lot of CPU work, multiple cores do NOT help, because Syncthing does not use them (see the sketch below for what parallel per-folder hashing could look like).
there might be a possible Syncthing optimization (besides using multiple cores per folder) for directories holding many files (some kind of caching/grouping/sharding), because it currently looks like Syncthing performs some form of scan over the whole 380k-entry list for every single entry it syncs.
Also: it seems rather dubious to have per-folder copiers/hashers configuration parameters while never using more than one core per folder.
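To make the first observation concrete, here is a small Go sketch of what parallel per-folder hashing could look like: a worker pool with one worker per core, hashing the files of a single large directory concurrently. This is not Syncthing’s scanner and ignores its database and block handling; it only illustrates that the per-file work in one big directory is parallelizable.

// Hash every regular file in one directory using one worker per CPU core.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"runtime"
	"sync"
)

func hashFile(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: parhash <dir>")
		os.Exit(1)
	}
	dir := os.Args[1] // e.g. the maildir with 380k files

	paths := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ { // one hashing worker per core
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range paths {
				if sum, err := hashFile(p); err == nil {
					fmt.Printf("%x  %s\n", sum, p)
				}
			}
		}()
	}

	entries, err := os.ReadDir(dir)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		if e.Type().IsRegular() {
			paths <- filepath.Join(dir, e.Name())
		}
	}
	close(paths)
	wg.Wait()
}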
Please find below vmstat output during sync & scan (no substantial differences observed between the two). Please notice:
only 1 process running (while having 4 cores)
small IO activity = no IO layer constraint
one core fully used (US+SY ~= 25% = 1 CORE) = CPU constraint for single core code
no WA = no IO constraint
the constraint is Syncthing usermode code, not kernel code (fs/vm/LUKS), as US >> SY
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 175436 11736 2526492 0 0 10 115 3 1 11 3 86 0 0
1 0 0 174176 11736 2526492 0 0 0 0 1565 1895 19 7 74 0 0
1 0 0 171404 11736 2526496 0 0 0 0 1001 1124 20 5 74 0 0
1 0 0 173624 11752 2526480 0 0 0 1564 1225 1426 32 2 65 1 0
1 0 0 173020 11752 2526512 0 0 0 0 1225 1658 20 6 75 0 0
Not sure about copiers, but hashers are only relevant when actually hashing files, which is the case only when Syncthing meets new or modified files in the scanning process. Otherwise, if the files haven’t changed, there is nothing to hash. Scanning on its own doesn’t necessarily include any hashing.
Although Syncthing is impacted, and some tuning might help, it looks like an external issue…
btrfs directly on an NVMe drive has a much higher potential throughput than what I’ve got (mSATA SSD):
# hdparm -t /dev/sda
/dev/sda:
Timing buffered disk reads: 1410 MB in 3.00 seconds = 469.32 MB/sec
One of my Maildir directories (3.6 GB) has only about 1/3 of the file count, but all other things being equal, the results should be roughly comparable (no inline compression by btrfs or disk encryption):
time ls -1 ~/Mail/inbox | wc -l
111899
real 0m0.150s
user 0m0.106s
sys 0m0.047s
If I had 380,000 files it would take less than 1 second, so comparatively speaking 7 seconds for a directory listing means there’s significant overhead somewhere in your storage stack.
Because it’s email and your system appears to have CPU cycles to spare, if your Linux kernel supports it, enable inline compression in btrfs (use zstd and max it out at level 15). It’ll reduce the on-disk storage requirements, possibly by up to 90% for emails without attachments, and it’ll also greatly reduce the number of disk writes and speed up reads.
Also, since btrfs on top of LUKS on top of software RAID likely doesn’t benefit from btrfs checksums, consider disabling them to remove the extra disk writes.
(A few months ago I had a Maildir directory at work with ~1.8 million files totaling ~25GB – btrfs volume on a SSD – that was mirrored by Syncthing to a standby server. If it hadn’t been moved to a long-term archive I’d run some benchmarks for you.)
I already use zstd compression on this volume. Still, the vmstat output above shows where the bottleneck is:
one core, and only one core, is maxed out, mostly by usermode code (Syncthing) rather than kernel code (btrfs/LUKS/devices), and there is NO IO constraint, given the low bi/bo numbers.
The problem is:
Syncthing is not concurrent within a single folder for the scanning phase (scanner.walk) and (some parts of?) the syncing phase while processing the 380k-entry list in one directory.
Of course a faster CPU would help (and that is why your configuration works fine), but since the other 100k files are processed very fast on this very machine, it shows that Syncthing’s logic is also a bottleneck.
Being able to use all cores for these operations would speed it up by up to 400% on this 4-core box; other, more elaborate optimizations could probably gain even more.