380k files, 24GB in directory results in very slow sync = 8 days

I’m trying to find the source of a (minor) problem. I’m using the newest Syncthing 1.27.2 on Linux. I’m a fairly experienced Syncthing user, having used it for the last 4 years on a few machines.

A new machine, a 64-bit ARM Linux box, is downloading one directory from the other Syncthing servers very slowly. The directory is a big maildir containing around 380k small files (a few dozen KB on average). What do I mean by slow? 8 days. It will take around 8 days to complete, running 24h/day, which works out to roughly 35 KB/s.

CRUCIAL: Other folders (multiple GB, but with a normal number of files) were downloaded normally by this machine, at speeds reaching up to 50 MB/s.

The FS used is btrfs, and my examination suggests it shouldn’t be the problem (see CRUCIAL above), although of course the directory is big: $(ls -1 | wc -l) takes around 7 seconds.
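(Most of those 7 seconds may just be ls sorting the entries; to isolate the raw directory-read cost, something like this should do, with /path/to/maildir being a placeholder:)

  # -f disables sorting (and implies -a), so this mostly measures the readdir itself
  time ls -f -1 /path/to/maildir | wc -l
  # or, skipping ls entirely:
  time find /path/to/maildir -maxdepth 1 | wc -l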

My suspicion is that Syncthing is doing some (required? unnecessary?) operation after each file is downloaded, which slows down the whole process.

QUESTION: Is this a bug in Syncthing? Is it not a bug, but the performance I should expect? Can I speed it up somehow? (I cannot change the structure of the folder; it must stay as one directory with 380k files.)

Any recommendations are welcome; I can also do any required debugging on my side.

State as of this post Jan 20th:

First of all: Lots of small files is the worst case scenario for performance.

The problem is that we need to make sure that the files are written to disk properly, so that we don’t lose data in the event of an unexpected crash or system shutdown. This involves a lot of fsync calls, which can put a strain on your storage.
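If you want to see that effect in isolation, a rough micro-benchmark along these lines shows the gap between plain small-file writes and a write plus fsync per file (the test directory is just a placeholder; run it on the same filesystem as the folder):

  mkdir -p /tmp/fsync-test && cd /tmp/fsync-test
  # 1000 small files, no fsync
  time sh -c 'for i in $(seq 1000); do dd if=/dev/zero of=plain.$i bs=16k count=1 status=none; done'
  # the same again, with an fsync per file (conv=fsync)
  time sh -c 'for i in $(seq 1000); do dd if=/dev/zero of=synced.$i bs=16k count=1 conv=fsync status=none; done'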

If the files are stored on spinning rust, this can get quite nasty.

You can disable fsync, which might make it faster, but at that point you risk your data being corrupted in case of a power loss etc.
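(For reference, that’s the advanced folder option disableFsync; something like this should flip it at runtime via the REST API, where the folder ID “maildir” and the API key are placeholders. It’s also reachable via Actions → Advanced in the GUI.)

  curl -s -X PATCH -H "X-API-Key: <your-api-key>" \
       -H "Content-Type: application/json" \
       -d '{"disableFsync": true}' \
       http://localhost:8384/rest/config/folders/maildir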

We should offer something in between “100% safety” and “YOLO, let’s have dataloss”.

e.g. Redis default fsync policy is to call fsync once a second:

you can have different fsync policies: no fsync at all, fsync every second, fsync at every query. With the default policy of fsync every second, write performance is still great. fsync is performed using a background thread and the main thread will try hard to perform writes when no fsync is in progress, so you can only lose one second worth of writes.

I wouldn’t want to change our default, but it sucks that users have only the choice between poor performance and dataloss.

Thank you kindly for the fast response.

I’ll try disabling fsync and observe the results (for others: yes, I do understand the consequences; e.g. I already use a non-standard commit= mount option on this filesystem).

Two questions:

  1. What do you think about first trying to tinker with some of these: maxConcurrentWrites, copiers, hashers, pullerMaxPendingKiB? This is a 4-core machine.

  2. Any other reason/solution? As I said, on normal folders this very instance of Syncthing is capable of sustained dozens of MB per second. Maybe some kind of new feature to handle this? (I can submit a GitHub issue if it helps.)

Most of the tunable knobs and their effect on performance are explained in this section of the docs:

https://docs.syncthing.net/users/tuning.html

Could you add a bit more detail about your setup? HDD/SSD? RAID? Filesystem? RAM?

After reading the docs, I’ve started doing more tests in three stages; see the sketch after the phase list for how I’m applying the settings. As an initial scan takes a few hours, this will take some time. I plan to write back with the results:

phase 0: (almost) defaults

phase 1: currently scanning; I started this just after the initial post.
 Folder Settings: 
  receiveonly,
  FSWatcherEnabled=false 

phase 2: 
 Global Settings:  
  databaseTuning = large
  setLowPriority = false
 Folder settings
  hashers=4
  copiers=4
  maxConcurrentWrites=8
  copyRangeMethod=copy_file_range

phase 3:
 Folder settings:
  disableFsync=true
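Roughly how I’m applying these between phases via the REST API (the folder ID “maildir” and the API key are placeholders; everything above is also editable via Actions → Advanced in the GUI):

  # global options (phase 2)
  curl -s -X PATCH -H "X-API-Key: <api-key>" -H "Content-Type: application/json" \
       -d '{"databaseTuning": "large", "setLowPriority": false}' \
       http://localhost:8384/rest/config/options
  # per-folder settings (phase 2)
  curl -s -X PATCH -H "X-API-Key: <api-key>" -H "Content-Type: application/json" \
       -d '{"hashers": 4, "copiers": 4, "maxConcurrentWrites": 8, "copyRangeMethod": "copy_file_range"}' \
       http://localhost:8384/rest/config/folders/maildir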

A few hours for 300k files?

Currently this machine shows Local State: 431,952 files, 18,038 directories, ~365 GiB, of which 330k files (70%) are in this one directory, holding 20 GB (5%) of the data.

Upon start, all 345 GB and 100k files in 18k filesystem directories under 12 Syncthing folders are scanned in under 1 minute (55 seconds to be exact), while this last folder takes 1-2 hours to finish the Scanning phase.

That’s the reason I’m looking at Syncthing as the culprit in the slow handling of directories with very many files, perhaps due to some kind of superfluous file-list refresh on every operation/sync etc. I may be wrong, and this might simply be the limit of my setup, but in that case I’m wondering whether a new Syncthing feature to handle such a scenario would be possible.

As for my tests, phase 1 did NOT show any substantial change. I’m currently waiting for the phase 2 scan to complete (1-2 hrs) to see whether the syncing speed changes.

P.S. Filesystem: btrfs on LUKS on mdraid on NVMe on PCIe v3, capable of sequential read/write of around 500 MB/s. RAM: 4 GB. No other services run on this host.

P.P.S. INFO: Single thread SHA256 performance is 1006 MB/s using minio/sha256-simd (706 MB/s using crypto/sha256). INFO: Hashing performance is 265.23 MB/s

Those are three layers that add overhead.

Make sure to disable the read and write workqueues for LUKS. Also enable discard. Add this to the options field in your crypttab:

luks,discard,no-read-workqueue,no-write-workqueue

Source: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/

It’s also important that Syncthing’s database resides on fast storage. I’d disable btrfs copy-on-write for that folder.
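A sketch of the copy-on-write part, assuming the database sits in the default ~/.config/syncthing location and Syncthing runs as a user systemd service (paths, index directory name and unit name may differ on your setup). Note that chattr +C only takes effect for newly created files, so the database has to be copied into a fresh directory, and the flag also disables checksums/compression for those files:

  systemctl --user stop syncthing
  cd ~/.config/syncthing
  mv index-v0.14.0.db index-v0.14.0.db.bak
  mkdir index-v0.14.0.db
  chattr +C index-v0.14.0.db       # new files in here skip CoW (and data checksums)
  cp -a index-v0.14.0.db.bak/. index-v0.14.0.db/
  systemctl --user start syncthing
  # after verifying everything works: rm -rf index-v0.14.0.db.bak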

thank you for the link, it has a lot of useful information.

I’ll do tests with --perf-no_read_workqueue and --perf-no_write_workqueue as phase 5 tests, and we will see.
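If I read the man page right, these can be applied to the already-open mapping without a reboot, roughly like this (the mapping name cryptdata is a placeholder; --persistent needs LUKS2 and stores the flags in the header so they survive re-opening):

  cryptsetup refresh \
      --perf-no_read_workqueue --perf-no_write_workqueue \
      --allow-discards --persistent \
      cryptdata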

What also bugs me is that, at least in the Scanning phase, evidently only 1 core is used, as the machine load stays around 1.1 all the time.

I’m not setting the GOMAXPROCS env variable, but maybe I should set it manually to 4? AFAIR there is no way to check it on a running Syncthing (which I don’t want to interrupt during the 2-hour scan test).
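If I do end up forcing it, I’d do it with a systemd drop-in roughly like this (assuming the packaged syncthing@<user>.service unit; adjust to however Syncthing is started here):

  # /etc/systemd/system/syncthing@myuser.service.d/gomaxprocs.conf
  [Service]
  Environment=GOMAXPROCS=4

  # then:
  systemctl daemon-reload
  systemctl restart syncthing@myuser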

GOMAXPROCS already defaults to the number of cores.

Yet regardless of hashers=4, only 1 core is used during the Scanning phase.

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 484476   5708 2285700    0    0    10   116    0    6 11  3 87  0  0
 1  0      0 484476   5708 2285700    0    0     0     0  609  450 24  1 75  0  0
 1  0      0 484476   5708 2285792    0    0     0     0 1012 1273 20  5 75  0  0
 1  0      0 446676   5708 2285792    0    0    16     0 1429 1442 31  8 61  0  0

From the vmstat output above it looks CPU-constrained, with only 1 core being used, mostly by user-space, not kernel, tasks.

That’s why Syncthing is still my prime suspect :slight_smile:

Some kind of processing of this huge list of 300k+ entries residing in one directory seems to be done constantly, in a loop.

All other explanations (LUKS, IO, kernel) are ruled out by the arguments below:

  1. the other 100k files, holding many times more data, are scanned ~100x faster (1 minute vs. 2 hrs)
  2. vmstat shows no high IO activity, no time spent waiting on IO (wa), and roughly one fully busy core with us >> sy
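To narrow it down further I’ll also watch per-thread CPU usage of the syncthing process, e.g.:

  # per-thread view of the running syncthing process
  top -H -p "$(pidof syncthing)"
  # or, with sysstat installed, one-second samples per thread:
  pidstat -t -p "$(pidof syncthing)" 1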

fsync is per file, not a global thing; Redis calls it on its single append-only file once a second, whereas we have 300k files here.

there is some sort of global thing, but it’s unimplemented by most modern filesystems from what I understand.


Yeah, those are not portable and are OS-specific.

Here are the preliminary (last phase still running) test results.

TL;DR: even the most aggressive configuration changes did not help, and the syncing speed stays in the range of low KB/s.

Results:

phase 0: Scanning time: 160 minutes, Syncing: slow, dozens of KB/s
  default settings (mostly)
phase 1: Scanning time: 160 minutes, Syncing: slow, dozens of KB/s
  Folder Settings: 
    receiveonly,
    FSWatcherEnabled=false
phase 2: Scanning time: 103 minutes (faster), Syncing: slow, dozens of KB/s
  Global Settings:
    databaseTuning = large
    setLowPriority = false
  Folder settings
    hashers=4
    copiers=4
    maxConcurrentWrites=8
    copyRangeMethod=copy_file_range
phase 2b: similar to phase 2, but only a short test (partial scanning only)
  Global Settings:
    GOMAXPROCS=4
  Folder settings
    hashers=8
    copiers=8
    disableTempIndexes=true
phase 3: Scanning time: 105 minutes (comparable to phase 2), Syncing: slow, dozens of KB/s
  Global Settings:
    GOMAXPROCS=default
    maxFolderConcurrency=4
  Folder settings
    hashers=8
    copiers=8
    maxConcurrentWrites=8
    disableFsync=true # (sic!) this did NOT help
phase 4: UNDER TESTING
  System:
    cryptsetup --perf-no_read_workqueue --perf-no_write_workqueue
    copy-on-write disabled for syncthing db folder

Main guess: CPU constraint due to user-mode code (Syncthing logic) combined with a slow CPU.

Two observations:

  1. Syncthing is using ONLY ONE core per (Syncthing) folder, for both the scanning and syncing phases, so if a single CPU core is slow while the folder requires a lot of CPU, multiple cores do NOT help, as Syncthing does not use them.
  2. there might be a possible Syncthing optimization (besides using multiple cores per folder) for directories with many files (some kind of caching/grouping/sharding), as it currently looks like Syncthing does some form of scan over the whole 380k-entry list per single entry synced.

Also: it seems rather dubious to have copiers/hashers configuration parameters per ‘folder’ while not using more than 1 core per folder.

Please find below vmstat output during sync & scan (no substantial differences observed between the two). Please notice:

  1. only 1 process in the run queue (despite having 4 cores)
  2. small IO activity = no constraint in the IO layer
  3. one core fully used (us+sy ≈ 25% = 1 core) = CPU constraint on single-core code
  4. no wa = no IO constraint
  5. it is Syncthing user-mode code, not kernel code (fs/VM/LUKS), that is the constraint, as us >> sy
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 175436  11736 2526492    0    0    10   115    3    1 11  3 86  0  0
 1  0      0 174176  11736 2526492    0    0     0     0 1565 1895 19  7 74  0  0
 1  0      0 171404  11736 2526496    0    0     0     0 1001 1124 20  5 74  0  0
 1  0      0 173624  11752 2526480    0    0     0  1564 1225 1426 32  2 65  1  0
 1  0      0 173020  11752 2526512    0    0     0     0 1225 1658 20  6 75  0  0

Not sure about copiers, but hashers are only relevant when actually hashing files, which is the case only when Syncthing meets new or modified files in the scanning process. Otherwise, if the files haven’t changed, there is nothing to hash. Scanning on its own doesn’t necessarily include any hashing.


Although Syncthing is impacted, and some tuning might help, it looks like an external issue…

btrfs directly on an NVMe drive has much higher potential throughput than what I’ve got (mSATA SSD):

# hdparm -t /dev/sda

/dev/sda:
 Timing buffered disk reads: 1410 MB in  3.00 seconds = 469.32 MB/sec

One of my Maildir directories (3.6 GB) has only about 1/3 of the file count, but all things being equal, the results should be comparable (no inline btrfs compression or disk encryption):

time ls -1 ~/Mail/inbox | wc -l
111899

real	0m0.150s
user	0m0.106s
sys	0m0.047s

If I had 380,000 files it would take less than 1 second, so comparatively speaking 7 seconds for a directory listing means there’s significant overhead somewhere in your storage stack.

Because it’s email and your system appears to have CPU cycles to spare, if your Linux kernel supports it, enable inline compression in btrfs (use zstd and max it out at level 15). It’ll reduce the on-disk storage requirements by possibly up to 90% for emails without file attachments, but it’ll also greatly reduce the number of disk writes and speed up reads.
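Roughly like this (the mount point /srv/mail below is a placeholder; 15 is the maximum zstd level btrfs accepts, and already-written files only get compressed when rewritten or defragmented):

  mount -o remount,compress=zstd:15 /srv/mail
  # make it permanent by adding compress=zstd:15 to the options field in /etc/fstab
  # optionally compress existing mail in place
  # (note: defragmenting breaks reflinks shared with snapshots, increasing space usage)
  btrfs filesystem defragment -r -czstd /srv/mail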

Also, since btrfs on top of LUKS on top of software RAID likely doesn’t benefit much from btrfs checksums, consider disabling them to remove the extra disk writes.

(A few months ago I had a Maildir directory at work with ~1.8 million files totaling ~25GB – btrfs volume on a SSD – that was mirrored by Syncthing to a standby server. If it hadn’t been moved to a long-term archive I’d run some benchmarks for you.)

I appreciate your help @gadget !!

I already use zstd compression on this volume. Still, as I see it, the vmstat output shows where the bottleneck is: 1 core, and only 1 core, is maxed out, mostly by user-mode code (Syncthing), not kernel code (btrfs/LUKS/devices), and there is NO IO constraint given the low bi/bo numbers.

The problem is: Syncthing is not concurrent within a single folder for the scanning (scanner.walk) and (some parts of?) the syncing phase while processing the list of 380k entries in one directory.

Of course, having a faster CPU would help (and therein lies the reason why your configuration works fine), but since the other 100k files are processed ultra fast on this very machine, it shows that the Syncthing logic is also a bottleneck.

Having the ability to use all cores for these operations would speed it up by roughly 4x; other, more complicated features would probably allow even bigger gains.

You could try to enable the case-sensitive FS flag in Syncthing.
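(That’s the advanced folder option caseSensitiveFS; on a case-sensitive filesystem it skips the case-collision checks, which cost extra directory reads. Same placeholder folder ID / API key pattern as earlier in the thread:)

  curl -s -X PATCH -H "X-API-Key: <api-key>" -H "Content-Type: application/json" \
       -d '{"caseSensitiveFS": true}' \
       http://localhost:8384/rest/config/folders/maildir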
