Trying to sync a folder (1 TB) with subfolders (around 2 million files per folder).
After 10 days it’s still scanning. If I activate the debug log (scanner), there is a log line every 20 seconds or so about a new file being hashed. Why is it so slow? Is there any way to speed it up?
It seems to take way too long to hash each file. 20 seconds per file means 3 files per minute, 180 files per hour, and only 4,320 files per day. At that rate a single 2-million-file subfolder alone would take over a year; with this kind of speed, it’s going to take forever to scan all the files.
@xor-gate Syncthing-macos is distributing ARM builds on M1s, right? FS access through the compat layer is known to be ridiculously slow (ok, “is known” - it is for Docker and that’s all I know).
@syncer What did the log at startup say about hashing speed?
2022-10-06 19:44:17 Single thread SHA256 performance is 1073 MB/s using crypto/sha256 (688 MB/s using minio/sha256-simd).
2022-10-06 19:44:18 Hashing performance is 347.74 MB/s
2022-10-06 19:44:18 Overall send rate is unlimited, receive rate is unlimited
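For a rough idea of what that startup benchmark measures, here is a minimal Go sketch that hashes a fixed buffer in a loop with the standard crypto/sha256 package and reports throughput. It is only an approximation for illustration, not the benchmark Syncthing actually runs:

package main

import (
	"crypto/sha256"
	"fmt"
	"time"
)

func main() {
	// Hash a 16 MiB buffer repeatedly for about a second and report MB/s.
	// This only approximates the single-thread figure from the startup log.
	buf := make([]byte, 16<<20)
	h := sha256.New()
	var hashed int64
	start := time.Now()
	for time.Since(start) < time.Second {
		h.Reset()
		h.Write(buf)
		h.Sum(nil)
		hashed += int64(len(buf))
	}
	secs := time.Since(start).Seconds()
	fmt.Printf("~%.0f MB/s single-thread SHA256\n", float64(hashed)/secs/1e6)
}

At the reported 347 MB/s overall rate, 1 TB of data would hash in well under an hour, so raw hashing speed is clearly not what makes the scan take days.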
What’s the CPU, memory and disk usage like? Grabbing two CPU profiles (and, if memory usage is significant, also a memory profile) would show where it spends its time: Profiling — Syncthing v1.22.0 documentation. For big folders the metadata checking might take a lot of resources (and that could be disabled), but given you are already at the hashing stage, that doesn’t apply here.
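The linked Profiling page describes the exact switches Syncthing provides for this. Purely as an illustration of what capturing a Go CPU profile looks like in general (the localhost:9090 address here is arbitrary), any Go program can expose the standard pprof endpoints like so:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints on localhost only.
	log.Println(http.ListenAndServe("localhost:9090", nil))
}

A 30-second CPU profile can then be captured with: go tool pprof http://localhost:9090/debug/pprof/profile?seconds=30. For Syncthing itself, follow the documented procedure rather than this sketch.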
That’s a lot of files. I’m going to assume that you mean Syncthing folders, not that you have a single folder on disk with two million files in it. Regardless, even if you have folders with “just” thousands of files in them, directory access tends to be very slow when there are a lot of files in a folder. We do a lot of directory accesses.
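To get a feel for how expensive a single listing of such a directory is on your machine, here is a quick sketch in Go (the path is a placeholder, substitute one of your big folders):

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Time one full listing of a large directory.
	const dir = "/path/to/large/folder" // placeholder path
	start := time.Now()
	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Println("readdir failed:", err)
		return
	}
	fmt.Printf("listed %d entries in %v\n", len(entries), time.Since(start))
}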
Maybe you can trick Syncthing by gradually adding more files? In my experience, once the initial scan is finished, Syncthing works ok(-ish) with a huge number of files.
Are those files located on a spinning disk? I’ve done a local test in Windows with a single folder that contained exactly 1,000,000 files. It took less than an hour to scan it with a 4-core, 8-thread Ryzen 4350G CPU. However, the folder in my case was located on a RAM disk. Just in case yours is located on an HDD, I’d strongly suggest trying at least a decent SSD instead and then seeing how long it takes Syncthing to scan it.
Just for the record, Scan Progress Interval was set to -1 (i.e. disabled).
It’s not an IO thing. They said they use the internal storage in the M1 MacBook, which is really speedy. It’s that listdir on a directory with a million files takes ages, and we do a lot of those for case-insensitivity lookups.
Here’s an example from my M1 Ultra, internal storage:
jb@sep:~ % mkdir tmp/large
jb@sep:~ % for ((i=0; i<1000000; i++)); do echo tmp/large/file-$i ; done | xargs touch
jb@sep:~ % time ls tmp/large | wc -l
1000000
ls -F tmp/large 7.11s user 5.18s system 64% cpu 19.044 total
wc -l 0.00s user 0.00s system 0% cpu 19.043 total
It takes 19 seconds just to list the names in the directory. We do this at minimum once per file we scan. Sure, we cache the result, but only for about 5 seconds, so it will have expired by the time we need it next…
Certainly some other software may do this better, or at least with different tradeoffs between accuracy and performance than we have, but it’s a well-known issue that is best avoided by a more reasonable directory structure.
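To make that concrete, here is a heavily simplified sketch of a case-insensitivity lookup backed by a short-lived directory-listing cache. The type, function names and the TTL are made up for illustration; this is not Syncthing’s actual implementation:

package casecache

import (
	"os"
	"strings"
	"time"
)

// dirCache remembers the lowercased names of one directory for a short while,
// so repeated case-insensitive lookups don't have to re-list it every time.
type dirCache struct {
	dir     string
	ttl     time.Duration     // e.g. 5 * time.Second
	names   map[string]string // lowercased name -> real on-disk name
	fetched time.Time
}

// lookup returns the real on-disk name matching name case-insensitively.
func (c *dirCache) lookup(name string) (string, bool, error) {
	if c.names == nil || time.Since(c.fetched) > c.ttl {
		// Cache empty or expired: pay the full cost of listing the directory.
		entries, err := os.ReadDir(c.dir)
		if err != nil {
			return "", false, err
		}
		c.names = make(map[string]string, len(entries))
		for _, e := range entries {
			c.names[strings.ToLower(e.Name())] = e.Name()
		}
		c.fetched = time.Now()
	}
	actual, ok := c.names[strings.ToLower(name)]
	return actual, ok, nil
}

With a listing that itself takes ~19 seconds and a TTL of a few seconds, almost every lookup ends up re-listing the directory, which is exactly the behaviour described above.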
Sounds like we should increase those timeouts for large folders if the memory usage doesn’t completely explode? As a user I’d expect higher resource usage for this kind of workload.
Probably wouldn’t hurt. The cache time is short because the lookups prevent data loss, but maybe the cache time should be max(5 sec, 5 * lookup time) or something.
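As a sketch of that heuristic (the function name and package are made up for illustration):

package casecache

import "time"

// adaptiveTTL keeps the cache for at least 5 seconds, but never for less than
// five times what the last directory listing cost, so very slow directories
// stay cached long enough to be reused.
func adaptiveTTL(lookupTime time.Duration) time.Duration {
	ttl := 5 * time.Second
	if scaled := 5 * lookupTime; scaled > ttl {
		ttl = scaled
	}
	return ttl
}

With the 19-second listing from the example above, that would keep the cache for roughly 95 seconds instead of 5.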