Hello everyone,
I have observed an interesting behavior since I am running on v1.10 on my two biggest shared folders.
The scanning times increased a lot as it is going very slowly through the file system and take a couple of days to complete (initial scan and continuing scans). Log sample for shared folder tmc-vc:
Oct 20 06:20:22 syncthingsrv1 syncthing_tmc[6770]: [WWQT1F] 2020/10/20 06:20:22.451065 walk.go:177: DEBUG: Walk tmc-vc [] current progress 9698748137/12424138566 at 0.0 MiB/s (78%)
Oct 20 06:20:24 syncthingsrv1 syncthing_tmc[6770]: [WWQT1F] 2020/10/20 06:20:24.450609 walk.go:177: DEBUG: Walk tmc-vc [] current progress 9698748137/12424138566 at 0.0 MiB/s (78%)
Oct 20 06:20:26 syncthingsrv1 syncthing_tmc[6770]: [WWQT1F] 2020/10/20 06:20:26.450950 walk.go:177: DEBUG: Walk tmc-vc [] current progress 9698748137/12424138566 at 0.0 MiB/s (78%)
Oct 20 06:20:28 syncthingsrv1 syncthing_tmc[6770]: [WWQT1F] 2020/10/20 06:20:28.451001 walk.go:177: DEBUG: Walk tmc-vc [] current progress 9698748137/12424138566 at 0.0 MiB/s (78%)
Oct 20 06:20:30 syncthingsrv1 syncthing_tmc[6770]: [WWQT1F] 2020/10/20 06:20:30.451323 walk.go:177: DEBUG: Walk tmc-vc [] current progress 9698748137/12424138566 at 0.0 MiB/s (78%)
Oct 20 06:20:32 syncthingsrv1 syncthing_tmc[6770]: [WWQT1F] 2020/10/20 06:20:32.451287 walk.go:177: DEBUG: Walk tmc-vc [] current progress 9698750223/12424138566 at 0.0 MiB/s (78%)
Oct 20 06:20:34 syncthingsrv1 syncthing_tmc[6770]: [WWQT1F] 2020/10/20 06:20:34.451049 walk.go:177: DEBUG: Walk tmc-vc [] current progress 9698750223/12424138566 at 0.0 MiB/s (78%)
# from 06:20:32 till 06:21:06 it seems to stuck, than
Oct 20 06:21:06 syncthingsrv1 syncthing_tmc[6770]: [WWQT1F] 2020/10/20 06:21:06.450766 walk.go:177: DEBUG: Walk tmc-vc [] current progress 9698751266/12424138566 at 0.0 MiB/s (78%)
Oct 20 06:21:08 syncthingsrv1 syncthing_tmc[6770]: [WWQT1F] 2020/10/20 06:21:08.451085 walk.go:177: DEBUG: Walk tmc-vc [] current progress 9698751266/12424138566 at 0.0 MiB/s (78%)
Oct 20 06:21:10 syncthingsrv1 syncthing_tmc[6770]: [WWQT1F] 2020/10/20 06:21:10.450562 walk.go:177: DEBUG: Walk tmc-vc [] current progress 9698751266/12424138566 at 0.0 MiB/s (78%)
# than it stuck again for nearly 30s and so on
From the webinterface where the “Scan Time Remaining” is jumping between 12 hours and > 1 month:
On the another server which is on software level the exactly same, it is working with out any issues. The only difference is, that the working one is on running syncthing version 1.5 with 20G of RAM. In this version the servers were able to scan the full shared folder within 2 - 4 hours. In one forum post there were some talking about to less ram, so I increased it on the affected hosts to 32G.
Server specs
OS: Debian 10
CPU Cores: 8
Mem: 32G
FS: xfs inside of drbd, read: IOPS=55.0k, BW=215MiB/s (225MB/s)(3118MiB/14509msec)
Resources used
Syncthing uses 25.3% of the RAM and has a CPU usage (overall) between 480% and 620% and hase a nice level of 11 (which is default) and prio 31
Downgrade test
I have performed a downgrade for testing if it recovers and it did. So when it was back to version 1.5 it performs the scans way faster, after the upgrade to 1.10 again, it slows down.
Question
What can I do there to improve the scanning time again?
Yes, that is right, we are using the debian packages.
I was on the 1.9 as well and had there similar issues, but had to downgrade it again very quickly so no time for testing/analyzing
No need to test anymore then, my suspected culprit was added in v1.9.0 which matches your experience. And the cpu profile corroborates it: The added case-insensitivity safety mechanism uses almost all of the cpu time. Compared to another report about slowdown it isn’t filesystem access here, but looping/in memory operations that use the time.
Anyway the same config to fix it for you and info that might help us improve it for everyone as in the other thread applies:
Of note in the profile: Operations through db walking and osutil.IsDeleted result in 46s spent in “case-code” out of total 88s, while for filesystem walking the “case-code” isn’t even listed as consuming any time at all. Walking the fs we process all items along the filesystem tree then alphabetically, walking the db it’s directly alphabetically. I don’t understand how that would cause the discrepancy though. Needs some more thought.
Thanks for the fast first analyse.
Well, I guess you mean that I just should just set caseSensitiveFS=true on the syncthing services right?
Do you want me to also provide some numbers, about files and structure as you have asked in the linked post?
Sorry for not having been clear: Yes, to make your setup as fast as before, set caseSensitiveFS to true. And yes, to potentially gain insight why you experience such a strong slowdown (that’s unexpected) and thus find something to optimize, please also provide this info on files and structure.
Thanks for the info! That’s consistent with the cpu info: You have a huge amount of files per directory, and for every item in the directory we need to compare its name with all of the others, which just takes a lot of time.
The cpu profile would be good to see. Even if it might not contain viable info to improve the current situation, it can at least show if the case setting did take effect or something went wrong there.
caseSensitiveFS option is working. Now most time is spent in db and matching ignores. The total cpu time used in the 30s profiling interval is virtually the same as before. Looks like that wasn’t the culprit in the first place.
The usual suspect for the bottleneck would be storage, however that should have the same effect on 1.5 and 1.10. Still just to know: Is there significant seeking/iowait? Is the db on the same storage as the folders/data? Does changing maxFolderConcurrency to 1 make any difference?