I had a case where Syncthing would always take hours to sync a specific, very large file. Via CPU profiling, I figured out that the rolling/weak hash implementation caused the receiving side (a NAS with a low-end CPU: Intel J5005 (4C/4T) @ 1.5-2.8 GHz) to mostly use only one CPU core, resulting in a read rate of around 16 MB/s. Setting weakHashThresholdPct to 101 for the folder fixed the problem, as suggested by the documentation.
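For reference, if I read the config schema right, the option can also be set per folder directly in config.xml rather than through the GUI. The folder id and path here are just examples; a real folder element carries more attributes and children:

```xml
<folder id="large-files" path="/data/large-files">
    <!-- 101 means "never use weak hashes" for this folder -->
    <weakHashThresholdPct>101</weakHashThresholdPct>
</folder>
```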
The way I understand it, the rolling hashing is supposed to detect shifting in large files, to avoid a re-transfer of a large number of blocks. The docs say:
If your use case doesn’t cause data to shift in a file, and if the files are large (movies, VM images, …) it is unnecessary to spend time checking weak hashes. Set the threshold to 101% to disable use of weak hashes.
This makes sense - unless the hashing is the bottleneck and not the network. Could there be a bug in the weak hash implementation, causing such a low hashrate as in my case? Does it depend on any CPU features or could this be a memory problem? Or is it just single-threaded and I’m out of luck?
Is there a benchmark for the weak hash that I can run? Syncthing only displays the SHA256 hashrate in the logs on startup.
Instead of a benchmark, I did some math on the numbers given by the usage report.
I have sha256Perf = 615.3 and hashPerf = 412.06 in MB/s. Knowing that sha256Perf > hashPerf and that the code does the hashing sequentially, I can consider the time needed to hash X MB of data and get
X/(X/hashPerf - X/sha256Perf) = adler32Perf
and therefore adler32Perf = 1247.49.
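Spelled out as code, with the same numbers as above (the derivation just works in per-MB times, so X cancels out):

```go
package main

import "fmt"

// adler32Perf derives the implied weak-hash throughput from the combined rate
// (hashPerf: SHA-256 and adler32 run sequentially) and the SHA-256-only rate:
// time per MB combined is 1/hashPerf, the SHA-256 share is 1/sha256Perf, and
// the remainder is adler32, so 1/adler32Perf = 1/hashPerf - 1/sha256Perf.
func adler32Perf(sha256Perf, hashPerf float64) float64 {
	return 1 / (1/hashPerf - 1/sha256Perf)
}

func main() {
	fmt.Printf("adler32Perf = %.2f MB/s\n", adler32Perf(615.3, 412.06)) // ≈ 1247.49
}
```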
If this is correct, then the main problem indeed seems to be map access in weakhash.Find.
I ran another test with weakHashThresholdPct = -1 to take a heap profile. I didn’t see anything useful in there, but all of Syncthing’s threads were each capped at 16 MB/s during weak hashing, so this problem is not specific to my file. It also tells me that this isn’t a memory or hard disk bottleneck. I wonder what makes Go map access so dreadfully slow on a low-end CPU then. This degrades Syncthing performance significantly whenever weak hashes are used on a low-end device in a fast local network.
I’d be interested to hear from other low-end CPU users and will try to do further tests.
The actual speed of adler32 is more or less irrelevant for the Find function. It steps through the file byte by byte, and for each byte it grabs the rolling hash so far and looks it up in a map of all interesting block hashes. For a large file that’s a shitload of map lookups, and each one entails something like a short hash function over the map key, so I’m not surprised it takes ages.
Doing some benchmarking, on my rather fast M2 I get a “find rate” of ~82 MB/s (i.e., 82 million map lookups/s plus some negligible work in hashing). Your 16 MB/s for an older CPU seems reasonable. Disabling weak hashes is the way.
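For anyone curious, the hot loop looks conceptually like this. This is my simplified sketch, not the actual weakhash code: the real implementation rolls the adler32 incrementally instead of recomputing it per window, but the point stands either way, since the map lookup per byte offset is what dominates:

```go
package main

import (
	"fmt"
	"hash/adler32"
)

// findOffsets is a simplified sketch of a weakhash.Find-style scan: slide a
// window one byte at a time and look each window's hash up in a map of
// candidate block hashes. NOTE: recomputing the checksum per window keeps
// the sketch short; the real code updates a rolling hash incrementally.
func findOffsets(data []byte, blockSize int, want map[uint32]struct{}) []int64 {
	var offsets []int64
	for i := 0; i+blockSize <= len(data); i++ {
		h := adler32.Checksum(data[i : i+blockSize])
		if _, ok := want[h]; ok { // one map lookup per byte offset
			offsets = append(offsets, int64(i))
		}
	}
	return offsets
}

func main() {
	data := []byte("0123456789abcdef0123456789abcdef")
	want := map[uint32]struct{}{adler32.Checksum([]byte("abcdef0123456789")): {}}
	fmt.Println(findOffsets(data, 16, want)) // the block is found at offset 10
}
```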
I did another test with an even slower machine, where I get around 6.5 MB/s. Alright, disabling the hashes it is, maybe even for all my folders. There is no global switch for this, correct?
I’ve tried coming up with an automatic rule for enabling or disabling weak hashes, so that the end user wouldn’t have to tune this. Something like “if the connection is faster than map access, disable weak hashes”. But determining this reliably sounds like a whole bunch of additional code and not worth the extra complexity, unless Syncthing has a way of knowing if the connection is local and wired. But I’m happy with being able to solve this at all.
After some reading, maybe a bloom filter created from the hashesToFind slice could speed up the lookup. Instead of checking the hashmap for every rolling hash, we could first ask a bloom filter to let us know if the hash is not in hashesToFind, which is very often the case, and then do the hashmap lookup only if the bloom filter claims the hash is in the set. Whether this would be faster is hard to predict.
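A minimal sketch of that idea (entirely hypothetical, not Syncthing code; the filter size, the two derived bit positions, and the multiplicative mixing constant are my choices just to illustrate the fast-path rejection):

```go
package main

import "fmt"

// bloom is a tiny Bloom filter over uint32 weak hashes. The point of the
// proposal: most rolling hashes are NOT in hashesToFind, so a cheap
// "definitely not present" answer lets us skip the map lookup entirely.
type bloom struct {
	bits []uint64
	mask uint32
}

// newBloom allocates a filter with sizePow2 bits (must be a power of two ≥ 64).
func newBloom(sizePow2 uint32) *bloom {
	return &bloom{bits: make([]uint64, sizePow2/64), mask: sizePow2 - 1}
}

// positions derives two bit positions from the hash itself
// (second one via Knuth's multiplicative constant).
func (b *bloom) positions(h uint32) (uint32, uint32) {
	return h & b.mask, (h * 2654435761) & b.mask
}

func (b *bloom) add(h uint32) {
	p1, p2 := b.positions(h)
	b.bits[p1/64] |= 1 << (p1 % 64)
	b.bits[p2/64] |= 1 << (p2 % 64)
}

// maybeContains never returns a false negative; a true result still
// requires the real map lookup to confirm.
func (b *bloom) maybeContains(h uint32) bool {
	p1, p2 := b.positions(h)
	return b.bits[p1/64]&(1<<(p1%64)) != 0 && b.bits[p2/64]&(1<<(p2%64)) != 0
}

func main() {
	want := map[uint32]struct{}{0xdeadbeef: {}, 0x12345678: {}}
	f := newBloom(1 << 16)
	for h := range want {
		f.add(h)
	}
	for _, h := range []uint32{0xdeadbeef, 0xcafebabe} {
		if f.maybeContains(h) {
			_, ok := want[h] // only now pay for the map lookup
			fmt.Printf("%#x: maybe in set; map says %v\n", h, ok)
		} else {
			fmt.Printf("%#x: rejected by filter, map lookup skipped\n", h)
		}
	}
}
```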
Something else I noticed: The documentation motivates rolling hashes as a way to detect if blocks have shifted to different positions when a lot of data was inserted into or removed from the file. But the code checks every offset for every block, which is much more powerful, as it also detects arbitrary permutations of blocks. Is there actually a realistic scenario where this can happen? If not, assuming that blocks which still exist did not change their relative order could allow us to look only at a small window of the blocks represented by hashesToFind, i.e., instead of doing a map lookup for the current hash, compare it to 10 consecutive elements in hashesToFind.
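To make the idea concrete (purely hypothetical; windowFind, its window size, and the example values are mine, and it assumes the relative block order really is preserved):

```go
package main

import "fmt"

// windowFind sketches the "blocks keep their relative order" idea: instead of
// a map lookup per offset, compare the rolling hash against a small window of
// the next expected entries in hashesToFind, starting after the last match.
// Caveat: it misses matches whenever blocks are reordered beyond the window.
func windowFind(rolling uint32, hashesToFind []uint32, next, window int) (int, bool) {
	end := next + window
	if end > len(hashesToFind) {
		end = len(hashesToFind)
	}
	for i := next; i < end; i++ {
		if hashesToFind[i] == rolling {
			return i, true
		}
	}
	return -1, false
}

func main() {
	hashes := []uint32{11, 22, 33, 44, 55}
	// Having already matched entry 1, only look a few entries ahead.
	fmt.Println(windowFind(44, hashes, 2, 10)) // found at index 3
	fmt.Println(windowFind(11, hashes, 2, 10)) // reordered block: missed
}
```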
Don’t mind me, I’m like a dog with a bone when I encounter an optimization problem.
That explains why I have a very similar issue with rsync on the same large file and have to use the -W switch.
Rsync does pretty much the same, in a sort of “mirrored” way: with Syncthing, the recipient does the rolling hash and decides what to pull; with rsync, the sender does the rolling hash and decides what to send (roughly speaking). There’s also additional verification with a cryptographic hash. Source: rsync.nw
Reading Rolling hash - Wikipedia, there are approaches with variable block sizes that seem to avoid the computationally intensive search for blocks, but those aren’t an option here anyway, and I’m not sure how they work.
The hashing is fast enough. Theoretically, a function that reads through the file could write the rolling hashes into a buffer, which would be read by multiple worker threads that check and update the map. But Go maps don’t support concurrent writes, so each worker would need its own map, to be merged later.
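A rough shape of that split-and-merge (my sketch, not Syncthing code; hashes are recomputed per window instead of rolled to keep it short, and each worker owns a range of window-start offsets, writing only to its own result map):

```go
package main

import (
	"fmt"
	"hash/adler32"
	"sync"
)

// parallelFind splits the scan across workers. The candidate map `want` is
// only read, which is safe concurrently in Go; each worker writes matches
// into its own map, and the per-worker maps are merged at the end.
func parallelFind(data []byte, blockSize, workers int, want map[uint32]struct{}) map[uint32][]int64 {
	results := make([]map[uint32][]int64, workers)
	chunk := (len(data) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			found := make(map[uint32][]int64)
			start, end := w*chunk, (w+1)*chunk
			if end > len(data)-blockSize+1 {
				end = len(data) - blockSize + 1
			}
			for i := start; i < end; i++ {
				h := adler32.Checksum(data[i : i+blockSize])
				if _, ok := want[h]; ok { // concurrent reads of want are safe
					found[h] = append(found[h], int64(i))
				}
			}
			results[w] = found
		}(w)
	}
	wg.Wait()
	merged := make(map[uint32][]int64)
	for _, m := range results {
		for h, offs := range m {
			merged[h] = append(merged[h], offs...)
		}
	}
	return merged
}

func main() {
	data := []byte("0123456789abcdef0123456789abcdef")
	h := adler32.Checksum([]byte("0123456789abcdef"))
	merged := parallelFind(data, 16, 4, map[uint32]struct{}{h: {}})
	fmt.Println(merged[h]) // offsets where the block hash matches
}
```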
Parallelizing the offset map creation sounds promising to me in theory, but I don’t know how well it would fit into the codebase.
As for my idea of assuming the relative order of blocks is unchanged and checking only a constant number of hashes - I have the feeling one can construct worst cases where this fails terribly.
That means we do a bunch more memory allocations than necessary. However, from what was written about the profiles (I haven’t looked myself), that’s not the part that is using the time, but the actual lookups. So yes, pre-allocating would be good since we know the size, but it’s not going to improve the situation here.