Weak hashing performance vs. network speed

I had a case where Syncthing would always take hours to sync a specific, very large file. Via CPU profiling, I figured out that the rolling/weak hash implementation caused the receiving side (a NAS with a low-end CPU: Intel J5005, 4C/4T @ 1.5-2.8 GHz) to mostly use only one CPU core, resulting in a read rate of around 16 MB/s. Setting weakHashThresholdPct to 101 for the folder fixed the problem, as suggested by the documentation.

The way I understand it, the rolling hash is supposed to detect shifted data in large files, to avoid re-transferring a large number of blocks. The docs say:

If your use case doesn’t cause data to shift in a file, and if the files are large (movies, VM images, …) it is unnecessary to spend time checking weak hashes. Set the threshold to 101% to disable use of weak hashes.

This makes sense - unless the hashing is the bottleneck rather than the network. Could there be a bug in the weak hash implementation that causes such a low hash rate in my case? Does it depend on any CPU features, or could this be a memory problem? Or is it just single-threaded and I’m out of luck?

Is there a benchmark for the weak hash that I can run? Syncthing only displays the SHA256 hashrate in the logs on startup.

Here is the pprof -top output, for reference:

File: syncthing
Type: cpu
Time: Apr 10, 2023 at 11:22pm (CEST)
Duration: 90.33s, Total samples = 88.31s (97.77%)
Showing nodes accounting for 88.02s, 99.67% of 88.31s total
Dropped 62 nodes (cum <= 0.44s)
      flat  flat%   sum%        cum   cum%
    30.61s 34.66% 34.66%     43.29s 49.02%  runtime.mapaccess2_fast32
    18.90s 21.40% 56.06%     18.90s 21.40%  github.com/chmduquesne/rollinghash/adler32.(*Adler32).Roll (inline)
     8.85s 10.02% 66.09%      8.85s 10.02%  runtime.memhash32
     5.70s  6.45% 72.54%     88.22s 99.90%  github.com/syncthing/syncthing/lib/weakhash.Find
     4.43s  5.02% 77.56%      7.49s  8.48%  bufio.(*Reader).ReadByte
     3.93s  4.45% 82.01%      5.52s  6.25%  context.(*cancelCtx).Done
     3.22s  3.65% 85.65%      4.12s  4.67%  runtime.chanrecv
     2.59s  2.93% 88.59%      2.59s  2.93%  runtime/internal/syscall.Syscall6
     2.47s  2.80% 91.38%      6.59s  7.46%  runtime.selectnbrecv
        2s  2.26% 93.65%         2s  2.26%  runtime.add (inline)
     1.59s  1.80% 95.45%      1.59s  1.80%  sync/atomic.(*Value).Load (inline)
     0.90s  1.02% 96.47%      0.90s  1.02%  runtime.empty (inline)
     0.73s  0.83% 97.29%      0.73s  0.83%  github.com/chmduquesne/rollinghash/adler32.(*Adler32).Sum32 (inline)
     0.65s  0.74% 98.03%      0.92s  1.04%  runtime.(*bmap).overflow (inline)
     0.56s  0.63% 98.66%      0.56s  0.63%  runtime.bucketShift (inline)
     0.46s  0.52% 99.18%      0.79s  0.89%  runtime.(*bmap).keys (inline)
     0.15s  0.17% 99.35%      0.71s   0.8%  runtime.bucketMask (inline)
     0.13s  0.15% 99.50%      3.06s  3.47%  bufio.(*Reader).fill
     0.04s 0.045% 99.55%      2.91s  3.30%  os.(*File).read (inline)
     0.03s 0.034% 99.58%      2.82s  3.19%  syscall.read
     0.02s 0.023% 99.60%      2.93s  3.32%  os.(*File).Read
     0.02s 0.023% 99.63%      2.57s  2.91%  syscall.RawSyscall6
     0.02s 0.023% 99.65%      2.80s  3.17%  syscall.Syscall
     0.01s 0.011% 99.66%      2.87s  3.25%  internal/poll.(*FD).Read
     0.01s 0.011% 99.67%      2.84s  3.22%  internal/poll.ignoringEINTRIO (inline)
         0     0% 99.67%     88.22s 99.90%  github.com/syncthing/syncthing/lib/model.(*sendReceiveFolder).copierRoutine
         0     0% 99.67%     88.22s 99.90%  github.com/syncthing/syncthing/lib/model.(*sendReceiveFolder).initWeakHashFinder
         0     0% 99.67%     88.22s 99.90%  github.com/syncthing/syncthing/lib/model.(*sendReceiveFolder).pullerIteration.func2
         0     0% 99.67%     88.22s 99.90%  github.com/syncthing/syncthing/lib/weakhash.NewFinder
         0     0% 99.67%      2.82s  3.19%  syscall.Read (inline)

Rolling is state dependent, so I’m not certain you can parallelize it.

The rolling is a simple byte-in, byte-out operation; there are no special CPU instructions for it as far as I know, and I suspect throughput is purely a product of the clock speed.

You can probably write your own benchmark using the same library to see where it’s spending time.
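For illustration, here is roughly what such a micro-benchmark could look like. This is a sketch: it hand-rolls the Adler-32 rolling update instead of importing github.com/chmduquesne/rollinghash, and the 128 KiB window and the map contents are invented, but the hot loop (one roll plus one map lookup per byte) mirrors what weakhash.Find is reported to do.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const mod = 65521 // largest prime below 2^16, as in Adler-32

// rollingAdler32 keeps the Adler-32 of a fixed-size window and updates
// it in O(1) per byte. This is a hand-rolled stand-in for the
// chmduquesne/rollinghash library, so the benchmark is dependency-free.
type rollingAdler32 struct {
	a, b uint32
	n    uint32 // window size in bytes
}

func newRolling(window []byte) *rollingAdler32 {
	h := &rollingAdler32{a: 1, n: uint32(len(window))}
	for _, c := range window {
		h.a = (h.a + uint32(c)) % mod
		h.b = (h.b + h.a) % mod
	}
	return h
}

// roll removes out (the oldest byte of the window) and appends in:
// a' = a - out + in, b' = b - n*out + a' - 1 (all mod 65521).
func (h *rollingAdler32) roll(out, in byte) {
	h.a = (h.a + uint32(in) + mod - uint32(out)) % mod
	h.b = (h.b + h.a + 2*mod - 1 - h.n%mod*uint32(out)%mod) % mod
}

func (h *rollingAdler32) sum32() uint32 { return h.b<<16 | h.a }

func main() {
	const window = 128 << 10 // assumed block size: 128 KiB
	data := make([]byte, 8<<20)
	rand.Read(data)

	// Lookup table of "interesting" weak hashes, one per aligned block,
	// standing in for the map that weakhash.Find consults.
	want := make(map[uint32]struct{}, len(data)/window)
	for i := 0; i+window <= len(data); i += window {
		want[newRolling(data[i:i+window]).sum32()] = struct{}{}
	}

	h := newRolling(data[:window])
	hits := 0
	start := time.Now()
	for i := window; i < len(data); i++ {
		h.roll(data[i-window], data[i])
		if _, ok := want[h.sum32()]; ok {
			hits++
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("rolled over %d bytes in %v: %.1f MB/s, %d hits\n",
		len(data)-window, elapsed,
		float64(len(data)-window)/elapsed.Seconds()/1e6, hits)
}
```

Comparing the rate with and without the map lookup in the loop should show how much of the time goes to hashing versus lookups.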

Also, from the trace it looks like it spends most of its time in Find, which I think mostly does general Go map access (I’m on my phone, so don’t take this as gospel).

Instead of a benchmark, I did some math on the numbers given by the usage report.

I have sha256Perf = 615.3 and hashPerf = 412.06, both in MB/s. Knowing that sha256Perf > hashPerf and that the code does the two hashes sequentially, I can consider the time needed to hash X MB of data and get

X/(X/hashPerf - X/sha256Perf) = adler32Perf

and therefore adler32Perf = 1247.49.
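The derivation can be sanity-checked in a few lines of Go (the rates are the ones from the usage report above; the per-MB times of the two hashes simply add up when they run sequentially):

```go
package main

import "fmt"

// adlerPerf derives the Adler-32-only rate from the combined rate and
// the SHA-256-only rate, assuming the per-MB hashing times are additive:
// X/hashPerf = X/sha256Perf + X/adler32Perf.
func adlerPerf(hashPerf, sha256Perf float64) float64 {
	return 1 / (1/hashPerf - 1/sha256Perf)
}

func main() {
	// Rates from the usage report, in MB/s.
	fmt.Printf("adler32Perf = %.2f MB/s\n", adlerPerf(412.06, 615.30)) // ≈ 1247.5
}
```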

If this is correct, then the main problem indeed seems to be map access in weakhash.Find.

I ran another test with weakHashThresholdPct = -1 to make a heap profile. I didn’t see anything useful in there, but all of Syncthing’s threads were each capped at 16 MB/s during weak hashing, so this problem is not specific to my file. It also tells me that this isn’t a memory or hard disk bottleneck. I wonder what makes Go map access so dreadfully slow on a low-end CPU, then. This degrades Syncthing’s performance significantly whenever weak hashes are used on a low-end device in a fast local network.

I’d be interested to hear from other low-end CPU users and will try to do further tests.

The actual speed of adler32 is more or less irrelevant for the Find function. It steps through the file byte by byte, and for each byte it grabs the (rolling) hash so far and looks it up in a map of all interesting block hashes. For a large file that’s a shitload of map lookups, and each one entails something like a short hash function of the map key, so I’m not surprised it takes ages.

Doing some benchmarking, on my rather fast M2 I get a “find rate” of ~82 MB/s (i.e., 82 million map lookups/s plus some negligible work in hashing). Your 16 MB/s for an older CPU seems reasonable. Disabling weak hashes is the way.

I did another test with an even slower machine, where I get around 6.5 MB/s. Alright, disabling the hashes it is, maybe even for all my folders. There is no global switch for this, correct?

I’ve tried coming up with an automatic rule for enabling or disabling weak hashes, so that the end user wouldn’t have to tune this. Something like “if the connection is faster than map access, disable weak hashes”. But determining this reliably sounds like a whole bunch of additional code and not worth the extra complexity, unless Syncthing has a way of knowing if the connection is local and wired. But I’m happy with being able to solve this at all.

After some reading, maybe a bloom filter created from the hashesToFind slice could speed up the lookup. Instead of checking the hashmap for every rolling hash, we could first ask a bloom filter to let us know if the hash is not in hashesToFind, which is very often the case, and then do the hashmap lookup only if the bloom filter claims the hash is in the set. Whether this would be faster is hard to predict.
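To make the idea concrete, here is a minimal sketch of such a pre-filter. The size, the hash mixing, and all names are invented for illustration; this is not Syncthing code, and whether it actually wins depends on the filter staying cache-resident:

```go
package main

import "fmt"

// bloom is a minimal Bloom filter over uint32 weak hashes.
type bloom struct {
	bits []uint64
	mask uint32
}

func newBloom(logSize uint) *bloom {
	n := uint32(1) << logSize // number of bits, a power of two
	return &bloom{bits: make([]uint64, n/64), mask: n - 1}
}

// idx derives two cheap bit positions from the 32-bit key.
func (b *bloom) idx(h uint32) (uint32, uint32) {
	h1 := h * 2654435761 // Knuth's multiplicative hash
	h2 := (h ^ h>>16) * 0x45d9f3b
	return h1 & b.mask, h2 & b.mask
}

func (b *bloom) add(h uint32) {
	i, j := b.idx(h)
	b.bits[i/64] |= 1 << (i % 64)
	b.bits[j/64] |= 1 << (j % 64)
}

// mayContain returning false is definitive (no false negatives);
// true may be a false positive, so the map must still be consulted.
func (b *bloom) mayContain(h uint32) bool {
	i, j := b.idx(h)
	return b.bits[i/64]&(1<<(i%64)) != 0 && b.bits[j/64]&(1<<(j%64)) != 0
}

func main() {
	hashesToFind := []uint32{0x1234abcd, 0xcafebabe}
	offsets := make(map[uint32][]int64, len(hashesToFind))
	f := newBloom(16) // 2^16 bits = 8 KiB, small enough for L1 cache
	for _, h := range hashesToFind {
		f.add(h)
		offsets[h] = nil
	}
	// Hot loop: only fall through to the map when the filter says "maybe".
	for _, h := range []uint32{0x1234abcd, 0x99999999} {
		if f.mayContain(h) {
			_, ok := offsets[h]
			fmt.Printf("%#x: map consulted, found=%v\n", h, ok)
		} else {
			fmt.Printf("%#x: skipped by filter\n", h)
		}
	}
}
```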

Something else I noticed: The documentation motivates rolling hashes as a way to detect if blocks have shifted to different positions when a lot of data was inserted into or removed from the file. But the code checks every offset for every block, which is much more powerful, as it also detects arbitrary permutations of blocks. Is there actually a realistic scenario where this can happen? If not, assuming that blocks which still exist did not change their relative order could allow us to look only at a small window of the blocks represented by hashesToFind, i.e., instead of doing a map lookup for the current hash, compare it to 10 consecutive elements in hashesToFind.

Don’t mind me, I’m like a dog with a bone when I encounter an optimization problem. :slight_smile:

The idea was simply taken from rsync. I assume how rsync does it is a bit different however.

Perhaps bloomfilter would help, but it needs benchmarking.

That explains why I have a very similar issue with rsync on the same large file and have to use the -W switch.

Rsync does pretty much the same, in a sort of “mirrored” way: with Syncthing, the recipient does the rolling hash and decides what to pull; with rsync, the sender does the rolling hash and decides what to send (roughly speaking). There’s also additional verification with a cryptographic hash. Source: rsync.nw

Reading Rolling hash - Wikipedia, there are approaches with variable block sizes that seem to avoid the computationally intensive search for blocks, but those aren’t an option here anyway, and I’m not sure how they work.

Not a Go dev, but the offsets map is created without a size. No idea if that has any performance implications?

Would it be worth to do the hashing using multiple threads or is this already parallelized at a higher level?

I don’t think you can specify a size for maps.

The hashing is fast enough. Theoretically, a function that reads through the file could write the rolling hashes into a buffer, to be read by multiple worker threads that check and update the map. But Go maps are not safe for concurrent access, so each worker would need its own map, merged afterwards.
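To sketch the merge idea: if each worker owns the hashes where hash mod workers equals its index, the shards have disjoint keys, need no locking, and merge trivially. Everything here is hypothetical (names, the precomputed hashes standing in for the rolling pass, the channel scheme), and real code would have to batch the channel sends, which would otherwise dominate:

```go
package main

import (
	"fmt"
	"sync"
)

// pair carries one rolling-hash result from the reader to a worker.
type pair struct {
	off int64
	h   uint32
}

// findParallel shards the result map by hash modulo worker count.
// hashAt fakes the per-offset rolling hashes a reader would produce.
func findParallel(hashAt []uint32, interesting map[uint32]bool, workers int) map[uint32][]int64 {
	chs := make([]chan pair, workers)
	shards := make([]map[uint32][]int64, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		chs[w] = make(chan pair, 1024)
		shards[w] = make(map[uint32][]int64)
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for p := range chs[w] {
				if interesting[p.h] { // concurrent map *reads* are safe
					shards[w][p.h] = append(shards[w][p.h], p.off)
				}
			}
		}(w)
	}
	for off, h := range hashAt { // the single reader
		chs[int(h)%workers] <- pair{int64(off), h}
	}
	for _, ch := range chs {
		close(ch)
	}
	wg.Wait()
	merged := make(map[uint32][]int64)
	for _, s := range shards { // shards have disjoint keys
		for k, v := range s {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	hashAt := []uint32{7, 3, 9, 3, 7, 1} // fake rolling hash per offset
	found := findParallel(hashAt, map[uint32]bool{3: true, 7: true}, 2)
	fmt.Println(found[3], found[7]) // → [1 3] [0 4]
}
```

Per-key offset order is preserved because a single reader sends in file order and each key always lands on the same worker.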

Parallelizing the offset map creation sounds promising to me in theory, but I don’t know how well it would fit into the codebase.

As for my idea of assuming the relative order of blocks is unchanged and checking only a constant number of hashes - I have the feeling one can construct worst cases where this fails terribly.

A new, empty map value is made using the built-in function make, which takes the map type and an optional capacity hint as arguments

But again, no idea if this has any measurable effect at all.

That means we do a bunch more memory allocations than necessary. However, from what was written about the profiles (I haven’t looked myself), that’s not the part using the time - the actual lookups are. So yes, pre-allocating would be good since we know the size, but it’s not going to improve the situation here.
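For reference, the pre-allocation being discussed amounts to a one-argument change when building the map (variable names here are illustrative, not Syncthing’s):

```go
package main

import "fmt"

func main() {
	// hashesToFind stands in for the slice of weak hashes given to Find.
	hashesToFind := []uint32{0x11111111, 0x22222222, 0x33333333}

	// Without a capacity hint the map starts at minimum size and grows
	// (rehashing existing entries) as keys are inserted:
	//   offsets := make(map[uint32][]int64)
	// With the hint, space for len(hashesToFind) entries is reserved up
	// front, since the final size is known in advance:
	offsets := make(map[uint32][]int64, len(hashesToFind))
	for _, h := range hashesToFind {
		offsets[h] = nil
	}
	fmt.Println(len(offsets)) // → 3
}
```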