I’ve added and tweaked a benchmark run to the integration tests, that transfers the exact same set of files each time and measures the time it takes. That could be used. But I’m not really disputing that running without crypto (or with weak crypto) is faster than doing real crypto…
The claim of different file sizes having different speeds is the one thst intrigues me. Plus it’s interesting to see where the bottlenecks are in general in his case.
Receiver: 100MB all files empty - no data 1GB:
(pprof) top
81.35s of 81.74s total (99.52%)
Dropped 17 nodes (cum <= 0.41s)
Showing top 10 nodes out of 24 (cum >= 1.43s)
flat flat% sum% cum cum%
50.56s 61.85% 61.85% 50.56s 61.85% crypto/aes.encryptBlockGo
26.09s 31.92% 93.77% 26.09s 31.92% crypto/cipher.(*gcm).mul
2.50s 3.06% 96.83% 2.50s 3.06% crypto/cipher.getUint64
1.25s 1.53% 98.36% 1.25s 1.53% scanblock
0.45s 0.55% 98.91% 0.45s 0.55% crypto/cipher.fastXORWords
0.41s 0.5% 99.41% 29s 35.48% crypto/cipher.(*gcm).updateBlocks
0.05s 0.061% 99.47% 50.64s 61.95% crypto/aes.(*aesCipher).Encrypt
0.03s 0.037% 99.51% 50.59s 61.89% crypto/aes.encryptBlock
0.01s 0.012% 99.52% 51.10s 62.52% crypto/cipher.(*gcm).counterCrypt
0 0% 99.52% 1.43s 1.75% GC
5GB:
447.07s of 456.51s total (97.93%)
Dropped 392 nodes (cum <= 2.28s)
Showing top 10 nodes out of 32 (cum >= 274.18s)
flat flat% sum% cum cum%
270.42s 59.24% 59.24% 270.42s 59.24% crypto/aes.encryptBlockGo
142.48s 31.21% 90.45% 142.48s 31.21% crypto/cipher.(*gcm).mul
13.33s 2.92% 93.37% 13.33s 2.92% crypto/cipher.getUint64
7.99s 1.75% 95.12% 7.99s 1.75% scanblock
5.92s 1.30% 96.41% 5.92s 1.30% crypto/sha256.block
2.88s 0.63% 97.04% 158.60s 34.74% crypto/cipher.(*gcm).updateBlocks
2.67s 0.58% 97.63% 2.67s 0.58% crypto/cipher.fastXORWords
0.71s 0.16% 97.79% 271.13s 59.39% crypto/aes.encryptBlock
0.48s 0.11% 97.89% 271.61s 59.50% crypto/aes.(*aesCipher).Encrypt
0.19s 0.042% 97.93% 274.18s 60.06% crypto/cipher.(*gcm).counterCrypt
Sender: 100MB:
47.57s of 51.07s total (93.15%)
Dropped 133 nodes (cum <= 0.26s)
Showing top 10 nodes out of 36 (cum >= 0.90s)
flat flat% sum% cum cum%
22.53s 44.12% 44.12% 22.53s 44.12% crypto/aes.encryptBlockG
17.59s 34.44% 78.56% 17.59s 34.44% crypto/cipher.(*gcm).mul
2.16s 4.23% 82.79% 2.16s 4.23% crypto/cipher.getUint64
1.52s 2.98% 85.76% 21.27s 41.65% crypto/cipher.(*gcm).upd
0.96s 1.88% 87.64% 0.96s 1.88% scanblock
0.74s 1.45% 89.09% 0.74s 1.45% runtime.MSpan_Sweep
0.73s 1.43% 90.52% 23.26s 45.55% crypto/aes.encryptBlock
0.67s 1.31% 91.83% 0.67s 1.31% crypto/cipher.fastXORWor
0.44s 0.86% 92.70% 0.44s 0.86% crypto/cipher.gcmInc32
0.23s 0.45% 93.15% 0.90s 1.76% crypto/cipher.xorWords
1GB:
41.38s of 52.52s total (78.79%)
Dropped 133 nodes (cum <= 0.26s)
Showing top 10 nodes out of 46 (cum >= 1.45s)
flat flat% sum% cum cum%
15.88s 30.24% 30.24% 15.88s 30.24% crypto/aes.encryptBlockGo
10.25s 19.52% 49.75% 10.25s 19.52% crypto/cipher.(*gcm).mul
2.64s 5.03% 54.78% 6.24s 11.88% github.com/calmh/xdr.(*Reader).ReadUint32
2.19s 4.17% 58.95% 2.19s 4.17% scanblock
2.03s 3.87% 62.81% 3.45s 6.57% io.ReadAtLeast
1.97s 3.75% 66.57% 1.99s 3.79% runtime.mallocgc
1.82s 3.47% 70.03% 9.68s 18.43% github.com/calmh/xdr.(*Reader).ReadBytesMaxInto
1.62s 3.08% 73.12% 1.62s 3.08% runtime.MSpan_Sweep
1.54s 2.93% 76.05% 4.99s 9.50% io.ReadFull
1.44s 2.74% 78.79% 1.45s 2.76% crypto/cipher.getUint64
5GB:
837.92s of 1130.55s total (74.12%)
Dropped 419 nodes (cum <= 5.65s)
Showing top 10 nodes out of 59 (cum >= 58.49s)
flat flat% sum% cum cum%
163.01s 14.42% 14.42% 163.01s 14.42% scanblock
116.94s 10.34% 24.76% 116.94s 10.34% crypto/aes.encryptBlockGo
90.28s 7.99% 32.75% 228.26s 20.19% github.com/calmh/xdr.(*Reader).ReadUint32
83.94s 7.42% 40.17% 142.93s 12.64% io.ReadAtLeast
78.10s 6.91% 47.08% 78.72s 6.96% runtime.mallocgc
71.81s 6.35% 53.43% 71.81s 6.35% crypto/cipher.(*gcm).mul
64.48s 5.70% 59.14% 64.48s 5.70% runtime.MSpan_Sweep
62.50s 5.53% 64.66% 338.13s 29.91% github.com/calmh/xdr.(*Reader).ReadBytesMaxInto
54.64s 4.83% 69.50% 197.58s 17.48% io.ReadFull
52.22s 4.62% 74.12% 58.49s 5.17% bytes.(*Reader).Read
I am not sure if the profiles are of multiple runs or of a single run (since the profiler produces multiple files during the runtime), I assumed it’s a single run and checked only the last files.
It seems that we are very inefficient with memory management on the sender with larger files causing load with memory allocations which is now: https://github.com/syncthing/syncthing/issues/1157
Thanks for testing
I’m not worried about the crypto… we can switch to faster algorithms if we need to like EC-DH and ChaCha20 , which guarantee PFS and are at least 2x faster than AES etc. Or better crypto libraries. The crypto is only gonna be about 3% to 15% of the overhead, but probably closer to 5% with AES hardware acceleration.
The only latency is during keypair generation and session key agreement, not transfer. So I don’t think the crypto is slowing transfers too much.
The slow speeds are a few things… (1) No parallel transfer / saturation of bandwidth to optimum … (2) Lack of efficient multithreading , (3) Lack of a dedicated & distributed cache for writes() , (4) Lack of file locks and signalling, (5) Perhaps we need to check UDP as well, as it’s generalyl faster.
Oh also, the block size needs to be (A) configurable, (B) block id not random, should be ‘rarest first’, © other nodes should share blocks, and (D) we need filesystem locking during changes.
So there’s alot to do… most of it involves testing.