Poor performance on big file

AudriusButkevicius · January 3, 2017, 2:09pm

Weak hash?

calmh · January 3, 2017, 2:10pm

Correct, it’s the weakhash. Comment that out and I’m back in business. I haven’t figured out why yet. It’s supposed to be faster, I think?

No, it’s slow. Very slow. Ouch.

func BenchmarkWeakHash(b *testing.B) {
	const size = 16384
	data := make([]byte, size)
	hf := NewHash(protocol.BlockSize)

	for i := 0; i < b.N; i++ {
		hf.Write(data)
	}

	_ = hf.Sum32()
	b.SetBytes(size)
}

BenchmarkWeakHash-8   5000   251595 ns/op   65.12 MB/s

I can make it about 20% faster by tweaking variables to help the compiler allocate registers, but that’s still an order of magnitude too slow for it to be fun…

calmh · January 3, 2017, 3:01pm

@wucke13 you could compare this build if you like; https://build.syncthing.net/job/syncthing-branch/357/artifact/

The hashing should obviously be faster, but I hope the rest of the behavior is identical (if you deploy it on the receiving side as well).

wucke13 · January 3, 2017, 3:48pm

The displayed hashing perormance did not change a lot, little increase.

[ETMJ5] 16:36:30 INFO: Single thread hash performance is 159 MB/s using minio/sha256-simd (117 MB/s using crypto/sha256).

From a subjective view the Scanning goes faster. The guessed time to scan now dropped to ~ 14m 20s, which seems fine. Like half the time. Syncthing CPU Usage dropped to ~4%, which is half the usage from before. So, your change doubled the speed, and halfed the CPU usage on this scan. Looks like a pretty good improvement to me! When will it make it’s way in the official release?

EDIT: The jumping scan percentage seems to be reproducible by changing the scan interval during a running scan, but most of the time the scan started after changing interval takes over after a few seconds. I still don’t know how to reproduce it so that both scans keep on going all the time?

calmh · January 3, 2017, 3:54pm

Thanks, that’s good data. Changing the scan interval indeed restarts the folder, which could lead to parallel scans (although it’s a bug).

wucke13 · January 3, 2017, 4:00pm

Another thing about the sync speed: what speed should I get syncing a lot (80GB) of little files (2-10MB) via Gbit LAN? I never get over 30MiB/s and averaging around 20MiB/s doing that. It increases by around 5 MiB/s when setting both to 128 Pullers, but that is still less than half the speed which this connection is capable of. Normal behavior?

EDIT: that (128 pullers) causes uncomfortable spikes in CPU usage, even on the 12 core machine.

AudriusButkevicius · January 3, 2017, 5:59pm

Try disabling weak hash from advanced options and increase pullers. Syncing many small files most of the overhead is managing the file, so 20-30MiB probably makes sense.

calmh · January 3, 2017, 6:44pm

Yes. Those 300 Mbps or so are about what you can expect, given crypto and hashing and overhead and stuff.

AudriusButkevicius · January 3, 2017, 6:55pm

diff --git a/lib/weakhash/benchmark_test.go b/lib/weakhash/benchmark_test.go
index 79f5ce9..2f0c2ba 100644
--- a/lib/weakhash/benchmark_test.go
+++ b/lib/weakhash/benchmark_test.go
@@ -9,9 +9,14 @@ package weakhash
 import (
        "os"
        "testing"
+
+       "github.com/chmduquesne/rollinghash/adler32"
+       "github.com/chmduquesne/rollinghash/buzhash"
+       "github.com/chmduquesne/rollinghash/rabinkarp32"
 )

 const testFile = "../model/testdata/~syncthing~file.tmp"
+const size = 128 << 10

 func BenchmarkFind1MFile(b *testing.B) {
        b.ReportAllocs()
@@ -28,3 +33,51 @@ func BenchmarkFind1MFile(b *testing.B) {
                fd.Close()
        }
 }
+
+func BenchmarkWeakHashOurs(b *testing.B) {
+       data := make([]byte, size)
+       hf := NewHash(size)
+
+       for i := 0; i < b.N; i++ {
+               hf.Write(data)
+       }
+
+       _ = hf.Sum32()
+       b.SetBytes(size)
+}
+
+func BenchmarkWeakHashRabinKarp(b *testing.B) {
+       data := make([]byte, size)
+       hf := rabinkarp32.New()
+
+       for i := 0; i < b.N; i++ {
+               hf.Write(data)
+       }
+
+       _ = hf.Sum32()
+       b.SetBytes(size)
+}
+
+func BenchmarkWeakHashAdler32(b *testing.B) {
+       data := make([]byte, size)
+       hf := adler32.New()
+
+       for i := 0; i < b.N; i++ {
+               hf.Write(data)
+       }
+
+       _ = hf.Sum32()
+       b.SetBytes(size)
+}
+
+func BenchmarkWeakHashBuzHash(b *testing.B) {
+       data := make([]byte, size)
+       hf := buzhash.New()
+
+       for i := 0; i < b.N; i++ {
+               hf.Write(data)
+       }
+
+       _ = hf.Sum32()
+       b.SetBytes(size)
+}

C:/go/bin/go.exe test -test.bench=.* [C:/gohome/src/github.com/syncthing/syncthing/lib/weakhash]
BenchmarkFind1MFile-4          	      20	  64061805 ns/op	  16.37 MB/s	 6952658 B/op	  917549 allocs/op
BenchmarkWeakHashOur-4         	    1000	   2220122 ns/op	  59.04 MB/s
BenchmarkWeakHashRabinKarp-4   	    3000	    513630 ns/op	 255.19 MB/s
BenchmarkWeakHashAdler32-4     	   10000	    108073 ns/op	1212.81 MB/s
BenchmarkWeakHashBuzHash-4     	    3000	    396835 ns/op	 330.29 MB/s
PASS
ok  	github.com/syncthing/syncthing/lib/weakhash	7.768s

AudriusButkevicius · January 3, 2017, 6:56pm

Should we switch to adler32?

calmh · January 3, 2017, 7:35pm

Indeed we should. I think we should be able to do so directly and seamlessly as a weakhash mismatch doesn’t really matter.

I did a

diff --git a/lib/scanner/blocks.go b/lib/scanner/blocks.go
index 89a5da6..073e0c5 100644
--- a/lib/scanner/blocks.go
+++ b/lib/scanner/blocks.go
@@ -27,6 +27,7 @@ func Blocks(r io.Reader, blocksize int, sizehint int64, counter Counter) ([]prot
        hf := sha256.New()
        hashLength := hf.Size()
        whf := weakhash.NewHash(blocksize)
+       combinedHF := io.MultiWriter(hf, whf)
 
        var blocks []protocol.BlockInfo
        var hashes, thisHash []byte
@@ -46,7 +47,7 @@ func Blocks(r io.Reader, blocksize int, sizehint int64, counter Counter) ([]prot
        var offset int64
        for {
                lr := io.LimitReader(r, int64(blocksize))
-               n, err := io.CopyBuffer(hf, io.TeeReader(lr, whf), buf)
+               n, err := io.CopyBuffer(combinedHF, lr, buf)
                if err != nil {
                        return nil, err
                }

to get the right buffering towards the weakhash, it had a minor positive performance impact as well.

AudriusButkevicius · January 3, 2017, 7:36pm

We might cause weak hash collisions by switching over, but that should be fine.

calmh · January 3, 2017, 7:36pm

Yeah, exactly. Dgryski has another ten hash functions or so over here, but I haven’t checked if any are rolling and fast: https://github.com/dgryski?utf8=✓&tab=repositories&q=hash&type=&language=

AudriusButkevicius · January 3, 2017, 7:52pm

diff --git a/lib/weakhash/benchmark_test.go b/lib/weakhash/benchmark_test.go
index 79f5ce9..1ab3cb6 100644
--- a/lib/weakhash/benchmark_test.go
+++ b/lib/weakhash/benchmark_test.go
@@ -9,9 +9,14 @@ package weakhash
 import (
        "os"
        "testing"
+
+       "github.com/chmduquesne/rollinghash/adler32"
+       "github.com/chmduquesne/rollinghash/buzhash"
+       "github.com/chmduquesne/rollinghash/rabinkarp32"
 )

 const testFile = "../model/testdata/~syncthing~file.tmp"
+const size = 128 << 10

 func BenchmarkFind1MFile(b *testing.B) {
        b.ReportAllocs()
@@ -28,3 +33,163 @@ func BenchmarkFind1MFile(b *testing.B) {
                fd.Close()
        }
 }
+
+func BenchmarkFind1MFileRabinKarp(b *testing.B) {
+       b.ReportAllocs()
+       b.SetBytes(1 << 20)
+       for i := 0; i < b.N; i++ {
+               fd, err := os.Open(testFile)
+               if err != nil {
+                       b.Fatal(err)
+               }
+               _, err = FindCustom(fd, []uint32{0, 1, 2}, 128<<10, rabinkarp32.New)
+               if err != nil {
+                       b.Fatal(err)
+               }
+               fd.Close()
+       }
+}
+
+func BenchmarkFind1MFileAdler32(b *testing.B) {
+       b.ReportAllocs()
+       b.SetBytes(1 << 20)
+       for i := 0; i < b.N; i++ {
+               fd, err := os.Open(testFile)
+               if err != nil {
+                       b.Fatal(err)
+               }
+               _, err = FindCustom(fd, []uint32{0, 1, 2}, 128<<10, adler32.New)
+               if err != nil {
+                       b.Fatal(err)
+               }
+               fd.Close()
+       }
+}
+
+func BenchmarkFind1MFileBuzHash(b *testing.B) {
+       b.ReportAllocs()
+       b.SetBytes(1 << 20)
+       for i := 0; i < b.N; i++ {
+               fd, err := os.Open(testFile)
+               if err != nil {
+                       b.Fatal(err)
+               }
+               _, err = FindCustom(fd, []uint32{0, 1, 2}, 128<<10, buzhash.New)
+               if err != nil {
+                       b.Fatal(err)
+               }
+               fd.Close()
+       }
+}
+
+func BenchmarkWeakHashOurs(b *testing.B) {
+       data := make([]byte, size)
+       hf := NewHash(size)
+
+       for i := 0; i < b.N; i++ {
+               hf.Write(data)
+       }
+
+       _ = hf.Sum32()
+       b.SetBytes(size)
+}
+
+func BenchmarkWeakHashRabinKarp(b *testing.B) {
+       data := make([]byte, size)
+       hf := rabinkarp32.New()
+
+       for i := 0; i < b.N; i++ {
+               hf.Write(data)
+       }
+
+       _ = hf.Sum32()
+       b.SetBytes(size)
+}
+
+func BenchmarkWeakHashAdler32(b *testing.B) {
+       data := make([]byte, size)
+       hf := adler32.New()
+
+       for i := 0; i < b.N; i++ {
+               hf.Write(data)
+       }
+
+       _ = hf.Sum32()
+       b.SetBytes(size)
+}
+
+func BenchmarkWeakHashBuzHash(b *testing.B) {
+       data := make([]byte, size)
+       hf := buzhash.New()
+
+       for i := 0; i < b.N; i++ {
+               hf.Write(data)
+       }
+
+       _ = hf.Sum32()
+       b.SetBytes(size)
+}
+
+func BenchmarkWeakHashOursRoll(b *testing.B) {
+       data := make([]byte, size)
+       hf := NewHash(size)
+       hf.Write(data)
+
+       b.ResetTimer()
+
+       for i := 0; i < b.N; i++ {
+               for i := 0; i <= size; i++ {
+                       hf.Write([]byte{'a'})
+               }
+       }
+
+       b.SetBytes(size)
+}
+
+func BenchmarkWeakHashRabinKarpRoll(b *testing.B) {
+       data := make([]byte, size)
+       hf := rabinkarp32.New()
+       hf.Write(data)
+
+       b.ResetTimer()
+
+       for i := 0; i < b.N; i++ {
+               for i := 0; i <= size; i++ {
+                       hf.Roll('a')
+               }
+       }
+
+       b.SetBytes(size)
+}
+
+func BenchmarkWeakHashAdler32Roll(b *testing.B) {
+       data := make([]byte, size)
+       hf := adler32.New()
+       hf.Write(data)
+
+       b.ResetTimer()
+
+       for i := 0; i < b.N; i++ {
+               for i := 0; i <= size; i++ {
+                       hf.Roll('a')
+               }
+       }
+
+       b.SetBytes(size)
+}
+
+func BenchmarkWeakHashBuzHashRoll(b *testing.B) {
+       data := make([]byte, size)
+       hf := buzhash.New()
+       hf.Write(data)
+
+       b.ResetTimer()
+
+       for i := 0; i < b.N; i++ {
+               for i := 0; i <= size; i++ {
+                       hf.Roll('a')
+               }
+       }
+
+       b.SetBytes(size)
+}
diff --git a/lib/weakhash/weakhash.go b/lib/weakhash/weakhash.go
index 5c210a5..9411b54 100644
--- a/lib/weakhash/weakhash.go
+++ b/lib/weakhash/weakhash.go
@@ -11,6 +11,8 @@ import (
        "hash"
        "io"
        "os"
+
+       "github.com/chmduquesne/rollinghash"
 )

 const (
@@ -71,6 +73,53 @@ func Find(ir io.Reader, hashesToFind []uint32, size int) (map[uint32][]int64, er
        return offsets, nil
 }

+// Find finds all the blocks of the given size within io.Reader that matches
+// the hashes provided, and returns a hash -> slice of offsets within reader
+// map, that produces the same weak hash.
+func FindCustom(ir io.Reader, hashesToFind []uint32, size int, hashMaker func() rollinghash.Hash32) (map[uint32][]int64, error) {
+       if ir == nil {
+               return nil, nil
+       }
+
+       r := bufio.NewReader(ir)
+       hf := hashMaker()
+
+       n, err := io.CopyN(hf, r, int64(size))
+       if err == io.EOF {
+               return nil, nil
+       }
+       if err != nil {
+               return nil, err
+       }
+       if n != int64(size) {
+               return nil, io.ErrShortBuffer
+       }
+
+       offsets := make(map[uint32][]int64)
+       for _, hashToFind := range hashesToFind {
+               offsets[hashToFind] = nil
+       }
+
+       var i int64
+       var hash uint32
+       for {
+               hash = hf.Sum32()
+               if existing, ok := offsets[hash]; ok {
+                       offsets[hash] = append(existing, i)
+               }
+               i++
+
+               bt, err := r.ReadByte()
+               if err == io.EOF {
+                       break
+               } else if err != nil {
+                       return offsets, err
+               }
+               hf.Roll(bt)
+       }
+       return offsets, nil
+}
+

BenchmarkFind1MFile-4              	      20	  59880225 ns/op	  17.51 MB/s	 6952674 B/op	  917549 allocs/op
BenchmarkFind1MFileRabinKarp-4     	      50	  28291430 ns/op	  37.06 MB/s	  168810 B/op	      18 allocs/op
BenchmarkFind1MFileAdler32-4       	      50	  33253970 ns/op	  31.53 MB/s	  168816 B/op	      21 allocs/op
BenchmarkFind1MFileBuzHash-4       	      30	  43864000 ns/op	  23.91 MB/s	14800502 B/op	      52 allocs/op
BenchmarkWeakHashOurs-4            	    1000	   2108324 ns/op	  62.17 MB/s
BenchmarkWeakHashRabinKarp-4       	    3000	    468214 ns/op	 279.94 MB/s
BenchmarkWeakHashAdler32-4         	   20000	     89862 ns/op	1458.58 MB/s
BenchmarkWeakHashBuzHash-4         	    5000	    368838 ns/op	 355.36 MB/s
BenchmarkWeakHashOursRoll-4        	     300	   4396461 ns/op	  29.81 MB/s
BenchmarkWeakHashRabinKarpRoll-4   	    2000	   1029349 ns/op	 127.33 MB/s
BenchmarkWeakHashAdler32Roll-4     	    1000	   1489260 ns/op	  88.01 MB/s
BenchmarkWeakHashBuzHashRoll-4     	    1000	   1154916 ns/op	 113.49 MB/s
PASS
ok  	github.com/syncthing/syncthing/lib/weakhash	21.007s

But the actual roll is quite shit overall

calmh · January 3, 2017, 8:24pm

Yeah. Anything we need to do byte by byte for an entire file is going to suck. But that part of it can at least be switched off by configuration, as opposed to what we do per block in the scanner.

If we wanted to go the complicated route we could add some of the heuristics we talked about when introducing the weakhash - only do the rolling if we see a lot of blocks that differ after a certain point in the file. That would avoid us weakhashing a 75 gig file at 88 MB/s just to see if we can find the one 128 KiB block we are looking for instead of transferring it over the network in like a millisecond…

I think we could optimize it a fair bit by hooking into the digest directly as well, for the rolling case. We’re doing the same amount of work really, it’s just that we have a bunch of function calls for every byte hashed. If we instead passed the entire block to Write (or a function like it) and let it return 128k weak hashes in one pass it would probably be about the same speed.

JKing · January 3, 2017, 9:19pm

I actually also encountered doubled scanning progress a few days ago when I changed .stignore mid-scan. I assumed (apparently incorrectly) that this was expected behaviour, how ever weird (which is par for the course with .stignore).

Laurens · January 4, 2017, 10:49am

Hi! Hopefully, this helps; where I work, a few years ago, we needed to hash a LOT of data, and we spent some time into investigating which algo would work best for us (I wasn’t involved in this myself, so I don’t know exactly if our needs are similair to Syncthing’s needs.)

BUT: We ended up using Murmurhash; I just checked, there is a Go implementation of it. I googled around a bit to find some benchmarks which compare adler32 to murmurhash, and found one (which was comparing Java implementations, but still):

http://jpountz.github.io/lz4-java/1.3.0/xxhash-benchmark/

Which indicated that murmur(2) is faster (at least in java) than adler32.

BUT: it also pointed to xxHash being even faster.

It also has a go implementation, and it’s webpage is here:

http://cyan4973.github.io/xxHash/

Hope this helps!

wweich · January 10, 2017, 12:09pm

With 0.14.19 we are back in business

25,8 GiB vdi file, Core i5-4590S (“sha256Perf”: 361.6), VM and ST DB on System SSD (960 EVO NVMe). Single thread hash performance is 385 MB/s using crypto/sha256 (379 MB/s using minio/sha256-simd).

Scan time: ~1:40 Scan Speed: 271 MB/s

Still ~100 MB/s less then the test on startup, but a huge improvement over the 47 MB/s

calmh · January 10, 2017, 3:07pm

There is some overhead that brings down the scan speed a bit from the theoretical.

One we can calculate, although it’s not displayed in the startup or GUI: the weak hash checksum. Each block is hashed by SHA256 (385 MB/s above) and then the weak hash (~1200 MB/s on my computer). The resulting theoretical hash speed is 1/(1/385 + 1/1200) = 291 MB/s. We could potentially do these two in parallel, in the future…

Then there is some memory allocation, shuffling, waiting for disk/SSD etc that I think accounts for the rest.

wucke13 · January 10, 2017, 3:49pm

I got arround 15:40 for 75 GB, which is arround 80 MB/s. More or less doubled the speed for me. Note the printed hashing speed is still what it was in 0.14.18-1:

INFO: Single thread hash performance is 156 MB/s using minio/sha256-simd (117 MB/s using crypto/sha256).

As already mentioned by @wweich, that’s a pretty good improvement. Still, 80MB/s is about 10% less then half of the speed my disk is able to deliver. Is there a way to let at least 2 threads work on a single file, maybe by letting them run from the top to bottom and vice versa, meeting in the middle?