Hanging panics in integration tests

imsodin · December 16, 2018, 2:22pm

I wanted to add some “assert-style” panics to the code and run integration-tests in the hope of getting something tangible regarding https://github.com/syncthing/syncthing/issues/5340. The I came across this very weird issue:

diff --git a/lib/db/meta.go b/lib/db/meta.go
index dffc3f3ed..9ed9cfd52 100644
--- a/lib/db/meta.go
+++ b/lib/db/meta.go
@@ -8,6 +8,7 @@ package db
 
 import (
        "bytes"
+       "fmt"
        "math/bits"
        "time"
 
@@ -112,6 +113,8 @@ func (m *metadataTracker) addFile(dev protocol.DeviceID, f FileIntf) {
                return
        }
 
+       fmt.Println("panic outside lock")
+       panic("panic outside lock")
        m.mut.Lock()
        m.dirty = true

Build and run the SyncCluster test results in the expected two lines and the trace afterwards:

panic outside lock
panic: panic outside lock

Inserting the panic after the lock like this

diff --git a/lib/db/meta.go b/lib/db/meta.go
index dffc3f3ed..c14cf43df 100644
--- a/lib/db/meta.go
+++ b/lib/db/meta.go
@@ -8,6 +8,7 @@ package db
 
 import (
        "bytes"
+       "fmt"
        "math/bits"
        "time"
 
@@ -114,6 +115,8 @@ func (m *metadataTracker) addFile(dev protocol.DeviceID, f FileIntf) {
 
        m.mut.Lock()
        m.dirty = true
+       fmt.Println("panic inside lock")
+       panic("panic inside lock")
 
        if flags := f.FileLocalFlags(); flags == 0 {
                // Account regular files in the zero-flags bucket.

Results in the fmt line to be printed, but no panic - the process even seems to partially carry on before hanging:

panic inside lock
panic inside lock
[I6KAH] 15:19:35 INFO: Detected 3 NAT services

I can’t find anything helpful in the trace when the test finally times out: test.out (8.1 KB)

Any ideas what the reason might be?

calmh · December 16, 2018, 4:36pm

Are we missing something with the suture-should-not-restart-panics in that code path?

If the code path is from the HTTP server it would swallow the panic (though also print it) and we’d hang on the lock next time through. addFile though shouldn’t get called that way, afaik.

imsodin · December 16, 2018, 6:08pm

The panics works if it’s in the exact same function above the lock - so it can’t be a missed suture-passthrough-panic-thing.

calmh · December 16, 2018, 6:52pm

I admit I have no better idea…

imsodin · December 16, 2018, 9:00pm

Pity, I was hoping this was some not so obvious consequence of an obscure language spec that you knew and not a weird problem that should be impossible and might also be specific to my system/syncthing integration tests/moon phase/…

calmh · December 16, 2018, 9:47pm

Well, it’s not just you - I see the same thing. Panic before the lock panics as it should. Panic inside the lock is swallowed. By what, I don’t know, but it’s bad and it shouldn’t happen.

Edit: Oh, I know. addFiles is called from (s *FileSet) Update(...) and it panics somewhere in the middle. But defers run on panic, and we have a defer s.meta.toDB(...) in there which tries to acquire the same lock while unwinding from the panic. So the panic never completes.

If addFiles used a defer unlock instead (and we don’t panic precisely between the lock and the defer-unlock, which should be impossible) it works.

Guideline: prefer defer unlocks…

imsodin · December 16, 2018, 11:39pm

Thanks a lot for finding the cause!
Apparently Go’s panic isn’t panicked enough if it lets itself be restrained by something as feeble as a simple lock

~~I found something completely puzzling concerning metadataTracker, which means it’s either broken or I am too tired already. To rule out the second I’ll go to sleep and have a look another time.~~ I have been sleeping

calmh · December 17, 2018, 7:11am

I think this is a double whammy of not using deferred unlock and doing stuff that takes locks in a defer. Both of those are probably suboptimal practice.