Folders out of sync, even after rebuilding indexes

Ah no I’m being a muppet. I managed top extract an old zip! I’ve got the right version and set the STDEADLOCKTIMEOUT=300 value and I see the deadlock detector is running now [R7DNA] 23:16:01 INFO: Starting deadlock detector with 5m0s timeout

Is this the right exe?

[SVKBT] 23:10:49 INFO: syncthing v0.14.9+16-g1cdfef4 “Dysprosium Dragonfly” (go1.7.1 windows-386) jenkins@build.syncthing.net 2016-10-27 19:47:14 UTC

Right, it detected a deadlock and made a panic log so here it is!

syncthing.log (330.2 KB)

Can you run with STTRACE=sync in addition to the timeout, so that it enables more debugging.

To be honest, looking at your panic, your firewall or something like that is getting in the way, network activity has been hanging for 10 minutes.

We should probably not be holding locks over operations that can possibly take minutes. On mobile right now so not digging into the trace but that sounds interesting, anyway.

STTRACE=Sync was enabled for that trace.

I think it’s unlikely the network is an issue as when the panic happens I have full connectivity between the nodes (no dropped pings, RDP active and responding etc).

I’ve replicated it with AV and firewall disabled. Here’s that trace: syncthing.log (7.8 MB)

It has to be lowercase sync.

Just a typo - it is lowercase

[ignore me, seems my buggy code does not work]

Please try the latest build from build.syncthing.net once it completes.

This one occurred on Node A while Node B was starting up:

syncthing_NodeA.log (465.3 KB)

The relevant info:

panic: deadlock detected at fmut
 fmut = current holder: model\model.go:1329 at routine 535
 pmut = current holder: model\model.go:1224 at routine 147

goroutine 535 [semacquire, 4 minutes]:
sync.runtime_Semacquire(0x1243e6ec)
	c:/go-386/src/runtime/sema.go:47 +0x26
sync.(*RWMutex).RLock(0x1243e6e0)
	c:/go-386/src/sync/rwmutex.go:43 +0x52
github.com/syncthing/syncthing/lib/model.(*Model).Completion(0x1218a000, 0x8287a130, 0x5f34ff27, 0xa23ec8ce, 0x1081a14f, 0xbaa991f0, 0x89448d91, 0x8874c18b, 0xa3a96ecc, 0x1237bf40, ...)
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/lib/model/model.go:487 +0x255
main.(*folderSummaryService).sendSummary(0x13f40bc0, 0x1237bf40, 0xf)
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/cmd/syncthing/summaryservice.go:208 +0x3b1
main.(*folderSummaryService).calculateSummaries(0x13f40bc0)
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/cmd/syncthing/summaryservice.go:154 +0x224
main.(*folderSummaryService).(main.calculateSummaries)-fm()
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/cmd/syncthing/summaryservice.go:50 +0x20
main.serviceFunc.Serve(0x129594d0)
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/cmd/syncthing/summaryservice.go:229 +0x1a
github.com/syncthing/syncthing/vendor/github.com/thejerf/suture.(*Supervisor).runService.func1(0x12137680, 0x1, 0xd97470, 0x129594d0)
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/vendor/github.com/thejerf/suture/supervisor.go:454 +0x4f
created by github.com/syncthing/syncthing/vendor/github.com/thejerf/suture.(*Supervisor).runService
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/vendor/github.com/thejerf/suture/supervisor.go:457 +0x4a
goroutine 147 [semacquire, 4 minutes]:
sync.runtime_Semacquire(0x1243e69c)
	c:/go-386/src/runtime/sema.go:47 +0x26
sync.(*RWMutex).RLock(0x1243e690)
	c:/go-386/src/sync/rwmutex.go:43 +0x52
github.com/syncthing/syncthing/lib/model.(*Model).generateClusterConfig(0x1218a000, 0x8287a130, 0x5f34ff27, 0xa23ec8ce, 0x1081a14f, 0xbaa991f0, 0x89448d91, 0x8874c18b, 0xa3a96ecc, 0x0, ...)
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/lib/model/model.go:1865 +0x83
github.com/syncthing/syncthing/lib/model.(*Model).AddConnection(0x1218a000, 0x142ec400, 0x9cfad0, 0xc, 0xa, 0xd9b220, 0x125cddb0, 0x1415ece0, 0x9, 0x1415ecf0, ...)
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/lib/model/model.go:1264 +0xe59
github.com/syncthing/syncthing/lib/connections.(*Service).handle(0x123361b0)
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/lib/connections/service.go:260 +0x2170
github.com/syncthing/syncthing/lib/connections.(*Service).(github.com/syncthing/syncthing/lib/connections.handle)-fm()
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/lib/connections/service.go:113 +0x20
github.com/syncthing/syncthing/lib/connections.serviceFunc.Serve(0x12162ba8)
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/lib/connections/structs.go:85 +0x1a
github.com/syncthing/syncthing/vendor/github.com/thejerf/suture.(*Supervisor).runService.func1(0x12336120, 0x1, 0xd973b0, 0x12162ba8)
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/vendor/github.com/thejerf/suture/supervisor.go:454 +0x4f
created by github.com/syncthing/syncthing/vendor/github.com/thejerf/suture.(*Supervisor).runService
	c:/jenkins/workspace/syncthing-windows-branch/src/github.com/syncthing/syncthing/vendor/github.com/thejerf/suture/supervisor.go:457 +0x4a

Yet looking at this:

[R7DNA] 2016/10/30 08:51:07.957963 sync.go:111: DEBUG: RWMutex took 1m31.8756011s to lock. Locked at model\model.go:1329. RUnlockers while locking: model\model.go:1080, model\model.go:2006, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:2006, model\rwfolder.go:381, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\rwfolder.go:210, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:1080, model\model.go:2006, model\model.go:2006, model\model.go:2195, model\rwfolder.go:210, model\model.go:2006, model\rwfolder.go:210, model\model.go:2006, model\model.go:2006, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591, model\model.go:591

It looks like it’s mostly contention?

Actually the code that is supposed to capture the issue is not good enough, as it seems some rlocker() is stuck somewhere.

Here’s a Node B panic log from the latest build

syncthing.log (17.1 MB)

Can you try:

https://build.syncthing.net/job/syncthing/2335/

with STTRACE=sync STDEADLOCKTIMEOUT=300 thanks.

I need a Windows build so I’ll wait for that to complete tomorrow

https://build.syncthing.net/job/syncthing-windows/285/ I’ve kicked it off.

Here’s the log from the latest build. This is with Node B running and Node A shut down. It stopped shortly after scanning was complete

syncthing_NodeB_solo.log (5.3 MB)

Can you try latest build from: https://build.syncthing.net/

And also set STDEADLOCK=250 env var together with the others.

I now know exactly where it happens and which commit introduced it, yet I can’t explain why it would happen as both mutexes have locks in read mode and are trying to acquire each others lock in read mode too.