Folders out of sync, even after rebuilding indexes

Still no progress after 13 hours:

[SVKBT] 2016/10/22 09:46:29.064427 walk.go:189: DEBUG: Walk \\?\Z:\Movies-Temp [] current progress 1810676820/4121401086 at 0.0 MB/s (43%)

It’s not a slow computer - quad-core 2.8 GHz CPU / 6 GB memory - so it looks like something is blocking creation of the indexes. There are no .ldb files yet, and the few files in the index folder haven’t been touched since 20:44 yesterday, when the indexing appears to have stopped.
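A quick way to check the database files and their timestamps from a command prompt (a sketch only - it assumes the default %LOCALAPPDATA%\Syncthing location and the usual index-v0.14.0.db folder name for v0.14; adjust if you run with a custom -home or the folder is named differently):

rem List the Syncthing config/database folder and the index database inside it,
rem to see whether any .ldb files exist and when they were last written.
dir "%LOCALAPPDATA%\Syncthing"
dir "%LOCALAPPDATA%\Syncthing\index-v0.14.0.db"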

Is the number growing? If not, it looks like the folder is on a network drive that is borked and blocks reads; if you are storing the index there as well, it’s even worse.

Nope, the number hasn’t grown for 13 hrs. It’s on a local drive.

I’ve renamed the index folder and restarted indexing to see if I get the same issue again.

Edit: it’s created the first .ldb file, so it’s got further than before!

Since this is Windows, I’m going to suggest disabling antivirus scanners straight off the bat.

So the indexes have finished rebuilding:
Node B was started alone, indexes built, and it was shut down - everything showing “up to date”.
Node A was started alone, indexes built - everything showing “up to date”.
Node B was then restarted and left to scan and sync up as needed.
Much log activity for a while, and then nothing.

On Node B:

Some .tmp files were observed being created again but staying as .tmp files. This is a grep of the log for one of them - ~syncthing~DSC00966.ARW.tmp

[SVKBT] 2016/10/25 11:40:13.615936 sharedpullerstate.go:251: DEBUG: sharedPullerState RAW-Temp RAW-2015-12-15-Christmas-Show\DSC00966.ARW pullNeeded start -> 14
[SVKBT] 2016/10/25 11:40:13.659938 sharedpullerstate.go:251: DEBUG: sharedPullerState RAW-Temp RAW-2015-12-15-Christmas-Show\DSC00966.ARW pullNeeded start -> 15
[SVKBT] 2016/10/25 11:40:13.718942 sharedpullerstate.go:251: DEBUG: sharedPullerState RAW-Temp RAW-2015-12-15-Christmas-Show\DSC00966.ARW pullNeeded start -> 16
[SVKBT] 2016/10/25 11:40:13.745943 sharedpullerstate.go:251: DEBUG: sharedPullerState RAW-Temp RAW-2015-12-15-Christmas-Show\DSC00966.ARW pullNeeded start -> 17
[SVKBT] 2016/10/25 11:40:14.042960 progressemitter.go:63: DEBUG: progress emitter: timer - looking after 5
[SVKBT] 2016/10/25 11:40:14.042960 progressemitter.go:103: DEBUG: progress emitter: emitting map[string]map[string]*model.pullerProgress{"Nexus5-Camera":map[string]*model.pullerProgress{"IMG_20161022_152218.jpg":(*model.pullerProgress)(0xc047a1fbc0)}, "Movies-Temp":map[string]*model.pullerProgress{"2016-09-13-Dancing\VID_20160928_Car170451.mp4":(*model.pullerProgress)(0xc047a1fc00)}, "RAW-Temp":map[string]*model.pullerProgress{"RAW-2015-12-15-Christmas-Show\DSC00966.ARW":(*model.pullerProgress)(0xc047a1fb00)}, "PICS-Misc":map[string]*model.pullerProgress{"Cycling\PANO_20160816_064724.jpg":(*model.pullerProgress)(0xc047a1fb40)}, "PICS-Temp":map[string]*model.pullerProgress{"2016-08-06-France-Holiday\IMG_20160723_164138.jpg":(*model.pullerProgress)(0xc047a1fb80)}}
[SVKBT] 2016/10/25 11:40:14.108964 rwfolder.go:196: DEBUG: rwFolder/Movies@0xc0435a0000 remote index updated, rescheduling pull
[SVKBT] 2016/10/25 11:40:14.761001 sharedpullerstate.go:261: DEBUG: sharedPullerState Nexus5-Camera IMG_20161022_152218.jpg pullNeeded done -> 14
[SVKBT] 2016/10/25 11:40:15.483043 sharedpullerstate.go:261: DEBUG: sharedPullerState Nexus5-Camera IMG_20161022_152218.jpg pullNeeded done -> 13
[SVKBT] 2016/10/25 11:40:19.043246 progressemitter.go:63: DEBUG: progress emitter: timer - looking after 5
[SVKBT] 2016/10/25 11:40:19.043246 progressemitter.go:103: DEBUG: progress emitter: emitting map[string]map[string]*model.pullerProgress{"PICS-Temp":map[string]*model.pullerProgress{"2016-08-06-France-Holiday\IMG_20160723_164138.jpg":(*model.pullerProgress)(0xc04727e640)}, "Nexus5-Camera":map[string]*model.pullerProgress{"IMG_20161022_152218.jpg":(*model.pullerProgress)(0xc04727e680)}, "Movies-Temp":map[string]*model.pullerProgress{"2016-09-13-Dancing\VID_20160928_Car170451.mp4":(*model.pullerProgress)(0xc04727e6c0)}, "RAW-Temp":map[string]*model.pullerProgress{"RAW-2015-12-15-Christmas-Show\DSC00966.ARW":(*model.pullerProgress)(0xc04727e700)}, "PICS-Misc":map[string]*model.pullerProgress{"Cycling\PANO_20160816_064724.jpg":(*model.pullerProgress)(0xc04727e740)}}

Node B now shows some errors rolling through the log:

[SVKBT] 2016/10/25 20:46:27.314484 progressemitter.go:82: DEBUG: progress emitter: nothing new
[SVKBT] 2016/10/25 20:46:32.314770 progressemitter.go:63: DEBUG: progress emitter: timer - looking after 5
[SVKBT] 2016/10/25 20:46:32.314770 progressemitter.go:82: DEBUG: progress emitter: nothing new
[SVKBT] 2016/10/25 20:46:37.315056 progressemitter.go:63: DEBUG: progress emitter: timer - looking after 5
[SVKBT] 2016/10/25 20:46:37.315056 progressemitter.go:82: DEBUG: progress emitter: nothing new

The web gui shows this error:

On Node A

The web GUI shows everything up to date and all devices disconnected. I don’t see any open connections to Node B in netstat -an.

The last contact time for Node B shows 11:29 (~11 hrs ago), so I think something is forcing the two to disconnect. I can, however, telnet to port 22000 from each node to the other, so there’s nothing blocking them from talking. They both show 5/8 discovery sources and 2/2 listeners. I believe the web interface on Node A isn’t up to date though, as it shows 9 hrs uptime, but the last restart was actually 10.5 hrs ago.
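The telnet check is just something like this, run from each node towards the other (a sketch only; substitute the other node’s actual LAN address, the port is the default sync port):

rem From Node A towards Node B (and the reverse from Node B towards Node A);
rem the session opening without an error means the port is reachable.
telnet 192.168.1.20 22000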

These sorts of messages are scrolling through the log:

[R7DNA] 2016/10/25 22:39:29.633389 rwfolder.go:223: DEBUG: rwFolder/RAW-PICS@0x121be160 skip (curSeq == prevSeq) 0 true
[R7DNA] 2016/10/25 22:39:32.684412 rwfolder.go:223: DEBUG: rwFolder/PICS-Misc@0x121bed10 skip (curSeq == prevSeq) 0 true

On Node B

I restarted Syncthing on Node B and there was action in the log as it was rescanning. About half the folders have gone Out of Sync again since it completed.

The grep for DSC00966.ARW now shows this:

syncthing.log:[SVKBT] 2016/10/25 22:45:07.147633 walk.go:252: DEBUG: temporary: RAW-2015-12-15-Christmas-Show\~syncthing~DSC00966.ARW.tmp
syncthing.log:[SVKBT] 2016/10/25 22:45:58.361562 rwfolder.go:442: DEBUG: rwFolder/RAW-Temp@0xc043310000 handling RAW-2015-12-15-Christmas-Show\DSC00966.ARW
syncthing.log:[SVKBT] 2016/10/25 23:07:31.819543 rwfolder.go:442: DEBUG: rwFolder/RAW-Temp@0xc043310000 handling RAW-2015-12-15-Christmas-Show\DSC00966.ARW
syncthing.log:[SVKBT] 2016/10/25 23:08:01.591246 rwfolder.go:442: DEBUG: rwFolder/RAW-Temp@0xc043310000 handling RAW-2015-12-15-Christmas-Show\DSC00966.ARW

The log on Node B is showing similar contents to Node A

[SVKBT] 2016/10/25 23:27:20.765547 rwfolder.go:223: DEBUG: rwFolder/PICS-Misc@0xc04220a000 skip (curSeq == prevSeq) 2496 true
[SVKBT] 2016/10/25 23:27:20.906555 rwfolder.go:223: DEBUG: rwFolder/PICS-Temp@0xc043472640 skip (curSeq == prevSeq) 589 true

On Node A

The log now shows these messages since the rescan of Node B finished

[R7DNA] 2016/10/25 23:29:24.282047 progressemitter.go:63: DEBUG: progress emitter: timer - looking after 1
[R7DNA] 2016/10/25 23:29:24.282047 progressemitter.go:82: DEBUG: progress emitter: nothing new
[R7DNA] 2016/10/25 23:29:29.286170 progressemitter.go:63: DEBUG: progress emitter: timer - looking after 1

And a couple of timeouts in the logs. Is there a way to find out why these occurred and why it didn’t reconnect?

On Node A

No read timeouts in the log

On Node B

[SVKBT] 2016/10/25 11:40:09.197683 model.go:964: INFO: Connection to R7DNAZX closed: read timeout
[SVKBT] 2016/10/25 22:56:17.768990 model.go:964: INFO: Connection to R7DNAZX closed: read timeout

So I’ve tried several restarts, and it looks like they disconnect after 10 minutes and the nodes don’t attempt to reconnect. The log is still being written, so Syncthing is still running on both nodes. I can telnet to port 22000 on both nodes from each other too, but they just don’t seem to want to talk to each other!

Connect/Disconnect events:

Node A
[R7DNA] 2016/10/27 00:01:39.719067 service.go:257: INFO: Established secure connection to SVKBTN2 at 192.168.1.20:22000-192.168.1.18:57589 (TCP (Server))
[R7DNA] 2016/10/27 00:17:08.804218 model.go:964: INFO: Connection to SVKBTN2 closed: read timeout

Node B
[SVKBT] 2016/10/27 00:01:32.433693 service.go:257: INFO: Established secure connection to R7DNAZX at 192.168.1.18:57589-192.168.1.20:22000 (TCP (Client))
[SVKBT] 2016/10/27 00:16:34.663298 model.go:964: INFO: Connection to R7DNAZX closed: read timeout

Errors in the logs after disconnection

These are rolling through the logs…

Node A
[R7DNA] 2016/10/27 09:34:18.210760 rwfolder.go:223: DEBUG: rwFolder/RAW-PICS@0x13c68210 skip (curSeq == prevSeq) 0 true
[R7DNA] 2016/10/27 09:34:18.212761 rwfolder.go:223: DEBUG: rwFolder/RAW-Temp@0x13c68370 skip (curSeq == prevSeq) 59 true

Node B
[SVKBT] 2016/10/27 09:28:54.423784 progressemitter.go:63: DEBUG: progress emitter: timer - looking after 6
[SVKBT] 2016/10/27 09:28:54.423784 progressemitter.go:82: DEBUG: progress emitter: nothing new

I suggest you check resource usage on the machines, the health of the storage, and antivirus, because judging by what you described, your storage just hangs for some reason.

No, the storage doesn’t hang. I’ve got other applications reading from and writing to the same storage without issue. I also have different folders on different physical spindles. Other resources are fine, with memory and CPU free. Antivirus is disabled.

I would at least hope Syncthing attempts to reconnect if it disconnects due to slow storage once that storage is available again.

Also, these machines have been running Syncthing for a couple of years now without any issues. This is a recent issue with no changes to the hardware or software. If anything, the only thing that’s changed is that I upgraded Syncthing to v0.14.

Well, you can see reconnect attempts with STTRACE=connections.
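On Windows that might look something like this from a command prompt (a minimal sketch, assuming syncthing.exe is on the PATH and is normally started from a console; the variable only applies to that session):

rem Enable connection-level debug output for this console session,
rem then start Syncthing from the same console so it inherits the setting.
set STTRACE=connections
syncthing.exe -no-browser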

If you can pinpoint which exact release broke it for you, we could see what changed.

Unfortunately I can’t as the issue spread through the folders slowly.

Changing the settings on Node A to use the fixed IP of Node B rather than dynamic discovery appears to have at least stopped the disconnection issues, and they appear to be connected (although the Web GUI is still throwing the local connection error, so I can only go by netstat).

C:\>netstat -an | find "22000"
TCP 0.0.0.0:22000 0.0.0.0:0 LISTENING
TCP 192.168.1.18:51199 192.168.1.20:22000 ESTABLISHED
TCP 192.168.1.18:56534 192.168.1.20:22000 ESTABLISHED
TCP [::]:22000 [::]:0 LISTENING

For the file in question above, here’s the log output from Node B (which needs the file). Node A has the file OK, and there’s nothing in the log on Node A referencing this file, DSC00966.ARW.

[SVKBT] 2016/10/27 22:56:55.524970 main.go:1011: INFO: Restarting
[SVKBT] 2016/10/27 23:11:17.097206 blockqueue.go:46: DEBUG: open: open \\?\Z:\RAW-Temp\RAW-2015-12-15-Christmas-Show\~syncthing~DSC00966.ARW.tmp: The system cannot find the file specified.
[SVKBT] 2016/10/27 23:11:17.524230 rwfolder.go:1026: DEBUG: rwFolder/RAW-Temp@0xc0426dd040 need file RAW-2015-12-15-Christmas-Show\DSC00966.ARW; copy 129, reused []
[SVKBT] 2016/10/27 23:11:18.818304 progressemitter.go:212: DEBUG: progress emitter: registering RAW-Temp RAW-2015-12-15-Christmas-Show\DSC00966.ARW
[SVKBT] 2016/10/27 23:11:19.098320 sharedpullerstate.go:251: DEBUG: sharedPullerState RAW-Temp RAW-2015-12-15-Christmas-Show\DSC00966.ARW pullNeeded start -> 1
[SVKBT] 2016/10/27 23:11:19.364335 sharedpullerstate.go:215: INFO: Puller (folder "RAW-Temp", file "RAW-2015-12-15-Christmas-Show\DSC00966.ARW"): pull: peers who had this file went away, or the file has changed while syncing. will retry later
[SVKBT] 2016/10/27 23:11:19.518344 rwfolder.go:1326: DEBUG: rwFolder/RAW-Temp@0xc0426dd040 closing RAW-2015-12-15-Christmas-Show\DSC00966.ARW
[SVKBT] 2016/10/27 23:11:19.969370 progressemitter.go:224: DEBUG: progress emitter: deregistering RAW-Temp RAW-2015-12-15-Christmas-Show\DSC00966.ARW

Well, the message is somewhat clear - you got disconnected, or something along those lines… Given that the UI is cutting out with disconnection errors, it might be that Syncthing is crashing, but you should see it restart in the console. Alternatively, look in the browser debug tools at which requests are failing to return and why.

The thing is, the process isn’t crashing - it’s still writing to the log files and there are no startup messages logged.

The two nodes are still connected as per the netstat output.

TCP connections can stay around for a while after being torn down, in the TIME_WAIT state. Check -help and see how to run with a deadlock detector; set it to 5 minutes on both sides.

So I’ve just read that there are two syncthing processes, so I guess the outer “wrapper” process is running OK and logging, but the inner “sync” process is crashing. This is happening on both nodes, so I guess it’s a bug or an unhandled exception. What’s the best way to raise a bug/issue? I’m happy to debug and troubleshoot.

Here’s the Web GUI with the browser debugging - it looks like it’s an infinite loop loading elements.

and here’s the end of the page…

It looks like a deadlock we are trying to hunt down (there is an issue open). I suggest you enable the deadlock detector and look for the issue on GitHub for the instrumented build.

Are you running on a Mac?

Nope, Windows for both systems - Win10 x86 for Node A and Win7 x64 for Node B.

I’m new to debugging Syncthing and Google doesn’t appear to want to help :frowning:

Do you mean this release or is this just the standard release?
https://build.syncthing.net/job/syncthing-release-windows/

Do I need to use any special parameters to enable the deadlock detector or anything?

Strange - the zip is called syncthing-windows-386-v0.14.9+16-g1cdfef4.zip, but the syncthing.exe reports v0.13.10 and builds an index-v0.13.0.db folder, so I guess that’s not what I need…

Is there a page showing the debugging steps somewhere?

It’s in the docs. You want to run with STTRACE=sync if you are using the latest development snapshot from build.syncthing.net, and STDEADLOCKTIMEOUT=300
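On Windows that could be done along these lines (a sketch only, assuming a console start of the snapshot build and syncthing.exe on the PATH; the timeout value is in seconds, so 300 = 5 minutes):

rem Enable sync-package tracing and the deadlock detector for this session,
rem then start Syncthing so it picks up both variables.
set STTRACE=sync
set STDEADLOCKTIMEOUT=300
syncthing.exe -no-browser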

I thought I’d got the latest dev snapshot build but that appears to be v0.13.10 even though the zip it’s in is named 0.14.9. Is something broken with the automated build process or is there another download link?

I also don’t understand what you mean by the deadlock detector - is this a parameter to syncthing or a separate utility? I’ve got Syncthing-Utils, but none of the exes sound like they’re to do with deadlocks.