Syncthing process stalled after Windows hibernation

This has happened after a system hibernation in Windows. Basically, Syncthing had been working fine until the OS was hibernated, but then refused to “become alive” again once the OS was resumed from it later.

This is the current state. As you can see, the process is still “running”, but there’s no CPU or I/O activity, and the RAM usage is lower than normal. The device itself appears “disconnected”, and the Web GUI isn’t loading either.

image

Is there anything I can do to debug the problem in this state? I believe I’ve experienced a similar issue before a few times, but I had no idea what was going on then. This time, I’m sure this is connected with hibernation (based on the “last seen” time and the event timing recorded in Windows logs).

Syncthing is v1.18.6 (custom built) and the OS is Windows 10 x86.

What does the logfile have to say? Can you reproduce the problem? What changes have you made in your custom build? If you have made changes, can you reproduce with an unmodified instance of the latest release?

1 Like

This can’t be reproduced on demand. I could try to keep hibernating and resuming the system until the problem recurs though. The logfile, at least the standard non-debug one, doesn’t help here, as it just stops recording events when the OS enters hibernation and never comes back to life after that.

I doubt this has got anything to do with the custom build though, as a) my changes are mostly cosmetic, and b) I experienced similar problems way in the past when I was still using the official builds (only then I didn’t connect the dots that it could be related to hibernation).

What I would be interested in the most is to somehow debug or terminate the currently stalled process in a way that would generate something useful (e.g. a dump, etc.). I’ve got little idea how to do something like this on Windows, unfortunately.

Next time I have my hands on hardware with Windows 10 on it I’ll check it but this isn’t something I’ve seen before. Sysinternals’ Process Explorer can generate dumps of processes. I’m more accustomed to diagnosing memory crash dumps but I am happy to look at a process dump if you can get one. Keep in mind it’ll have things like API keys, HTTPS private key material etc. in it. You could set-up a separate test instance and keep it running; if that also suffers then you could send me a dump and it won’t have anything confidential in it.

There is an option called “Restart On Wakeup” in advanced config that may be related.

Thank you very much for your help offer. I’ve finally come back to the question at hand, and I’ve also taken the dumps with Process Explorer. However, I’ve also installed and opened them with “Debugging Tools for Windows” first, and there’s quite a lot of information there that I’m not necessarily comfortable sharing, so I’ve decided to try running the debugger/analyser on my own.

This is the output of windbg.exe for the main syncthing.exe process.

Debug session time: Sat Feb  5 19:40:52.000 2022 (UTC + 1:00)
System Uptime: 12 days 3:47:53.671
Process Uptime: 6 days 3:27:03.000
...........................
For analysis of this file, run !analyze -v
eax=0000007f ebx=00000000 ecx=01b4f5d4 edx=77682750 esi=00000000 edi=01b4f628
eip=77682750 esp=01b4f5d4 ebp=01b4f64c iopl=0         nv up ei pl zr na pe cy
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000247
ntdll!KiFastSystemCallRet:
77682750 c3              ret
0:000> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

*** WARNING: Unable to verify timestamp for Syncthing.exe

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 2421

    Key  : Analysis.DebugAnalysisManager
    Value: Create

    Key  : Analysis.Elapsed.mSec
    Value: 23961

    Key  : Analysis.Init.CPU.mSec
    Value: 3749

    Key  : Analysis.Init.Elapsed.mSec
    Value: 67244

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 76

    Key  : Timeline.OS.Boot.DeltaSec
    Value: 1050473

    Key  : Timeline.Process.Start.DeltaSec
    Value: 530823

    Key  : WER.OS.Branch
    Value: vb_release

    Key  : WER.OS.Timestamp
    Value: 2019-12-06T14:06:00Z

    Key  : WER.OS.Version
    Value: 10.0.19041.1

    Key  : WER.Process.Version
    Value: 1.18.6.0


NTGLOBALFLAG:  0

PROCESS_BAM_CURRENT_THROTTLED: 0

PROCESS_BAM_PREVIOUS_THROTTLED: 0

APPLICATION_VERIFIER_FLAGS:  0

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00000000
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 0

FAULTING_THREAD:  00000b14

PROCESS_NAME:  Syncthing.exe

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.

EXCEPTION_CODE_STR:  80000003

STACK_TEXT:  
01b4f5d0 7767f60a     756dafd2 00000118 01b4f6fc ntdll!KiFastSystemCallRet
01b4f5d4 756dafd2     00000118 01b4f6fc 00000040 ntdll!NtRemoveIoCompletionEx+0xa
01b4f64c 00566ef5     00000118 01b4f6fc 00000040 KERNELBASE!GetQueuedCompletionStatusEx+0x82
WARNING: Stack unwind information not available. Following frames may be wrong.
01b4f688 756db008     756db025 005333f5 00566ed0 Syncthing+0x66ef5
01b4f6ec 00566ef5     00000118 0173b3a0 00000000 KERNELBASE!GetQueuedCompletionStatusEx+0xb8
00000000 00000000     00000000 00000000 00000000 Syncthing+0x66ef5


STACK_COMMAND:  ~0s; .ecxr ; kb

SYMBOL_NAME:  kernelbase!GetQueuedCompletionStatusEx+82

MODULE_NAME: KERNELBASE

IMAGE_NAME:  KERNELBASE.dll

FAILURE_BUCKET_ID:  BREAKPOINT_80000003_KERNELBASE.dll!GetQueuedCompletionStatusEx

OS_VERSION:  10.0.19041.1

BUILDLAB_STR:  vb_release

OSPLATFORM_TYPE:  x86

OSNAME:  Windows 10

IMAGE_VERSION:  10.0.19041.1466

FAILURE_ID_HASH:  {9ee8acbb-8b05-fc0f-684d-27495ffb8928}

Followup:     MachineOwner
---------

Below is another one for the monitor syncthing.exe process.

Debug session time: Sat Feb  5 19:42:49.000 2022 (UTC + 1:00)
System Uptime: 12 days 3:49:40.795
Process Uptime: 12 days 3:48:29.000
....................
For analysis of this file, run !analyze -v
eax=0000000b ebx=0173b518 ecx=004ffd14 edx=77682750 esi=00000000 edi=00000110
eip=77682750 esp=004ffd14 ebp=004ffd88 iopl=0         nv up ei pl zr na pe cy
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000247
ntdll!KiFastSystemCallRet:
77682750 c3              ret
0:000> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

*** WARNING: Unable to verify timestamp for syncthing.exe

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 1530

    Key  : Analysis.DebugAnalysisManager
    Value: Create

    Key  : Analysis.Elapsed.mSec
    Value: 1492

    Key  : Analysis.Init.CPU.mSec
    Value: 1046

    Key  : Analysis.Init.Elapsed.mSec
    Value: 9032

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 74

    Key  : Timeline.OS.Boot.DeltaSec
    Value: 1050580

    Key  : Timeline.Process.Start.DeltaSec
    Value: 1050509

    Key  : WER.OS.Branch
    Value: vb_release

    Key  : WER.OS.Timestamp
    Value: 2019-12-06T14:06:00Z

    Key  : WER.OS.Version
    Value: 10.0.19041.1

    Key  : WER.Process.Version
    Value: 1.18.6.0


NTGLOBALFLAG:  0

PROCESS_BAM_CURRENT_THROTTLED: 0

PROCESS_BAM_PREVIOUS_THROTTLED: 0

APPLICATION_VERIFIER_FLAGS:  0

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00000000
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 0

FAULTING_THREAD:  000010e0

PROCESS_NAME:  syncthing.exe

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.

EXCEPTION_CODE_STR:  80000003

STACK_TEXT:  
004ffd10 7767e7da     7561d110 00000110 00000000 ntdll!KiFastSystemCallRet
004ffd14 7561d110     00000110 00000000 00000000 ntdll!NtWaitForSingleObject+0xa
004ffd88 7561d052     00000110 ffffffff 00000000 KERNELBASE!WaitForSingleObjectEx+0xb0
004ffd9c 00566ef5     00000110 ffffffff 00565e8e KERNELBASE!WaitForSingleObject+0x12
WARNING: Stack unwind information not available. Following frames may be wrong.
004ffdc8 00564cb1     00564bd2 00564bd2 00564bf5 syncthing+0x66ef5
004ffdcc 00564bd2     00564bd2 00564bf5 00564bd9 syncthing+0x64cb1
004ffdd0 00564bd2     00564bf5 00564bd9 7756cfc9 syncthing+0x64bd2
004ffdd4 00564bf5     00564bd9 7756cfc9 004ffe74 syncthing+0x64bd2
004ffdd8 00564bd9     7756cfc9 004ffe74 00000000 syncthing+0x64bf5
004ffddc 7756cfc9     004ffe74 00000000 00000000 syncthing+0x64bd9
004ffe54 7756cfc9     004ffe74 00000000 00000000 kernel32!BaseThreadInitThunk+0x19
004ffe64 7756cfc9     004ffe74 7756cfc9 00299000 kernel32!BaseThreadInitThunk+0x19
004ffe6c 7756cfc9     00299000 7756cfb0 004ffed8 kernel32!BaseThreadInitThunk+0x19
004ffe74 7756cfaf     004ffed8 776126c5 00299000 kernel32!BaseThreadInitThunk+0x19
004ffe7c 776126c5     00299000 3b512625 00000000 kernel32!TermsrvSyncUserIniFile+0x7f
004ffed8 77612699     ffffffff 77695c99 00000000 ntdll!__RtlUserThreadStart+0x2b
004ffee8 00000000     00566dd0 00299000 00000000 ntdll!_RtlUserThreadStart+0x1b


STACK_COMMAND:  ~0s; .ecxr ; kb

SYMBOL_NAME:  kernelbase!WaitForSingleObjectEx+b0

MODULE_NAME: KERNELBASE

IMAGE_NAME:  KERNELBASE.dll

FAILURE_BUCKET_ID:  BREAKPOINT_80000003_KERNELBASE.dll!WaitForSingleObjectEx

OS_VERSION:  10.0.19041.1

BUILDLAB_STR:  vb_release

OSPLATFORM_TYPE:  x86

OSNAME:  Windows 10

IMAGE_VERSION:  10.0.19041.1466

FAILURE_ID_HASH:  {6e08debd-ee31-ce56-32ef-878b2e67a5f2}

Followup:     MachineOwner
---------

Are these logs of any use, and if yes, can you see anything obvious in there?

Just looks like its blocked on io.

Presumably a network call as it went to hibernation.

Thats exactly the reason we restart after sleep, but I guess you are saying that is not working?

Do you have GOMAXPROC set to something absurdly small etc?

Normally it does restart. That time it didn’t. It’s not just that it didn’t restart though, but the process became completely unresponsive after the OS was resumed from hibernation, and that state lasted for 6 days (until I took the dumps, then killed the process and restarted Syncthing manually).

Yeah, it’s set to 1. The CPU on this device is weak, and some of the files are located on a slow storage too (i.e. an SD card), so I prefer to reduce the load and have only 1 folder being scanned at a time, just to be able to still use the thing while Syncthing is running.

Just for the record, normally there’re zero issues regarding accessing the Web GUI and the synchronisation in general. Everything is always very responsive, even when syncing a lot of files at a time. This particular issue only happened after the hibernation/resume cycle. Also, the database is located on the faster eMMC storage, and only some of the large folders on the slow SD card.

For the sake of completion, below is a screenshot from the Web GUI, showing the usual state when everything is working normally.

Yeah if you got only a single thread and that is blocked on io due to hibernation, there is no chance for the routine that does the “omg time jumped lets exit” to run.

I guess I am not sure how the web ui works in this case. (Edit, its not clear from your message that the ui worked post this lockup)

You either have to wait for the thread to unblock which for tcp could be hours, or run with more than a single thread.

1 Like

Thanks for the tip. I will increase GOMAXPROCS to 2 then. Does this mean that a single-core system is essentially doomed when something similar happens there?

No, no, the Web GUI was completely inaccessible too. Everything Syncthing-related basically stopped working, except for the syncthing.exe processes still hanging there in the background.

Yeah I don’t think you can reliably run it with gomaxproc=1, to be honest, not sure if you can even run it with gomaxproc=2, as setting a limit I think (not sure) forces io to use runtime threads, where as if its not set it spawns a thread per io operation, so by setting it to 2, just means you’ll end up in the same situation if there are two io operations at the same time.

The number of cores has no relation to how that single core schedules threads.

You can have a single core and multiple threads and it will still work.

1 Like

Does this mean that running Syncthing on a single-core machine with no GOMAXPROCS set is different than setting GOMAXPROCS to 1 on a multi-core one? I’ve always had the impression that the two lead to the same result.

I don’t think GOMAXPROCS applies to I/O threads at all, really. Running with it set to one should work imho.

Hmm, but wouldn’t that somewhat invalidate the current findings :smirk:?

If that’s the case, then something else must be the culprit here… I don’t think I tweak anything inside Syncthing itself that could possibly affect I/O or CPU performance though…

Just to reiterate, we’re talking about a situation where the process was stalled for 6 days straight (and would probably have stayed like that had I in the end not killed it manually).

I don’t see any findings other than you seeing process hangs. Many things could cause that, for example a database read or write taking “forever” for whatever reason.

What you could do is run with --debug-profiler-listen ($STPROFILER) and see if you can talk to the profiler when it’s hung. For example if you run it with STPROFILER=localhost:9090 you can get a goroutine trace at http://localhost:9090/debug/pprof/goroutine?debug=1. That would show what is wedged where.

If it’s dead enough that you can’t even talk to the profiler then your culprit is in the Go runtime or the OS, and we can’t help you any more.

“Findings” as in Audrius suggesting that GOMAXPROCS might be responsible for the process stall :wink:.

The profiler wasn’t turned on, unfortunately, so I couldn’t do anything else than to save the Windows process dumps. It is enabled now, so if the problem re-occurs, it should be ready. This isn’t something experienced on a regular basis though. I’ve probably seen similar stalls maybe a few times throughout my whole time of using Syncthing. I can, of course, try to repeatedly hibernate and resume the OS on purpose to possibly force it to happen.

So my claim is based on the fact the debug tool output only has a single thread and a single stack that is stuck on io

Also, if we claim that go does spawn a thread per io operation, and the debug tool only dumps the entrypoint/main thread, why would go runtime do io on the main thread if it was spawning a thread per io operation?

I have no idea what the Windows debug thing purports to show, but I’m certain I/O threads don’t count under GOMAXPROCS. This is easily demonstrated by starting Syncthing with GOMAXPROCS=1 and seeing that it has 9 or so threads running (on my trival test setup).

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.