100% CPU after a while and all logging stops

This happens on two Syncthing instances connected to each other (and to others as well). I start Syncthing via systemd: systemctl start syncthing@user

Syncthing Version: v1.4.2 (I have been using the same config on both sides for months; the problem started with this version)

OS Version: Arch Linux

The following env vars are set in systemd:

Environment=GOMAXPROCS=1 STTRACE=all

and

Environment=all_proxy='socks5://localhost:8010' ALL_PROXY_NO_FALLBACK=1 http_proxy= https_proxy= GOMAXPROCS=1 STTRACE=all
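For completeness, a rough sketch of how such Environment= lines can be set with a systemd drop-in (created e.g. with systemctl edit syncthing@user); the exact file layout on my machines may differ slightly:

# /etc/systemd/system/syncthing@user.service.d/override.conf
[Service]
Environment=GOMAXPROCS=1 STTRACE=all
# on the instance behind the corporate proxy, additionally:
#Environment=all_proxy='socks5://localhost:8010' ALL_PROXY_NO_FALLBACK=1 http_proxy= https_proxy=

After editing, systemctl daemon-reload followed by systemctl restart syncthing@user picks up the change.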

All goes well for a while: the two instances are connected to each other and syncing perfectly, but after a while (I can't associate it with any event) the CPU usage of one of the instances goes up to 100% and it disconnects. The log stops here:

Apr 17 12:27:04 VM syncthing[81427]: [YSBJ4] DEBUG:   addresses:  [quic://46.139.64.79:1033 quic://46.139.64.79:22000 relay://163.172.44.215:22067/?id=AFN6YPK-MA5PKES-HT5T7AC-DTKET23-VAHTBZ7-CCDOAYY-7XZSR4X-MQWJ4A4&pingInterval=1m0s&networkTimeout=2m0s&sessionLimitBps=0&globalLimitBps=0&statusAddr=:22070&providedBy=XLU.SH relay://88.99.175.206:22067/?id=L5BVTMV-ICUMRZL-MHQEHNC-LDM4LEE-YIAQWA7-VYAICZ2-MWD4O2Q-CQ2APQB&pingInterval=1m0s&networkTimeout=2m0s&sessionLimitBps=0&globalLimitBps=4000000&statusAddr=:22070&providedBy=effekthasch tcp://46.139.64.79:22000]
Apr 17 12:27:04 VM syncthing[81427]: [YSBJ4] DEBUG: Reconnect loop for GPINCSM-ZFINXRS-3H2PKNC-UHSWD4M-TYLKFSY-ZLKO4JZ-2GSUEQ5-4QZYOAN [quic://46.139.64.79:1033 quic://46.139.64.79:22000 relay://163.172.44.215:22067/?id=AFN6YPK-MA5PKES-HT5T7AC-DTKET23-VAHTBZ7-CCDOAYY-7XZSR4X-MQWJ4A4&pingInterval=1m0s&networkTimeout=2m0s&sessionLimitBps=0&globalLimitBps=0&statusAddr=:22070&providedBy=XLU.SH relay://88.99.175.206:22067/?id=L5BVTMV-ICUMRZL-MHQEHNC-LDM4LEE-YIAQWA7-VYAICZ2-MWD4O2Q-CQ2APQB&pingInterval=1m0s&networkTimeout=2m0s&sessionLimitBps=0&globalLimitBps=4000000&statusAddr=:22070&providedBy=effekthasch tcp://46.139.64.79:22000]
Apr 17 12:27:04 VM syncthing[81427]: [YSBJ4] DEBUG: Not dialing using Relay Dialer as priority is less than current connection (200 >= 200)
Apr 17 12:27:04 VM syncthing[81427]: [YSBJ4] DEBUG: Not dialing using Relay Dialer as priority is less than current connection (200 >= 200)
Apr 17 12:27:04 VM syncthing[81427]: [YSBJ4] DEBUG: dialing GPINCSM-ZFINXRS-3H2PKNC-UHSWD4M-TYLKFSY-ZLKO4JZ-2GSUEQ5-4QZYOAN tcp://46.139.64.79:22000 prio 10
Apr 17 12:27:04 VM syncthing[81427]: [YSBJ4] DEBUG: Dialing no fallback result tcp 46.139.64.79:22000: <nil> socks connect tcp localhost:8010->46.139.64.79:22000: unknown error general SOCKS server failure
Apr 17 12:27:04 VM syncthing[81427]: [YSBJ4] DEBUG: dialing GPINCSM-ZFINXRS-3H2PKNC-UHSWD4M-TYLKFSY-ZLKO4JZ-2GSUEQ5-4QZYOAN tcp://46.139.64.79:22000 error: socks connect tcp localhost:8010->46.139.64.79:22000: unknown error general SOCKS server failure
Apr 17 12:27:04 VM syncthing[81427]: [YSBJ4] DEBUG: failed to connect to GPINCSM-ZFINXRS-3H2PKNC-UHSWD4M-TYLKFSY-ZLKO4JZ-2GSUEQ5-4QZYOAN 10
Apr 17 12:27:04 VM syncthing[81427]: [YSBJ4] DEBUG: dialing GPINCSM-ZFINXRS-3H2PKNC-UHSWD4M-TYLKFSY-ZLKO4JZ-2GSUEQ5-4QZYOAN quic://46.139.64.79:1033 prio 100
Apr 17 12:27:04 VM syncthing[81427]: [YSBJ4] DEBUG: dialing GPINCSM-ZFINXRS-3H2PKNC-UHSWD4M-TYLKFSY-ZLKO4JZ-2GSUEQ5-4QZYOAN quic://46.139.64.79:22000 prio 100

The GUI is still responsive and still shows the other instance as connected, which is wrong, and no files are synced anymore. I have to restart Syncthing to start over, but after a while the problem repeats. It happens on both sides (not at the same time), seemingly independent of the other instances.

Why do you have STTRACE=all?

That logs pretty much every database operation, effectively producing a second database in the logs.


I just turned it on to see if there is anything related to the 100% CPU, so it is not permanent, of course.

Is it the Syncthing process that is at 100% CPU usage? Syncthing scans periodically, which uses CPU; are you sure it's not just scanning?

Yes, htop shows one CPU core driven at 100%. If Syncthing were scanning, it would show something in the logs, I guess (especially at debug level), and it also would not disconnect. Also, the CPU load stays at 100% permanently, for hours, until I stop it. A graceful SIGTERM does not do it; I have to SIGKILL it. Moreover, the local Syncthing-GTK cannot connect to the process; it just shows an empty window. To the best of my knowledge, nothing has changed in the configs since the previous version, which worked flawlessly. I don't think scanning triggers the load, since it can work normally for several hours and I have left the full rescan interval at the default 3600 seconds for every folder.

Check the docs on how to run a CPU profile, or send a SIGQUIT (which will dump the stack trace) to see what it's doing.

https://docs.syncthing.net/users/profiling.html
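For example, with the systemd setup from the first post, something along these lines should capture the dump (the unit name is taken from that post; the PID is a placeholder):

# send SIGQUIT to the Syncthing process that is spinning the CPU
# (pidof may list two PIDs, the monitor and the actual worker; pick the busy one)
kill -QUIT <pid>
# the goroutine dump goes to stderr, which systemd routes to the journal:
journalctl -u syncthing@user --since "10 minutes ago" > syncthing-stacktrace.txt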

I sent a SIGQUIT while Syncthing was spinning the CPU; here is the result:

Nothing looks out of the ordinary.

Try running a CPU profile or capture and post the logs around and after the time it goes into this 100% cpu usage cycle (or ideally the whole log leading up to that point).

Should I leave the STTRACE=all ?

Not at first.

Also a CPU profile would be nice.
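For example, the profile can be fetched from the REST API; a sketch, assuming the default GUI address and the API key from the GUI settings:

# fetch a CPU profile (the endpoint samples for a while before responding)
curl -H "X-API-Key: <api key from the GUI>" \
     -o syncthing-cpu.pprof http://localhost:8384/rest/debug/cpuprof
# alternatively, setting STCPUPROFILE=1 in the environment makes Syncthing
# write a cpu-*.pprof file on exit (see the profiling docs linked above)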

DB compaction has been running for 92 minutes:

ápr 17 19:45:49 HomeC syncthing[125363]: goroutine 18 [select, 92 minutes]:
ápr 17 19:45:49 HomeC syncthing[125363]: github.com/syndtr/goleveldb/leveldb.(*DB).tCompaction(0xc000120380)
ápr 17 19:45:49 HomeC syncthing[125363]:         /build/syncthing/src/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20190923125748-758128399b1d/leveldb/db_compaction.go:836 +0x25b

And something about a compaction error:

ápr 17 19:45:49 HomeC syncthing[125363]: goroutine 15 [select, 92 minutes]:
ápr 17 19:45:49 HomeC syncthing[125363]: github.com/syndtr/goleveldb/leveldb.(*DB).compactionError(0xc000120380)
ápr 17 19:45:49 HomeC syncthing[125363]:         /build/syncthing/src/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20190923125748-758128399b1d/leveldb/db_compaction.go:91 +0xcf

Maybe goleveldb is misbehaving. A CPU profile should indeed shed more light.

I went to http://localhost:8384/rest/debug/cpuprof and waited for about an hour, but no file download was offered (I checked, and it works when Syncthing is running normally). Finally I gave up and sent a SIGQUIT to Syncthing. I saved the full log of that run, but may I send it privately? It contains a lot of personal data I don't want to expose publicly here.

You could also try to back up the data, move the index/database away and see if a rebuild helps.
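Roughly, assuming the default config location on Linux and the systemd unit from the first post:

systemctl stop syncthing@user
# keep the old index around in case it needs to be restored later
mv ~/.config/syncthing/index-v0.14.0.db ~/.config/syncthing/index-v0.14.0.db.bak
systemctl start syncthing@user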

Thanks for that idea. I stopped Syncthing, deleted /home/user/.config/syncthing/index-v0.14.0.db on both ends and started again. Let's see how it goes in the long run.

Nah, it didn't take 10 minutes before one instance was driving the CPU at 100% again. I tried /rest/debug/cpuprof again, but to no avail.

The metric "it gets to 100% CPU" is not really helpful: if you remove the database, Syncthing needs to hash all your data, and 100% CPU is expected (good) behaviour for that. Please describe the actual problems you are having. And if you want to be sure that the CPU usage is caused by scanning (though there is no stated reason to doubt it), enable scanner debug logging. I am OK with having a look at logs sent privately to me, though that comes with no guarantee of any kind regarding support :)
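For scanner debug logging specifically, a sketch reusing the systemd drop-in shown earlier (it can also be toggled at runtime in the GUI's log/debug view):

# limit debug output to the scanner instead of STTRACE=all
[Service]
Environment=GOMAXPROCS=1 STTRACE=scanner
# then: systemctl daemon-reload && systemctl restart syncthing@user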


Here is what happened:

  • Stopped Syncthing
  • Deleted /home/frank/.config/syncthing/index-v0.14.0.db
  • Started Syncthing
  • Syncthing scanned and hashed my data and rebuilt the database. During this time it used the CPU, but never at 100%, let alone constantly.
  • Syncthing finished hashing, reconnected to all peers in the meantime, and everything became quiet.
  • After roughly 10 minutes, Syncthing started to drive one core of the CPU at a constant 100% according to htop. Nothing is logged, and there is no file to download at localhost:8384/rest/debug/cpuprof.

I tried to list all details about what happened. This same scenario repeats itself after a restart, over and over, randomly at both sites. I have other Syncthing instances with the same version connected to those two that do not exhibit this behaviour; only those two act like this. One is behind a corporate firewall, connecting through a SOCKS proxy; the other is at my home behind a router. No manual port forwarding is used; the instances simply find each other through external relay servers. I have now stopped the instance behind the firewall and left the home instance running. Several hours have passed with no problems so far. I'll wait for a day or two to see how it goes.


calmh wrote:

Looks to me like it’s processing network data of some kind (the runnable routines) and is otherwise idle. But this is also a build not made by us, using a Go version we haven’t switched to yet, so I’m not sure. I’d suggest trying one of our binaries from GitHub, just for fun.

AudriusButkevicius wrote:

One of the routines is QUIC, so try our build first, and if that does not help, you could try setting the sync protocol listen address to tcp://0.0.0.0:22000 instead of the default.
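For anyone wanting to try that: the listen address can be changed in the GUI under Settings → Connections (Sync Protocol Listen Addresses), or directly in config.xml while Syncthing is stopped; a rough sketch of the relevant option:

<!-- ~/.config/syncthing/config.xml, inside <options> -->
<listenAddress>tcp://0.0.0.0:22000</listenAddress>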

I have replaced /usr/bin/syncthing with your binary (from syncthing-linux-amd64-v1.4.2) on both ends and left them running for a day. I can happily report that everything is normal: Syncthing syncs data and consumes hardly any CPU cycles.
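Roughly what I did on each machine (the tarball URL follows the usual GitHub releases layout; adjust for the version in use):

systemctl stop syncthing@user
curl -LO https://github.com/syncthing/syncthing/releases/download/v1.4.2/syncthing-linux-amd64-v1.4.2.tar.gz
tar -xzf syncthing-linux-amd64-v1.4.2.tar.gz
sudo cp syncthing-linux-amd64-v1.4.2/syncthing /usr/bin/syncthing
systemctl start syncthing@user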

Should we conclude, then, that the new Go version (1.14.2) used by Arch Linux was the culprit?

Thank you guys for your help.


I don't think it's Go itself; it's most likely the dependencies or the compiler used by Arch.