[SOLVED] Web GUI becoming unresponsive for periods of 5-10 minutes

First of all, I am sorry for asking for help again.

This topic is somewhat related to “Unresponsive Web GUI - pause and resume is not registered” and to “Deadlock on folder unsubscribe” (Issue #6559 · syncthing/syncthing · GitHub).

Basically, once in a while, the Web GUI becomes unresponsive for about 5-10 minutes. While it is stuck, all folders are listed as “Unknown”, and no input is registered in the browser. However, the log file still seems to be written to. Then the GUI suddenly comes back to life.

@AudriusButkevicius asked me to do the following:

On Linux you can send a SIGQUIT (but ideally after it has been stuck for some time), on Windows, I think you’d have to run with pprof enabled (docs.syncthing.net/users/syncthing.html STPROFILER) and then get the traces from http://<address>/debug/pprof/goroutine?debug=2

Here is the trace, taken right after the GUI had been stuck for 5 minutes and then became responsive again.

full goroutine stack dump.txt (146.1 KB)

Is there anything useful there that would explain the culprit?

Thank you very much for all the help in advance.

I don’t see anything obviously wrong in that trace…

If I understand this right and you got the trace after the GUI became responsive again, that’s likely too late. Being responsive again likely means that whatever was blocking progress before is now gone.

So this came from pprof, right? Is that endpoint (/debug/pprof/…) also unresponsive when this happens, so that you can’t get a trace when it’s stuck?

No, no, I was just unsure when to take the trace. That one was indeed taken after the UI had become responsive again. I will try to take another one once it gets stuck again. The /debug/pprof/ endpoint seemed accessible all the time, regardless of the state of the UI.
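Since the pprof endpoint stays reachable while the UI is stuck, grabbing the dump at the right moment could be scripted. Below is a minimal sketch, assuming the profiler is listening on the address set via STPROFILER; the helper names `goroutineDumpURL` and `saveDump`, the 30-second timeout and the output file name are made up for illustration and are not part of Syncthing.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// goroutineDumpURL builds the pprof URL quoted earlier in the thread.
// address is host:port of the profiler listener (whatever STPROFILER was set to).
func goroutineDumpURL(address string) string {
	return "http://" + address + "/debug/pprof/goroutine?debug=2"
}

// saveDump downloads the full goroutine dump and writes it to path.
func saveDump(address, path string) error {
	client := &http.Client{Timeout: 30 * time.Second} // arbitrary timeout
	resp, err := client.Get(goroutineDumpURL(address))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: dump <address>")
		return
	}
	// Timestamped file name so that several dumps can be compared later.
	name := "goroutines-" + time.Now().Format("20060102-150405") + ".txt"
	if err := saveDump(os.Args[1], name); err != nil {
		fmt.Fprintln(os.Stderr, "dump failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote", name)
}
```

Run it with the profiler address as the only argument while the GUI is unresponsive; taking two dumps a minute or so apart makes it easier to see which goroutines stay blocked.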

I have also observed that the lockups seem to happen while synchronising data between LAN-connected local devices.

The UI locked up again on the Windows computer as soon as one of my Android devices connected to it. It seems that the lockups take place during the initial connection between Windows and the other devices. Once the connection stabilises, synchronisation seems to be going on smoothly.

This is a trace dump taken while the UI was completely unresponsive.

full goroutine stack dump.txt (646.2 KB)

There are a bunch of requests going both ways. You might be affected by https://github.com/syncthing/syncthing/issues/6583, if that isn’t an actual deadlock but “just” a very long-lasting, bad lock. To be sure, I’d need to check at a computer, tomorrow at the earliest. @AudriusButkevicius might be able to correct or confirm off the top of his head.

It seems it’s connected via QUIC rather than TCP, which might mean something, but I also believe you run your devices with abysmally small rate limits, correct?

The actual lockup is caused by:

goroutine 24 [select, 3 minutes]:
github.com/syncthing/syncthing/lib/protocol.(*rawConnection).send(0xc0001b4410, 0x10f9720, 0xc000400380, 0x10f9d60, 0xc0093ec060, 0x0, 0x0)
	C:/syncthing/lib/protocol/protocol.go:656 +0x15c
github.com/syncthing/syncthing/lib/protocol.(*rawConnection).DownloadProgress(0xc0001b4410, 0x10f9720, 0xc000400380, 0xc00d1d1f30, 0xb, 0xc009071d60, 0x1, 0x1)
	C:/syncthing/lib/protocol/protocol.go:344 +0xbe
github.com/syncthing/syncthing/lib/model.(*ProgressEmitter).sendDownloadProgressMessagesLocked(0xc0001e8280, 0x10f9720, 0xc000400380)
	C:/syncthing/lib/model/progressemitter.go:152 +0x59c
github.com/syncthing/syncthing/lib/model.(*ProgressEmitter).serve(0xc0001e8280, 0x10f9720, 0xc000400380)
	C:/syncthing/lib/model/progressemitter.go:92 +0x5d5
github.com/syncthing/syncthing/lib/util.AsService.func1(0x10f9720, 0xc000400380, 0xc000f5c420, 0x19)
	C:/syncthing/lib/util/utils.go:183 +0x40
github.com/syncthing/syncthing/lib/util.(*service).Serve(0xc000414b40)
	C:/syncthing/lib/util/utils.go:247 +0x149
github.com/thejerf/suture.(*Supervisor).runService.func1(0xc000407950, 0xc000000000, 0x10ef660, 0xc0001e8280)
	C:/go/pkg/mod/github.com/thejerf/suture@v3.0.2+incompatible/supervisor.go:600 +0x57
created by github.com/thejerf/suture.(*Supervisor).runService
	C:/go/pkg/mod/github.com/thejerf/suture@v3.0.2+incompatible/supervisor.go:588 +0x62

goroutine 40074 [semacquire]:
sync.runtime_SemacquireMutex(0xc0003f87cc, 0xc00549e600, 0x1)
	c:/go/src/runtime/sema.go:71 +0x4e
sync.(*Mutex).lockSlow(0xc0003f87c8)
	c:/go/src/sync/mutex.go:138 +0x103
sync.(*Mutex).Lock(0xc0003f87c8)
	c:/go/src/sync/mutex.go:81 +0x4e
github.com/syncthing/syncthing/lib/model.(*ProgressEmitter).BytesCompleted(0xc0001e8280, 0xc0094088bb, 0xb, 0x0)
	C:/syncthing/lib/model/progressemitter.go:263 +0x5c
github.com/syncthing/syncthing/lib/model.(*model).FolderProgressBytesCompleted(0xc0003ebb00, 0xc0094088bb, 0xb, 0xf)
	C:/syncthing/lib/model/model.go:853 +0x4d
github.com/syncthing/syncthing/lib/model.(*folderSummaryService).Summary(0xc000076b40, 0xc0094088bb, 0xb, 0x6, 0xc000152b88, 0xc0094e7924)
	C:/syncthing/lib/model/folder_summary.go:112 +0x951
github.com/syncthing/syncthing/lib/api.(*service).getDBStatus(0xc000152dc0, 0x10f7260, 0xc00016aa80, 0xc00018ef00)
	C:/syncthing/lib/api/api.go:692 +0x71
net/http.HandlerFunc.ServeHTTP(0xc000f4a150, 0x10f7260, 0xc00016aa80, 0xc00018ef00)
	c:/go/src/net/http/server.go:2007 +0x4b
net/http.(*ServeMux).ServeHTTP(0xc000be0d00, 0x10f7260, 0xc00016aa80, 0xc00018ef00)
	c:/go/src/net/http/server.go:2387 +0x1c4
github.com/syncthing/syncthing/lib/api.getPostHandler.func1(0x10f7260, 0xc00016aa80, 0xc00018ef00)

In effect, the GUI request (getDBStatus) needs to lock the ProgressEmitter, but that lock is held by goroutine 24, which has been stuck for 3 minutes sending a DownloadProgress message.

Hmm, not really.

I have only four devices connected at the moment. One of them is located in a different country, on a slow connection, and is limited to 25 KB/s upload (but not download).

The other three devices are local, all operating under the same WLAN. Two of them are Windows, one is Android. The lockups seem to happen on one of the Windows computers, when the other Windows or Android device connects, e.g. after a system reboot.

Also, all the local devices have their static IPs set in Syncthing. Only the remote computer is connected through QUIC. The local devices have no bandwidth limits set.

Could one such remote device cause these lockups?

Right, 25 KB/s is abysmally low, so this is most likely the cause.

Yes, one device can cause these lockups.

I still think it’s worth a ticket on GitHub, as we could reduce the lock contention in that area. Yet I suspect fixing that will just point to some other chokepoint.
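One common way to reduce this kind of contention is to copy what is needed while holding the lock, release it, and only then do the slow send. The sketch below illustrates that general idea with a toy model; it is not taken from Syncthing and is not necessarily what any fix there does. The `emitter` type, `sendProgress` and `readWhileSendBlocked` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// emitter is a hypothetical component shared by a sender and the GUI.
type emitter struct {
	mut   sync.Mutex
	bytes int64
	out   chan string // stands in for a slow connection
}

// sendProgress snapshots the state under the lock, then sends without it.
func (e *emitter) sendProgress() {
	e.mut.Lock()
	msg := fmt.Sprintf("completed=%d", e.bytes) // copy while locked
	e.mut.Unlock()
	e.out <- msg // the slow part no longer holds the lock
}

// BytesCompleted is what a GUI request would call.
func (e *emitter) BytesCompleted() int64 {
	e.mut.Lock()
	defer e.mut.Unlock()
	return e.bytes
}

// readWhileSendBlocked reads the counter before anyone drains the
// channel, i.e. while the send cannot have completed yet.
func readWhileSendBlocked() (int64, string) {
	e := &emitter{bytes: 42, out: make(chan string)}
	go e.sendProgress()
	n := e.BytesCompleted() // returns even though nobody reads e.out yet
	msg := <-e.out          // only now does the "peer" accept the message
	return n, msg
}

func main() {
	n, msg := readWhileSendBlocked()
	fmt.Println(n, msg) // prints "42 completed=42"
}
```

If the send were done while still holding the mutex, the read in `readWhileSendBlocked` could block forever, because the channel is only drained after the read returns. The trade-off is that the message sent may be slightly stale by the time the peer receives it.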

I get this unresponsiveness fairly regularly. Sometimes a browser refresh kicks the GUI back into life. I have also posted about this before and was pointed to USB I/O as a probable cause (but I have also tried RAID 10 and SATA drives and get a similar result, just less often). However, my feeling is that St gets so bogged down internally with processing, scanning and syncing folders that it no longer has time to update the GUI. Usually, trying to kill St gets an “access is denied” error, so I have to do a hard restart.

But equally, I think St is still busy, as there is often some HDD activity, and given enough time (sometimes as long as 6+ hours) the GUI does come back. So I suspect it’s more likely to happen on installations with lots of sync folders, files and larger volumes of data, including large single files, e.g. 1 TB backups.

I tend to adopt a set-and-forget attitude to St, to allow it to get on with the job.

Please don’t take anything above as being critical. The devs do a great job and must get fed up with us end users moaning about minor things :)

Yeah, but we are talking about an old ADSL connection, which is limited to 1 Mbit/s upload bandwidth, and even that is a stretch. I have no choice but to limit it, just in order to be able to use the Internet for other things.

I have opened an issue on GitHub, and also added a new trace dump there.

You can disable temporary indexes for all folders on that device, and see if that helps. It would be in the advanced config section.
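After changing the setting, it can be handy to double-check which folders still have temporary indexes enabled. Here is a rough Go sketch that reads a config and lists those folders. It assumes that each `<folder>` element in config.xml carries an `id` attribute and a `<disableTempIndexes>` child element; the sample XML and the helper name `foldersWithTempIndexes` are invented for illustration, so check them against your own config.xml.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// folder maps only the parts of a <folder> element needed here
// (assumed layout, see the note above).
type folder struct {
	ID                 string `xml:"id,attr"`
	DisableTempIndexes bool   `xml:"disableTempIndexes"`
}

// config maps the root element; all other content is ignored.
type config struct {
	Folders []folder `xml:"folder"`
}

// foldersWithTempIndexes returns the IDs of folders where
// disableTempIndexes is false or absent.
func foldersWithTempIndexes(data []byte) ([]string, error) {
	var cfg config
	if err := xml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	var ids []string
	for _, f := range cfg.Folders {
		if !f.DisableTempIndexes {
			ids = append(ids, f.ID)
		}
	}
	return ids, nil
}

// sample is made-up data, not a real Syncthing config.
const sample = `<configuration>
  <folder id="photos"><disableTempIndexes>true</disableTempIndexes></folder>
  <folder id="docs"><disableTempIndexes>false</disableTempIndexes></folder>
  <folder id="music"></folder>
</configuration>`

func main() {
	ids, err := foldersWithTempIndexes([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("temporary indexes still enabled for:", ids)
}
```

To use it on a real setup, replace `sample` with the contents of the device’s config.xml, for example read with `os.ReadFile`.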

I have disabled them, but will that help if the device in question mostly uploads files that are either already there or are being newly created by it? Right now, the UI is still responsive. I will have to check again tomorrow, after some time has passed.

Also, just a side note, but the explanation of disableTempIndexes in the Docs does not really explain what it is in plain and easy language. I had to search around to understand what it does, and fortunately stumbled upon your explanation in the forums, which uses much more accessible wording. I have also found a video by calmh, where he explained it very well and in detail. It would be really nice to have this included in the documentation.

Again, this is not a complaint, just an observation. I may later try to do a pull request to the docs repo, although I am also not the best when it comes to writing such documents.

It’s the downloads that exercise that code path: we announce that we have some blocks in a temporary file, and that announcement is the part that can’t get through the 25 KB/s limit.

There might be a similar choke-point on receiving those, but I think the flag controls both sending and receiving.

Documentation is community maintained, so feel free to open a PR improving it in a way you see fit.

Just to confirm, if I have this kind of a setup

[image: setup with devices A, E and F]

should I set disableTempIndexes both on E and F?

A is the one where the GUI locks up. A and E are constantly syncing files with one another. F is mostly static, only syncing a few files once in a while. E and F are also using the same slow Internet connection.

I guess you have to enable it on all of them: A, E and F. You could also try the build off the PR that I’ve opened, to see if it gets better.

Hmm, I will try to check the PR and the other settings later on. So far, I have enabled disableTempIndexes only on A (yesterday), and I have not experienced any lockups since then.

In fact, the GUI at the moment is working super fast. I have not seen it refresh so quickly ever before, that is with all devices connected and folders being synced.

Also, do you mean https://github.com/syncthing/syncthing/pull/6589?

Yes

Thank you. I will apply and test the PR today.

I can also confirm that there have been zero lockups for 2 days, since enabling disableTempIndexes for all folders on Device E only.