Pausing upload leads to busy waiting loop (100% CPU) in star-shaped configuration

Hi, I was pointed here from the issue tracker in order to better find out what the root of the problem might be.

I am using Debian stable, and the problem occurs both with the version provided by the distribution as well as the one from the Syncthing repository (that is, 1.19.1).

I am operating three devices (A, B, C) in a star-shaped configuration. That is, A and B know each other and so do B and C, but not A and C. Initially, I put a rather large folder in the default directory (that is shared among A and B as well as among B and C) on device A, but because I noticed that B and C would not have enough disk space, I quickly paused it. However, a few files were already shared. Pausing the sync on device A caused device B to go into a busy waiting loop with 100% CPU usage.

On its web interface, B says that A is up-to-date, and it claims to be syncing with C, but no network traffic is flowing, and there is no progress in the sync process. Device A obviously says that the upload is paused, while C thinks it’s up-to-date. Terminating Syncthing on device C does not change anything, but when terminating Syncthing on device A, it announces on device B: INFO: Folder "Default Folder" (default) isn't making sync progress - retrying in 1m25s. In that stage, it still uses a lot of CPU every few minutes, but at least it is not constantly up.

As I am fairly new to Syncthing (just started today), I do not really understand the dynamics behind this, but the indefinite busy waiting with 100% CPU usage while doing nothing useful certainly does not seem right.

If B started syncing files and you pulled the rug by making all of them unavailable, it’ll need to process (and fail) each one a few times (retrying), until at some point slowing down the retries. I could see this eating some CPU if there are a lot of files.

I guess getting the logs would be useful, but I would not expect this cpu use to be constant, and just be one-off.

If B started syncing files and you pulled the rug by making all of them unavailable, it’ll need to process (and fail) each one a few times (retrying), until at some point slowing down the retries. I could see this eating some CPU if there are a lot of files.

This is what happens when I disconnect A. In this case, it says on B:

13:32:56 INFO: Puller (folder "Default Folder" (default), item "[...]"): syncing: no connected device has the required version of this file
13:32:56 INFO: Puller (folder "Default Folder" (default), item "[...]"): syncing: no connected device has the required version of this file
...
13:32:56 INFO: "Default Folder" (default): Failed to sync 2074 items
13:32:56 INFO: Folder "Default Folder" (default) isn't making sync progress - retrying in 1m20s.

However, when A is connected (but paused), command line output is silent and CPU usage is constant at 100%.

I guess getting the logs would be useful, but I would not expect this cpu use to be constant, and just be one-off.

You mean the command line output? When I connect A, it only says on B:

13:33:33 INFO: Established secure connection to [A] at [...]/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
13:33:33 INFO: [A] client is "syncthing v1.19.1" named "[...]" at [...]/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
13:33:33 INFO: Failed to exchange Hello messages with [A] at [...]/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: write tcp [...]: write: connection reset by peer

I’m not sure if the last line is relevant, but on the web inferface, it says both A and C are connected and that it’s syncing with C – which it is obviously not doing (how could it, the files are not there and A is paused).

And yes, CPU stays at 100% (on a single core) indefinitely without any further messages in the output.

Getting full logs from both sides when this happens would be useful.

I think you trying to selectively pick lines is not going to work.

OK, here are the full logs. I replaced the real hostnames with “deviceA/B/C”, respectively, but apart from that, everything is original.

I first started A, then B, then C. When they were all running, I pressed “Resume All” on the web interface of B. All machines have NTP running, so timings should be correct. Sorry I had to put them all in one file, but the forum software does not give me permission to upload 3 files.

syncthing.all.log (9.7 KB)

I don’t see any error messages you mentioned. I guess I didn’t express myself clearly, but I guess we are interested in the logs at the point the issue happens (and some time after that). It would also be great if you could correlate the timestamps of the CPU spikes with the logs.

I don’t see any error messages you mentioned.

You mean the one where it says Failed to sync 2074 items? That only happens when A is down, and it seems to be doing the right thing in this case, namely waiting for a certain period of time (in this case, 1m20s) to not constantly load the CPU.

The logs are from the scenario that A is up but paused.

I guess I didn’t express myself clearly, but I guess we are interested in the logs at the point the issue happens (and some time after that). It would also be great if you could correlate the timestamps of the CPU spikes with the logs.

The unusual CPU activity on B started at 15:39:19, when I pressed “Resume All” in the web interface on B (with A still being paused). While it seems expected that there would be some CPU activity, it never stopped – without any further messages in the output. I can let it run for a few more hours, but I don’t expect a change.

In a completely clean setup, with no existing folders and hopefully just a few files, could you try and reproduce this reliably and write the steps down? Ideally on sensibly powered machines?

To me it seems that device B is a very weak device (based on hashing performance) and that once it got the new index, it’s just probably spending time applying and compacting it.

If there’s a lot of cpu activity but no logs generated, grabbing a cpu profile would be good(or generally if there’s a prolonged time of high cpu): Profiling — Syncthing v1.19.1-8-gaf93ba9 documentation

To me it seems that device B is a very weak device (based on hashing performance) and that once it got the new index, it’s just probably spending time applying and compacting it.

This might indeed be the case. It did let A and B run overnight – A is a laptop and B is a NAS – and B did eventually get into the Failed to sync 2074 items state here as well. Does it matter that C was not running during that time? I don’t know, but I will do further experiments.

So, yes, it could just have been a performance issue.

Alright, I changed my setup and connected A directly to C, where C is an old but still somewhat beefy desktop computer (with an i5-2500K). The issue does occur as well, but C only takes a few minutes to get to the Failed to sync point, rather than the hours that it takes for B (which has an Allwinner A20).

To me, this means that the case is closed and that it is a performance issue rather than a bug. If you have any further questions, feel free to contact me.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.