Pausing upload leads to busy waiting loop (100% CPU) in star-shaped configuration

sdschulze · March 10, 2022, 1:12am

Hi, I was pointed here from the issue tracker in order to better find out what the root of the problem might be.

I am using Debian stable, and the problem occurs both with the version provided by the distribution as well as the one from the Syncthing repository (that is, 1.19.1).

I am operating three devices (A, B, C) in a star-shaped configuration. That is, A and B know each other and so do B and C, but not A and C. Initially, I put a rather large folder in the default directory (that is shared among A and B as well as among B and C) on device A, but because I noticed that B and C would not have enough disk space, I quickly paused it. However, a few files were already shared. Pausing the sync on device A caused device B to go into a busy waiting loop with 100% CPU usage.

On its web interface, B says that A is up-to-date, and it claims to be syncing with C, but no network traffic is flowing, and there is no progress in the sync process. Device A obviously says that the upload is paused, while C thinks it’s up-to-date. Terminating Syncthing on device C does not change anything, but when terminating Syncthing on device A, it announces on device B: INFO: Folder "Default Folder" (default) isn't making sync progress - retrying in 1m25s. In that stage, it still uses a lot of CPU every few minutes, but at least it is not constantly up.

As I am fairly new to Syncthing (just started today), I do not really understand the dynamics behind this, but the indefinite busy waiting with 100% CPU usage while doing nothing useful certainly does not seem right.

calmh · March 10, 2022, 6:52am

If B started syncing files and you pulled the rug by making all of them unavailable, it’ll need to process (and fail) each one a few times (retrying), until at some point slowing down the retries. I could see this eating some CPU if there are a lot of files.

AudriusButkevicius · March 10, 2022, 8:14am

I guess getting the logs would be useful, but I would not expect this cpu use to be constant, and just be one-off.

sdschulze · March 10, 2022, 12:50pm

If B started syncing files and you pulled the rug by making all of them unavailable, it’ll need to process (and fail) each one a few times (retrying), until at some point slowing down the retries. I could see this eating some CPU if there are a lot of files.

This is what happens when I disconnect A. In this case, it says on B:

13:32:56 INFO: Puller (folder "Default Folder" (default), item "[...]"): syncing: no connected device has the required version of this file
13:32:56 INFO: Puller (folder "Default Folder" (default), item "[...]"): syncing: no connected device has the required version of this file
...
13:32:56 INFO: "Default Folder" (default): Failed to sync 2074 items
13:32:56 INFO: Folder "Default Folder" (default) isn't making sync progress - retrying in 1m20s.

However, when A is connected (but paused), command line output is silent and CPU usage is constant at 100%.

I guess getting the logs would be useful, but I would not expect this cpu use to be constant, and just be one-off.

You mean the command line output? When I connect A, it only says on B:

13:33:33 INFO: Established secure connection to [A] at [...]/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
13:33:33 INFO: [A] client is "syncthing v1.19.1" named "[...]" at [...]/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256
13:33:33 INFO: Failed to exchange Hello messages with [A] at [...]/tcp-server/TLS1.3-TLS_CHACHA20_POLY1305_SHA256: write tcp [...]: write: connection reset by peer

I’m not sure if the last line is relevant, but on the web inferface, it says both A and C are connected and that it’s syncing with C – which it is obviously not doing (how could it, the files are not there and A is paused).

And yes, CPU stays at 100% (on a single core) indefinitely without any further messages in the output.

AudriusButkevicius · March 10, 2022, 1:26pm

Getting full logs from both sides when this happens would be useful.

I think you trying to selectively pick lines is not going to work.

sdschulze · March 10, 2022, 3:05pm

OK, here are the full logs. I replaced the real hostnames with “deviceA/B/C”, respectively, but apart from that, everything is original.

I first started A, then B, then C. When they were all running, I pressed “Resume All” on the web interface of B. All machines have NTP running, so timings should be correct. Sorry I had to put them all in one file, but the forum software does not give me permission to upload 3 files.

syncthing.all.log (9.7 KB)

AudriusButkevicius · March 10, 2022, 8:44pm

I don’t see any error messages you mentioned. I guess I didn’t express myself clearly, but I guess we are interested in the logs at the point the issue happens (and some time after that). It would also be great if you could correlate the timestamps of the CPU spikes with the logs.

sdschulze · March 10, 2022, 10:00pm

I don’t see any error messages you mentioned.

You mean the one where it says Failed to sync 2074 items? That only happens when A is down, and it seems to be doing the right thing in this case, namely waiting for a certain period of time (in this case, 1m20s) to not constantly load the CPU.

The logs are from the scenario that A is up but paused.

I guess I didn’t express myself clearly, but I guess we are interested in the logs at the point the issue happens (and some time after that). It would also be great if you could correlate the timestamps of the CPU spikes with the logs.

The unusual CPU activity on B started at 15:39:19, when I pressed “Resume All” in the web interface on B (with A still being paused). While it seems expected that there would be some CPU activity, it never stopped – without any further messages in the output. I can let it run for a few more hours, but I don’t expect a change.

AudriusButkevicius · March 10, 2022, 10:40pm

In a completely clean setup, with no existing folders and hopefully just a few files, could you try and reproduce this reliably and write the steps down? Ideally on sensibly powered machines?

To me it seems that device B is a very weak device (based on hashing performance) and that once it got the new index, it’s just probably spending time applying and compacting it.

imsodin · March 11, 2022, 8:02am

If there’s a lot of cpu activity but no logs generated, grabbing a cpu profile would be good(or generally if there’s a prolonged time of high cpu): Profiling — Syncthing v1.19.1-8-gaf93ba9 documentation

sdschulze · March 11, 2022, 11:45am

To me it seems that device B is a very weak device (based on hashing performance) and that once it got the new index, it’s just probably spending time applying and compacting it.

This might indeed be the case. It did let A and B run overnight – A is a laptop and B is a NAS – and B did eventually get into the Failed to sync 2074 items state here as well. Does it matter that C was not running during that time? I don’t know, but I will do further experiments.

So, yes, it could just have been a performance issue.

sdschulze · March 11, 2022, 11:47pm

Alright, I changed my setup and connected A directly to C, where C is an old but still somewhat beefy desktop computer (with an i5-2500K). The issue does occur as well, but C only takes a few minutes to get to the Failed to sync point, rather than the hours that it takes for B (which has an Allwinner A20).

To me, this means that the case is closed and that it is a performance issue rather than a bug. If you have any further questions, feel free to contact me.

system · April 10, 2022, 11:47pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.