Syncthing suddenly slowing down

I am currently experimenting with Syncthing and am therefore testing the exact same setup several times. I copy an 18 GB folder from one machine to 5 remote devices and log the completion time.

This process reliably takes between 1400 and 1550 seconds. But twice now it has happened that a test run took much longer (9350 s), even though nothing changed in the configuration. I repeated the runs and got the same slow speed. The only way to get back to the previous speed was to restart the Syncthing daemons on all devices.
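
For reference, a rough sketch of how completion per remote device can be logged by polling the REST API (the API key, address, folder ID and device IDs below are placeholders):

```python
# Sketch: poll remote completion via the Syncthing REST API and log when each
# device reaches 100%. API key, address, folder ID and device IDs are placeholders.
import json
import time
import urllib.request

API_KEY = "REPLACE_ME"            # GUI/API key of the sending instance
BASE = "http://localhost:8384"    # GUI/API address of the sending instance
FOLDER = "test-folder"            # folder ID of the synced folder
DEVICES = ["DEVICE-ID-1", "DEVICE-ID-2"]  # remote device IDs

def completion(device):
    url = f"{BASE}/rest/db/completion?folder={FOLDER}&device={device}"
    req = urllib.request.Request(url, headers={"X-API-Key": API_KEY})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["completion"]

start = time.time()
pending = set(DEVICES)
while pending:
    for dev in list(pending):
        if completion(dev) >= 100:
            print(f"{dev} completed after {time.time() - start:.0f} s")
            pending.discard(dev)
    time.sleep(10)
```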

Are there any known errors that can produce this sudden reduction in transfer speed?

Edit: I have audit logging enabled and the logs grow fairly large, to about 400 MB, but that should not explain the sudden drop.

Not that I know of.

Potentially the database slows down or something along those lines, and a restart triggers compaction.

With many small files there are a ton of things that could go wrong.


Those 18 GB consist of 65,000 files, so we are definitely talking about small files.

Are there any advanced configuration settings that could help me here, or should I just implement a daily restart?

I don’t think there is any setting to help. Do you have ignores set up? Over what period does the slowdown happen?

I do not have any ignores.

I am not completely sure, but I think the sudden slowdown appeared after 3-5 test runs of about 25 minutes each. I mostly cancelled those slow tests, unshared the folder, and deleted it on the remote devices.

The same thing happened to me yesterday and the day before, with about 3-5 normal test runs each day.

Should I delete the database after each testrun?

I think you should do some normal performance troubleshooting. Figure out what your normal limiting factor is - is it CPU, disk I/O, network? What does this look like when it goes quickly, and when it goes slowly? Does the limiting factor change? Is it memory usage? If it’s I/O, is it file or database? If CPU, we can grab some profiles. Etc, etc.

There’s no setting that you can change to make things go magically faster when there is a slowdown, or we would make that setting the default. We need to figure out the source of the slowdown.
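
As a starting point, something along these lines could record the transfer rate and memory use over time, so that a fast run can be compared with a slow one (a rough sketch; address and API key are placeholders):

```python
# Sketch: sample the total transfer counters and memory use every 30 s.
# Comparing the output of a fast run and a slow run should show when and
# how the slowdown sets in. Address and API key are placeholders.
import json
import time
import urllib.request

API_KEY = "REPLACE_ME"
BASE = "http://localhost:8384"

def get(path):
    req = urllib.request.Request(BASE + path, headers={"X-API-Key": API_KEY})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

prev_out, prev_t = None, None
while True:
    total = get("/rest/system/connections")["total"]
    status = get("/rest/system/status")
    now = time.time()
    if prev_out is not None:
        rate = (total["outBytesTotal"] - prev_out) / (now - prev_t)
        print(f"{time.strftime('%H:%M:%S')} out={rate / 1024:.0f} KiB/s "
              f"alloc={status['alloc'] / 1048576:.0f} MiB "
              f"sys={status['sys'] / 1048576:.0f} MiB")
    prev_out, prev_t = total["outBytesTotal"], now
    time.sleep(30)
```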

Sure, I will look into it. I do not believe it is CPU or RAM, since the machines in the datacenter are quite powerful and basically idle during my test runs, but I will test a bit when the issue arises again.

I tested a bit when the issue arose again:

  1. All machines slow down at about the same time. I therefore thought it might be related to the config and database folders, which are all placed on the same centralized filer, but I observed the opposite: when Syncthing is slow, there is almost no traffic on the filer.

  2. When one receiving machine is restarted, it receives the data at the usual speed. This suggests the problem is not with the sending machine but with the receiving machines.

  3. So far I have only noticed the problem on Linux machines.

  4. Neither CPU, memory, nor network usage is high on any of the machines.

  5. A simple restart always fixes the problem without deleting the database, config or log files.

I still do not know why the slowdown occurs. I will probably do a daily restart now to work around the issue.

Restart of Syncthing or reboot?

restart of Syncthing
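
For what it’s worth, a daily Syncthing restart can be triggered through the REST API, e.g. from cron; a rough sketch (address and API key are placeholders):

```python
# Sketch: ask a Syncthing instance to restart itself via its REST API.
# Intended to be run once per day from cron or a systemd timer.
# Address and API key are placeholders.
import urllib.request

API_KEY = "REPLACE_ME"
BASE = "http://localhost:8384"

req = urllib.request.Request(
    BASE + "/rest/system/restart",
    method="POST",
    headers={"X-API-Key": API_KEY},
)
with urllib.request.urlopen(req) as resp:
    print("restart requested, HTTP status", resp.status)
```

If the daemon runs as a systemd service, restarting that service unit would work just as well.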

Odd.

I want to look into this further, because I fear that it also appears during my production syncs and not only during my test runs.

It is 100% reproducible that the download and upload rates increase by a factor of up to 1000 when I restart the clients during a slowed-down test run. It also works to restart the clients immediately before the folder is shared.

Which debug logging options could I activate to get more information about the reason for the slowdown?
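
For context, debug facilities are normally selected with the STTRACE environment variable; a minimal sketch that starts an instance with the model, protocol and connections facilities enabled (assuming the syncthing binary is on the PATH):

```python
# Sketch: start Syncthing with a few debug facilities enabled via STTRACE.
# Facility names are comma-separated; which ones are relevant for this
# particular slowdown is the open question.
import os
import subprocess

env = dict(os.environ, STTRACE="model,protocol,connections")
# Assumes the syncthing binary is on the PATH; add your usual flags as needed.
subprocess.run(["syncthing"], env=env, check=True)
```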

Do you have rate limits enabled in general? I can help you debug this by producing some custom builds.
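
If you want to double-check, the limits are part of the configuration and can be read over the REST API; a rough sketch (the endpoint is /rest/config on current versions and /rest/system/config on older ones; address and API key are placeholders):

```python
# Sketch: read the configured rate limits over the REST API. A value of 0
# means no limit. On older versions the endpoint is /rest/system/config.
import json
import urllib.request

API_KEY = "REPLACE_ME"
BASE = "http://localhost:8384"

req = urllib.request.Request(BASE + "/rest/config",
                             headers={"X-API-Key": API_KEY})
with urllib.request.urlopen(req) as resp:
    cfg = json.load(resp)

opts = cfg["options"]
print("global send/recv limits:", opts["maxSendKbps"], opts["maxRecvKbps"])
for dev in cfg["devices"]:
    print(dev["deviceID"], "per-device limits:",
          dev.get("maxSendKbps", 0), dev.get("maxRecvKbps", 0))
```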

No I don’t.

I am not sure if I can work with custom builds, since Syncthing is deeply integrated into our scripts and I do not want to use custom builds on the production system.

I am also not very comfortable sharing logs without cleaning them, because of sensitive file names.

Right, so I don’t really see how we can help you if we can’t get the logs and you can’t run custom builds.

Maybe you could start by answering my question about which debug logging options are relevant.

I did not mean that I cannot share any logs. I just need to make sure that they are sanitized by creating a test directory with different file names.

The relevant debugging in this case will need a custom build. I suspect this could be caused by the limiter, but there is no logging around that area that would help me prove my suspicion.

Ok, thanks. I did not understand that completely.

I will see if I can create a setup that runs the custom build separately from our production system.

If we suspect the limiter might be at fault, run without limits for a while and see what happens?

Another thing that could be interesting… When you see the slowdown, don’t restart. Instead pause and resume the device, causing just a connection reset. See if this makes a difference. If it does, it narrows down the amount of internal state that could be at fault.
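
A pause/resume cycle like that can also be scripted over the REST API; a rough sketch (address, API key and device ID are placeholders):

```python
# Sketch: pause and resume a remote device over the REST API, forcing a
# connection reset without restarting Syncthing. Address, API key and
# device ID are placeholders.
import time
import urllib.request

API_KEY = "REPLACE_ME"
BASE = "http://localhost:8384"
DEVICE = "DEVICE-ID-OF-A-SLOW-RECEIVER"

def post(path):
    req = urllib.request.Request(BASE + path, method="POST",
                                 headers={"X-API-Key": API_KEY})
    urllib.request.urlopen(req).close()

post(f"/rest/system/pause?device={DEVICE}")
time.sleep(5)   # give the connection a moment to drop
post(f"/rest/system/resume?device={DEVICE}")
```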