Syncthing reproducibly slowing down

Hi everybody, first of all I want to thank you for this great tool and all the effort that has gone into it! I am using Syncthing in our company to sync a large number (150k) of relatively small (4 MB) files from 20+ devices to one server. So far we have been running one device that handled even more clients, but it recently started to have performance issues. This is why we upgraded to a new, more capable server, which now has a weird issue. After I start Syncthing the connection speed is fine and I am syncing with all my clients at the speed they are able to upload at. However, after about 10 minutes things change and syncing comes to an almost complete halt. After a restart everything is back to normal speed, but the issue keeps coming back after around the same time. I am happy to post logs and use custom builds to understand this issue.

There are some known issues with the latest version. Try a release candidate and see if that helps.

Look at resource usage (are you running out of CPU? RAM? Disk I/O?) and possibly grab some profiles or take a support bundle (enable debugging as described in that link, then click the new item in the actions menu) when things grind to a halt, or even better take periodic profiles to show what’s happening over time.

So after following Audrius' hint and installing the pre-release version, things are behaving slightly differently. Speeds are still highest right after a restart, then drop, and ramp up again after an hour or so. Resources are not the issue, everything seems to be idling. I grabbed a profile using the support bundle while it was sitting at 300kb/s instead of 20mb/s download speed. However, while it was grabbing the profile things changed again and the speed ramped up to about 20mb/s. But as I am writing this it is already slowing down again, so there seems to be something wrong. This is the support file:

Right… So a quick glance doesn’t show anything odd to me. CPU wise it was busy scanning at precisely the moment you took the profile, RAM wise it was swamped by loading and packing the huge log file (there’s been a few complaints about sync failures). If it’s a not-very-fast device I could see scanning on one folder impacting transfers on another.

In the log there’s a few disconnections due to errors/timeouts, but nothing clearly wrong that I could see.

So I was trying to catch the right time frame where sync speeds remain low and grabbed a profile again. This time speeds stayed low until the very end, when things started to speed up again. However, it almost seems like it is stuck on something and triggering the profile gets it going again. Maybe you can see something in there. I would be surprised if speed was an issue. I really appreciate your help with this. If there is no real clue as to what is going on, can you tell me how to proceed? Delete the folders and share them again?

Also, the log file seems to be growing too fast, doesn’t it? Today it was at 6 GB after I had deleted it a few days ago. And it is already at 1.5 GB after about 3 hours today. If the disconnects cause these huge log files I don’t know what to do, because I can’t really get rid of peers going away temporarily. They are connected over a mobile network, so the connection will remain sketchy.

Good to know, that means at least the disconnects are expected. Can you give approximate times when the speed was good and when it was not, so I can cross-reference that with the logs in the support bundle?

They do. I think the verbosity in the puller (syncing) is over the top (one line for every file with an error, which means all on disconnect). There should be some smart filtering.

Hi, I will generate another support bundle and keep track of the speeds from start until generating it. I will post this tomorrow though, because I have to leave now.

The logs aren’t just from the time you take the support bundle, they go back quite some time.

I took another support package and logged network download after restarting until taking the package. Speed is displayed in Mbit/s.

support-bundle-DZB2QKE-2019-08-22T172038.zip (2.5 MB)

Looks like a CPU graph. Which probably correlates quite well with network activity, though.

Sorry, will check again. It says CPU underneath it, you are right… :sweat_smile:

And another try! This time with network speed…

support-bundle-DZB2QKE-2019-08-23T160854.zip (4.8 MB)

The increase in incoming bandwidth usage correlates quite well with the connection state of the device F6CY5SM:

[DZB2Q] 15:37:13 INFO: Connection to F6CY5SM-A7VX4ZK-XFQENVV-ZWCQ2IA-VUSKSIK-EDFO6BP-Z7BDSQA-5JJVGQH at 192.168.1.89:22009-89.204.154.70:12110/tcp-server/TLS1.2-TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305 closed: reading message: read tcp 192.168.1.89:22009->89.204.154.70:12110: wsarecv: Ein Verbindungsversuch ist fehlgeschlagen, da die Gegenstelle nach einer bestimmten Zeitspanne nicht richtig reagiert hat, oder die hergestellte Verbindung war fehlerhaft, da der verbundene Host nicht reagiert hat.
[...]
[DZB2Q] 15:38:09 INFO: Device F6CY5SM-A7VX4ZK-XFQENVV-ZWCQ2IA-VUSKSIK-EDFO6BP-Z7BDSQA-5JJVGQH client is "syncthing v1.2.1" named "NETBOX84" at 192.168.1.89:22009-46.114.7.151:53340/tcp-server/TLS1.2-TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
[...]
[DZB2Q] 15:57:38 INFO: Connection to F6CY5SM-A7VX4ZK-XFQENVV-ZWCQ2IA-VUSKSIK-EDFO6BP-Z7BDSQA-5JJVGQH at 192.168.1.89:22009-46.114.7.151:53340/tcp-server/TLS1.2-TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305 closed: reading message: read tcp 192.168.1.89:22009->46.114.7.151:53340: wsarecv: Ein Verbindungsversuch ist fehlgeschlagen, da die Gegenstelle nach einer bestimmten Zeitspanne nicht richtig reagiert hat, oder die hergestellte Verbindung war fehlerhaft, da der verbundene Host nicht reagiert hat.
[...]
[DZB2Q] 15:57:38 INFO: Device F6CY5SM-A7VX4ZK-XFQENVV-ZWCQ2IA-VUSKSIK-EDFO6BP-Z7BDSQA-5JJVGQH client is "syncthing v1.2.1" named "NETBOX84" at 192.168.1.89:22009-2.247.254.232:33120/tcp-server/TLS1.2-TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
[...]
[DZB2Q] 16:06:19 INFO: Connection to F6CY5SM-A7VX4ZK-XFQENVV-ZWCQ2IA-VUSKSIK-EDFO6BP-Z7BDSQA-5JJVGQH at 192.168.1.89:22009-2.247.254.232:33120/tcp-server/TLS1.2-TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305 closed: reading length: read tcp 192.168.1.89:22009->2.247.254.232:33120: wsarecv: Eine vorhandene Verbindung wurde vom Remotehost geschlossen.
[...]
[DZB2Q] 16:07:18 INFO: Device F6CY5SM-A7VX4ZK-XFQENVV-ZWCQ2IA-VUSKSIK-EDFO6BP-Z7BDSQA-5JJVGQH client is "syncthing v1.2.1" named "NETBOX84" at 192.168.1.89:22009-2.247.254.232:45037/tcp-server/TLS1.2-TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
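
To line the connection state up against a bandwidth graph, the connect and disconnect lines in an excerpt like this can be turned into a list of outages. The following is only a rough sketch: it assumes each log entry sits on a single line, that all entries are from the same day (the log only carries a time of day), and that the two INFO messages look exactly like the ones quoted here. The names `parse_events` and `outages` are made up for this example.

```python
import re
from datetime import datetime

# Assumed line shapes, taken from the excerpt: "Connection to <ID> at ... closed: ..."
# marks a disconnect, "Device <ID> client is ..." marks a (re)connect.
PREFIX = r"^\[\w+\] (?P<time>\d{2}:\d{2}:\d{2}) INFO: "
DEVICE = r"(?P<device>[A-Z0-9]{7})(?:-[A-Z0-9]{7}){7}"
DOWN_RE = re.compile(PREFIX + r"Connection to " + DEVICE + r" at .* closed: ")
UP_RE = re.compile(PREFIX + r"Device " + DEVICE + r" client is ")


def parse_events(lines):
    """Return (time, short device ID, 'down' or 'up') for every matching line."""
    events = []
    for line in lines:
        for state, pattern in (("down", DOWN_RE), ("up", UP_RE)):
            m = pattern.match(line)
            if m:
                t = datetime.strptime(m.group("time"), "%H:%M:%S")
                events.append((t, m.group("device"), state))
                break
    return events


def outages(events):
    """Return (short device ID, seconds offline) for each down/up pair."""
    went_down = {}
    result = []
    for t, device, state in events:
        if state == "down":
            went_down[device] = t
        elif device in went_down:
            gap = t - went_down.pop(device)
            result.append((device, gap.total_seconds()))
    return result
```

Fed with the six lines above (each on one line), `outages(parse_events(lines))` reports three outages for F6CY5SM of 56, 0 and 59 seconds.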

So it seems the transmission rates are fine, i.e. they correspond with devices disconnecting. What is somewhat problematic, but expected, is that connection handling in Syncthing is quite slow: all connections go through a single bottleneck and the connection timeout is 5 min. It looks like a dropped connection needs to time out before it is closed, i.e. the writes Syncthing does at a shorter interval to a dropped connection just block, but do not produce an error.
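
As an illustration of why a silently dropped connection is only noticed after a timeout, here is a small generic sketch. It is not Syncthing's code (Syncthing is written in Go), and `peer_alive` is a hypothetical helper: it only shows that a read on a silent peer produces no error by itself, so the reader has to give up after a deadline.

```python
import socket


def peer_alive(conn, timeout_s):
    """Return True if the peer sends anything within timeout_s seconds.

    A peer that vanished without closing the connection produces no error;
    the read simply waits, so only the timeout reveals that it is gone.
    Note: this consumes one byte from the stream, which is fine for a demo
    but not for a real protocol.
    """
    conn.settimeout(timeout_s)
    try:
        data = conn.recv(1)
    except socket.timeout:
        return False  # nothing arrived before the deadline
    return bool(data)  # empty bytes means the peer closed cleanly
```

With a timeout of 300 seconds, as mentioned above, a check like this would take up to five minutes to notice a mobile peer that dropped off the network without closing the connection.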

But then why do all connections drop at the same time? At the moment the server is again sitting idle without any incoming traffic. The connected devices show it as disconnected. Then, when restarting Syncthing on the server, everything works just fine.

No idea. Enabling the connections debug facility (via the STTRACE environment variable or in the UI under Actions > Logs) would add more information about what’s going on with connection handling.
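
For the environment variable route, a minimal sketch of a launcher could look like the following. It assumes the `syncthing` binary is on the PATH and that STTRACE takes a comma-separated list of facility names; the helper names `env_with_trace` and `run_syncthing_traced` are invented for this example.

```python
import os
import subprocess


def env_with_trace(facilities, base_env=None):
    """Return a copy of the environment with STTRACE set to the given facilities."""
    env = dict(os.environ if base_env is None else base_env)
    env["STTRACE"] = ",".join(facilities)
    return env


def run_syncthing_traced(binary="syncthing", facilities=("connections",)):
    """Start Syncthing with debug tracing enabled; blocks until it exits."""
    return subprocess.run([binary], env=env_with_trace(facilities))
```

Calling `run_syncthing_traced()` starts Syncthing with `STTRACE=connections`. Expect the log to grow even faster while tracing is on.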

The thing is, your latest support bundle shows that things also go back to working without a restart. As I wrote above, detecting a dead connection might be slow, so a restart will help (as new connections are established on start).

Currently I am seeing disconnections in the logs, and the only reason I know of for that is your statement that connections are sketchy. So I have currently no pointers at anything that’s wrong (which is not saying that there is nothing wrong).

Hi, just wanted to give an update:

I am now on Version 1.2.2 with the same behaviour. I tried:

  • going back to Version 1.1.2 - still the same

  • using SyncTrayzor (because when the sync speed dropped, I saw that the web GUI was also frozen) - still the same

  • installing a new network card (because resetting the network connection in Windows gave an instant speedup again) - still the same

Don’t know if this information is of any value, but the sync speed seems to be directly related to the responsiveness of any GUI (web or SyncTrayzor). Oh, and the sketchy connection to the other devices is no big issue on another server I am running.
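
One way to put numbers on that relation would be to poll the REST API at a fixed interval and log both how long the API takes to answer and how fast data is coming in. This is a rough sketch, assuming the GUI listens on the default `http://127.0.0.1:8384`, that an API key from the GUI settings is passed in, and that `/rest/system/connections` reports a cumulative `inBytesTotal` under `total`. The function names are made up for this example.

```python
import json
import time
import urllib.request


def timed_get(url, api_key, timeout_s=30.0):
    """GET a REST endpoint; return (parsed JSON body, seconds the call took)."""
    req = urllib.request.Request(url, headers={"X-API-Key": api_key})
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body, time.monotonic() - start


def mbit_per_s(bytes_before, bytes_after, seconds):
    """Turn a byte-counter delta over an interval into Mbit/s."""
    return (bytes_after - bytes_before) * 8 / seconds / 1_000_000


def log_speed_and_latency(base_url, api_key, interval_s=10.0, samples=360):
    """Print one line per sample: time of day, API latency, inbound rate."""
    url = base_url + "/rest/system/connections"
    prev_bytes, prev_t = None, None
    for _ in range(samples):
        stamp = time.strftime("%H:%M:%S")
        try:
            body, latency = timed_get(url, api_key)
        except OSError as exc:  # covers timeouts and refused connections
            print(f"{stamp} API did not answer: {exc}")
            time.sleep(interval_s)
            continue
        now = time.monotonic()
        total_in = body["total"]["inBytesTotal"]
        if prev_bytes is not None:
            rate = mbit_per_s(prev_bytes, total_in, now - prev_t)
            print(f"{stamp} latency={latency:.2f}s in={rate:.2f} Mbit/s")
        prev_bytes, prev_t = total_in, now
        time.sleep(interval_s)
```

Running `log_speed_and_latency("http://127.0.0.1:8384", "<your API key>")` from start until the slowdown would give a table that can be compared directly against the disconnect times in the log. The counter restarts from zero when Syncthing restarts, so the first sample after a restart will show a negative rate and should be ignored.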