Unfortunately this was just an isolated occurrence. It might be linked to the WAL: it was very small when this happened, whereas when the long stop/start happened again the WAL was several gigabytes.
There’s still one major thing to do in this branch to limit the DB load: the deleted files cleanup is still too slow, at least 600ms on our largest folder. Using pagination to process it in steps is possible; if I don’t find a better approach I will paginate on the sequence primary key, which would benefit from the index on it.
I think the effort spent on the cleanups should be more dynamic. For example, you might want to speed things up when the Folder is idle and slow down when the database is already under heavy load (I’ve seen the cleanups delayed by the updateLock for more than 15 seconds on our largest folder; it might be better to back off under those conditions). Over the 8 hours available to do a full cleanup, both states are possible, and even very likely several times, so adapting would probably help keep the total load on large systems under control. That may be a bit of over-engineering, so I’ll probably wait to see whether there’s a need for it, especially if the code needed isn’t trivial.
Progress status
Current code behavior
Small instances
On small instances the behavior is similar to mainline: the cleanups are done in a single pass most of the time (after an initial ramp-up following startup to detect the appropriate pagination).
Large instances
Our largest folder in the “Preparing to Sync” state continues to make noticeably faster progress:
- needFiles:
  - mainline’s average speed was around 200,000 files/day,
  - our branch progresses at more than 3,000,000 files/day;
- bytes to sync:
  - mainline’s average speed was around 400 GiB/day,
  - our branch is running at around 900 GiB/day.
For both metrics, mainline alternated 8 hours of progress with ~12 hours of stagnation; our branch progresses continuously.
Current objectives
- Keep letting users adapt the cleanup effort to their local situation. This is a bit tricky, as some situations require hard compromises (how long Syncthing can wait for a cleanup, how long before data can be cleaned, how much battery we use by waking Syncthing on mobiles, …),
- Continue to avoid blocking Syncthing for several seconds and, if possible, for more than 250ms,
- Try to use more information about the current Folder state to decide the rate at which the cleanups should proceed, so as to benefit from periods of low DB usage and slow down while the rest of the code uses the DB heavily. This should improve performance overall.
Plans for tunables
I usually try to refrain from adding too many tunables. If I can make the software decide something that works by itself, I prefer that to providing a tunable that could help users shoot themselves in the foot. But some tunables might end up being necessary. Here is what I have in mind.
Existing DBMaintenanceInterval
I think the DBMaintenanceInterval value should be interpreted as a target for the time taken to clean up the DB, and no longer as an interval between one-shot cleanups. The switch from 8 hours to 5 minutes in my PR, meant to reflect the interval between incremental cleanups, raised a question because it wasn’t clear, so keeping 8 hours and adapting the incremental cleanup interval to reach it is probably better.
But this would be a target that could be missed on the largest folders: unless stalling the sync is an option, the best we can do there is log that we couldn’t process the cleanups fast enough.
Incremental Cleanup DB usage limitation
The arbitrary 250ms target for each incremental cleanup’s duration might end up as a tunable, but for now I don’t see a need for significantly different values. It allows work to get done, is more than enough to do full-table cleanups in a single pass on small folders, and doesn’t prevent Syncthing’s sync process from making progress (i.e. no hours-long stalls on large folders). Coupled with keeping the existing DBMaintenanceInterval value, it has the nice advantage of not measurably changing the cleanup process for the huge majority of users.
Minimum delay between incremental cleanups
For the minimum interval between incremental cleanups I’ll hard-code a constant for now, but if it isn’t appropriate on some installations we could make it a tunable too. This would be a hard minimum; a soft minimum could additionally be adjusted to react to load, for example.
TODO
Here is what I believe still needs to be done before considering the PR for inclusion:
- Fix the slow queries in the deleted files cleanup,
- Remove the periodicCheckpointLocked-related code (it is already inactive): it is not functional under load and essentially addresses the same objective that tidy addresses differently,
- Clean up the code and revert the meaning of DBMaintenanceInterval to be equivalent to mainline’s.