Is Syncthing v2 with very large folders possible ?

Speaking about checkpoints, I’m more and more convinced that the auto checkpointer is a good thing. Otherwise we need to deal with some kind of checkpoint heuristic everywhere. A simple timer-based approach is perfectly fine to enforce WAL restart or truncation, but far too coarse to replace the auto mechanism.

The other thing that’s a bit strange is our usage of journal_size_limit. In essence it’s supposed to be a soft limit that is enforced when the WAL is empty. But we seem to only set it as part of our maintenance and the limit is so tiny that we don’t really profit from reusing what’s left after truncation. Even worse, we set it prior to invoking wal_checkpoint(TRUNCATE) which explicitly truncates the WAL to 0 bytes.

IMHO we should set a reasonable limit (e.g. 32 MB) as part of our connection setup PRAGMAs. This is still low enough to avoid wasting lots of disk space, but will lead to better reuse of the WAL file, since repeatedly shrinking and regrowing a file hurts performance.
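As a sketch of what those connection-setup PRAGMAs could look like (Python’s stdlib sqlite3 here purely for illustration, not Syncthing’s actual Go code; 32 MiB = 33554432 bytes) :

```python
import os
import sqlite3
import tempfile

# Sketch of the proposed connection-setup PRAGMAs. The file name is
# arbitrary; only the PRAGMA calls matter.
db_path = os.path.join(tempfile.mkdtemp(), "index.db")
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode = WAL")             # Syncthing runs in WAL mode
conn.execute("PRAGMA journal_size_limit = 33554432")  # 32 MiB soft cap on the WAL
limit = conn.execute("PRAGMA journal_size_limit").fetchone()[0]
print(limit)  # 33554432
```

Set this way, the limit applies for the lifetime of the connection, so every checkpoint benefits from it, not only the maintenance path.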

The auto checkpointer is active by default.

So my current understanding is that the only real problem is the truncate, which is mostly non-functional when it is triggered on instances with active readers, unless busy_timeout is set.
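A small illustration of that failure mode (Python stdlib sqlite3, not Syncthing’s actual code) : a TRUNCATE checkpoint issued while a read transaction is open reports itself as blocked instead of truncating.

```python
import os
import sqlite3
import tempfile

# wal_checkpoint(TRUNCATE) cannot complete while a read transaction pins a
# snapshot in the WAL. timeout=0 disables Python's default 5 s busy handler,
# matching a connection without busy_timeout.
db = os.path.join(tempfile.mkdtemp(), "t.db")
w = sqlite3.connect(db, isolation_level=None, timeout=0)
w.execute("PRAGMA journal_mode = WAL")
w.execute("CREATE TABLE t (x)")
w.execute("INSERT INTO t VALUES (1)")

r = sqlite3.connect(db, isolation_level=None)
r.execute("BEGIN")
r.execute("SELECT count(*) FROM t").fetchone()  # reader now pins a WAL snapshot

busy, _, _ = w.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
print(busy)  # 1 : blocked by the reader, WAL not truncated

r.execute("COMMIT")  # release the snapshot
busy2, _, _ = w.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
print(busy2)  # 0 : truncation succeeded
```

With a busy_timeout set, the checkpointer would instead wait up to that long for readers to move on before giving up.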

journal_size_limit usage is a bit odd, on my branch it is still set when opening the DB but not anywhere else.

I completely deactivated the periodicCheckpointLocked calls that did what you described (journal_size_limit then truncate). From my point of view this was added complexity for no benefit I could find and probably a waste of resources on busy instances.

I’ve restarted our large instance with busy_timeout and with updateLock disabled, as I couldn’t make the code break (no SQLITE_BUSY has been returned) on my laptop.

I’ve yet to study this more but my preliminary opinion is that journal_size_limit and truncate are redundant when auto_checkpoint is active.

The ideal value for journal_size_limit is probably dependent on the activity : you don’t want too small a value on busy instances, forcing readers to stop temporarily while the checkpointer does its job.

In theory you could use a NOOP checkpoint to probe the WAL, get an idea of the rate at which it fills up, and adjust journal_size_limit to keep the reader freezes at an acceptable frequency. But it’s a bit tricky, and I wouldn’t go down this path unless there’s a measurable performance problem from not tuning journal_size_limit.
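SQLite doesn’t expose a pure no-op checkpoint through the pragma interface, but the fill rate can be approximated cheaply by sampling the size of the -wal file (a sketch in Python’s stdlib sqlite3; a PASSIVE wal_checkpoint also reports the WAL frame count in its second result column, but it does real checkpoint work) :

```python
import os
import sqlite3
import tempfile

# Sketch of the probing idea : sample the -wal file size over time to
# estimate how fast it fills up between checkpoints.
db = os.path.join(tempfile.mkdtemp(), "t.db")
conn = sqlite3.connect(db, isolation_level=None)
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("CREATE TABLE t (x)")

def wal_size(path):
    wal = path + "-wal"
    return os.path.getsize(wal) if os.path.exists(wal) else 0

before = wal_size(db)
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
after = wal_size(db)
print(after > before)  # True : the commits appended frames to the WAL
```

Sampling this delta at a fixed interval gives a bytes-per-second growth rate without touching the database itself.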

SQLite lacks a bit in the performance monitoring department but its design doesn’t make it easy for their devs.

The limit is 100% soft. Transactions are free to grow the WAL way beyond it. It’s only relevant once the checkpointer needs to decide what to do with the file after finishing its job.

With our current limit of 8MB you’ll end up pointlessly truncating and regrowing the WAL file.

For tiny databases it’s also not much of a problem as we’re using a truncating checkpoint as part of our maintenance. If we never grow the WAL anywhere near the limit, it doesn’t really matter.

According to the documentation of journal_size_limit :

Each time a transaction is committed or a WAL file resets, SQLite compares the size of the rollback journal file or WAL file left in the file-system to the size limit set by this pragma and if the journal or WAL file is larger it is truncated to the limit.

I understand this as : when any modification to the DB is finished (single query done or transaction closed), the checkpointer code is automatically called to reduce the size of the WAL down to this limit. Which means it basically does a RESTART and partial truncate.

That’s one of the things I will verify in the following hours : until this morning our large instance didn’t have busy_timeout, so the checkpointer was basically not able to do its work and the WAL was nearly at 30GiB. If I understand correctly now the WAL should only grow briefly above the limit which is 8MiB. I still have a 24h period full truncate active in the code, so every 24 hours it should suddenly be truncated to 0 and grow back to around 8MiB.

That’s the theory, time will tell…

Another pragma that could help is :

PRAGMA synchronous = NORMAL

This relaxes the durability guarantees of SQLite. The database can’t be corrupted, but the last modifications can be lost in the case of a hardware crash.

Syncthing is resilient to losing the last modifications : at each start it rescans the local data and fetches from remote devices the information about their own data that it doesn’t have yet.

Currently the default mode is :

PRAGMA synchronous = FULL

This means that SQLite calls fsync or equivalent after writing each modification to the DB in the WAL. The change would only use fsync for checkpoints.
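For reference, the pragma can be verified with a quick round-trip (Python’s stdlib sqlite3 here, purely for illustration) :

```python
import sqlite3

# synchronous is reported as an integer : 0 = OFF, 1 = NORMAL, 2 = FULL,
# 3 = EXTRA. An in-memory DB is enough to show the round-trip.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA synchronous = NORMAL")
mode = conn.execute("PRAGMA synchronous").fetchone()[0]
print(mode)  # 1
```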

I’ll add it to my branch for testing.

I don’t know for sure yet which change triggered this, and I’ll have to reproduce it later to confirm, but Syncthing in my branch is now way faster when stopping and starting.

Before, the init script timed out on stop after 10 seconds on our system; now the stop is nearly instant. The start displayed a “migration in progress” message (or something similar) in the Web UI that could remain for hours (usually the DB WAL shrank after this). Now the GUI is usable right away.

I suspect busy_timeout is the indirect cause, because the WAL was kept very small, which means the work to be done when opening/shutting down access to the DB is probably far less. That said, Syncthing was still in the Scan phase and not yet “Preparing to Sync”, and I’m not sure the DB WAL grew much in this state before (hence the “I’ll have to reproduce it”).

For reference, Syncthing was running with these two other pragmas :

  • “cache_size = -16384”,
  • “temp_store = MEMORY”

Unfortunately this was just an isolated occurrence. It might be linked to the WAL, as the WAL was very small when it happened, and when the long stop/start happened again the WAL was several gigabytes.

There’s still one major thing to do in this branch to limit the DB load. The deleted files cleanup is still too slow : at least 600 ms on our largest folder. Using pagination to process it in steps is possible; if I don’t find a better approach I will use the sequence primary key to paginate, which would benefit from the index on it.
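A minimal sketch of that pagination idea (Python stdlib sqlite3; the files table, sequence column and deleted flag are illustrative stand-ins, not Syncthing’s actual schema) :

```python
import sqlite3

# Paginate a deleted-files cleanup on the sequence primary key, which the
# PK index makes cheap to range-scan. Toy schema and data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (sequence INTEGER PRIMARY KEY, name TEXT, deleted INT)")
conn.executemany("INSERT INTO files (name, deleted) VALUES (?, ?)",
                 [("f%d" % i, i % 2) for i in range(100)])

BATCH = 25
last_seq = 0
removed = 0
while True:
    # Each pass only scans forward from the last seen sequence number.
    rows = conn.execute(
        "SELECT sequence FROM files WHERE sequence > ? AND deleted = 1 "
        "ORDER BY sequence LIMIT ?", (last_seq, BATCH)).fetchall()
    if not rows:
        break
    conn.executemany("DELETE FROM files WHERE sequence = ?", rows)
    last_seq = rows[-1][0]
    removed += len(rows)
conn.commit()
print(removed)  # 50 : half the toy rows were flagged deleted
```

Between batches the real code could yield, so sync queries are never blocked for longer than one small batch.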

I think the effort spent on the cleanups should be more dynamic. For example, you might want to speed things up when the folder is idle and slow down when there is already a large load on the database (I’ve seen the cleanups delayed by the updateLock for more than 15 seconds on our largest folder; it might be better to slow down under these conditions). Over the 8 hours available to do a full cleanup, both states are possible, and even very likely several times, so it would probably be beneficial to adapt to keep the total load on large systems under control. That may be a bit of overengineering, so I’ll probably wait to see if there’s a need for this, especially if the code needed isn’t trivial.

Progress status

Current code behavior

Small instances

On small instances the behavior is similar to mainline : the cleanups are done in a single pass most of the time (after an initial ramp-up following startup to detect the appropriate pagination).

Large instances

Our largest folder in the “Preparing to Sync” state continues to make noticeably faster progress :

  • needFiles :
    • mainline average speed was around 200,000 files/day,
    • our branch progresses at more than 3,000,000 files/day,
  • bytes to sync :
    • mainline average speed was around 400 GiB/day,
    • our branch is running at around 900 GiB/day.

For both values mainline interleaved 8 hours of progress and ~12 hours of stagnation.

Our branch progresses continuously.

Current objectives

  • Keep users able to adapt the cleanup effort to their local situation, which is a bit tricky as some situations require hard compromises (how long can Syncthing wait for a cleanup, how long before data can be cleaned, how much battery do we use by waking Syncthing on mobiles, …),
  • Continue avoiding blocking Syncthing for several seconds, and if possible for more than 250 ms,
  • Try to use more information about the current Folder state to decide the rate at which the cleanups should proceed, to benefit from periods of low DB usage and slow down during heavy DB usage from the rest of the code. This should improve performance overall.

Plans for tunables

I usually try to refrain from adding too many tunables. If I can make the software decide by itself something that works, I prefer that over providing a tunable that could help users shoot themselves in the foot. But some tunables might end up being necessary. Here is what I have in mind.

Existing DBMaintenanceInterval

I think the DBMaintenanceInterval value should be interpreted as a target for the total time taken to clean up the DB, and no longer as an interval between punctual cleanups. The switch from 8 hours to 5 minutes in my PR (to reflect the interval between incremental cleanups) generated a question because it was not clear, so keeping 8 hours and adapting the incremental cleanup interval to meet it is probably better.

But this would be a target that could be missed on the largest folders : unless stalling the sync is an option, the best we can do for them is to log that we couldn’t process the cleanups fast enough.

Incremental Cleanup DB usage limitation

The arbitrary 250 ms target for the incremental cleanup duration might end up being a tunable, but for now I don’t see a need for significantly different values. It allows work to be done, is largely enough to do full table cleanups in a single pass on small folders, and doesn’t prevent Syncthing’s sync process from making progress (i.e. no hours-long stalls on large folders). So, coupled with keeping the existing DBMaintenanceInterval value, it has the nice advantage of not measurably changing the cleanup process for the huge majority of users.
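The 250 ms budget could be sketched like this (Python stdlib sqlite3; the table, batch size and helper name are invented for illustration, not the branch’s actual code) :

```python
import sqlite3
import time

# Sketch of a time-budgeted incremental cleanup : delete in small batches
# and stop the pass as soon as the budget is spent or the table is clean.
BUDGET_S = 0.250  # the 250 ms target discussed above
BATCH = 500

def incremental_cleanup(conn):
    deadline = time.monotonic() + BUDGET_S
    total = 0
    while time.monotonic() < deadline:
        cur = conn.execute(
            "DELETE FROM files WHERE rowid IN "
            "(SELECT rowid FROM files WHERE deleted = 1 LIMIT ?)", (BATCH,))
        conn.commit()  # short transactions, so readers are barely blocked
        if cur.rowcount == 0:
            break  # table fully cleaned within the budget
        total += cur.rowcount
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, deleted INT)")
conn.executemany("INSERT INTO files VALUES (?, ?)",
                 [("f%d" % i, i % 4 == 0) for i in range(2000)])
total = incremental_cleanup(conn)
print(total)  # 500 : every fourth toy row was flagged deleted
```

On a small table one call finishes the whole cleanup, matching the “single pass on small folders” behavior described above.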

Minimum delay between incremental cleanups

For the minimum interval between incremental cleanups I’ll hard-code something in a constant for now, but if it isn’t appropriate on some installations we could make it a tunable too. The minimum would be a hard minimum; a soft minimum could be added later to react to load, for example.

TODO

Here is what I believe is still to do before considering the PR for inclusion :

  • Fix the deleted files cleanup slow queries,
  • Remove the periodicCheckpointLocked related code (it is already inactive), as it is not functional under load and essentially addresses the same objective that tidy addresses differently,
  • Clean up the code and revert the DBMaintenanceInterval meaning to be equivalent to the mainline one.

This is the effect of

PRAGMA synchronous = NORMAL

or do you feel something else is also involved?

(except the obvious case of interleaving with maintenance)

Clearly this is in large part due to paginating the cleanups in maintenance. One maintenance run of mainline takes ~12 hours on our largest folder, which blocks all sync progress.

But this isn’t the only reason for faster sync : the current code is faster with the maintenance running constantly at regular intervals than mainline in the periods between maintenance runs.

And this has been measurable since the beginning, with a noticeable increase in speed starting on the 10th of February. This is just after I disabled the call to periodicCheckpointLocked (code committed on the 9th and probable restart mid-day on the 10th). This was called from the regular sync code, not from maintenance.

There are modifications that should have helped, but I can’t really isolate them : I would have to restart Syncthing for each of them, and the initial Scan still costs us around 7 hours for our 2 folders. Even if this is at least 3 times faster than mainline, it’s still too slow for frequent benchmarks on real data.

At the current rate the “Preparing to Sync” state should finish in 3 days. That leaves me with a bit of time to test some other modifications on fast HDD RAID.

I have two other servers to migrate but the underlying storage is mostly DC SSDs with low latency guarantees :

  • the first statistically reads from SSD ~95% of the time and from HDD ~5%, and writes to both HDD and SSD,
  • the second is 100% SSD.

It won’t be as easy to spot SQLite bottlenecks on them, and few people can afford these kinds of storage today (these are tens of SSDs that we fortunately had the opportunity to purchase before AI investors cornered the semiconductor market). So optimizing for these is probably not the best idea for common Syncthing users, even with large volumes.


I’m testing two late modifications :

  • cache_size auto-tuning (doesn’t change anything on small instances with less than 8000 files per folder; otherwise allocates enough cache to fit the files indexes in memory, based on the files row count estimate).
  • disable cache_spill : I noticed in the SQLite documentation that it can happen during long transactions (which I strongly suspect happens during folder updates, given the timings of the GC cleanups blocked by folder updates). When it happens it takes an EXCLUSIVE lock on the database, which I assume is not good for readers…

The theory behind the cache_size tuning is the following :

  • all the garbage collection tasks use queries that are sped up by one or more indexes on files,
  • if these indexes can be made to remain in cache the garbage collection queries should be a bit faster and they are a huge part of the load on large folders.

I inventoried all indexes involved, estimated how much they occupy on disk per row, and when opening a folder DB the code now computes, based on the files row count estimate, whether the default cache is enough to keep these indexes in memory. If not, it uses “PRAGMA cache_size = -size” (the minus isn’t a typo but an oddity of the cache_size PRAGMA : negative values are in KiB) to set a value capable of doing so.

For folders with less than 8000 files it changes nothing; for folders with 32 million files or more it allows using up to 8 GiB. From my experience, if you have folders with 32 million files, unless they are completely unused outside of Syncthing, 8 GiB isn’t much RAM to ask for…

For example :

  • on my laptop I only have one folder with more than 8000 files (and not many more) : the cache_size changes from a default of 1.95 MiB to approximately 2.7 MiB.
  • on our server, one folder has a cache_size of ~7GiB and the other ~2GiB.
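The tuning logic described above can be sketched like this (Python stdlib sqlite3 for illustration; the 200 bytes-per-row figure, names and threshold are invented placeholders, not the values actually inventoried) :

```python
import sqlite3

# Rough sketch : estimate index bytes per row, then set a negative
# cache_size (negative = KiB in SQLite) large enough to hold the indexes.
EST_INDEX_BYTES_PER_ROW = 200   # invented placeholder, not a measured value
DEFAULT_CACHE_KIB = 2000        # SQLite's default is -2000, i.e. ~2 MB

def tune_cache_size(conn, table):
    rows = conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    needed_kib = rows * EST_INDEX_BYTES_PER_ROW // 1024
    if needed_kib > DEFAULT_CACHE_KIB:
        conn.execute(f"PRAGMA cache_size = -{needed_kib}")
    return conn.execute("PRAGMA cache_size").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO files (name) VALUES (?)",
                 [("f%d" % i,) for i in range(50_000)])
size = tune_cache_size(conn, "files")
print(size)  # -9765 : 50_000 rows * 200 bytes / 1024, in KiB
```

The real code would use the row count estimate rather than a full count(*), which can be expensive on huge tables.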

@calmh : just saw your PR for cache_size configuration and commented on it to signal that I’m working on cache_size, trying to speed up the cleanups a bit more. Maybe we should coordinate ?

I have serious doubts about cache_spill : in WAL mode I don’t see a reason for a systematic exclusive lock on the DB when a transaction’s modifications exceed the cache capacity and must be written.

I suspect this pragma predates WAL mode and the documentation reflects the behavior before it.

The large folder has been in the “Preparing to Sync” stage for ~8 hours now, and the Syncthing process uses 1948M of RSS memory. This is far below the cache_size allowed for a single connection to our largest folder alone (7 GiB).

I don’t see much of a performance difference. In fact the needFiles decrease slowed down compared to the previous run while the needBytes decrease sped up, so there’s no clear difference.

I’ll let the current version run a bit more, as it is still making relatively good progress. But if nothing changes I think cache_spill can be left alone, and cache_size might be too (or more moderately increased, or left to the users to tune by themselves).

At first I was also very enthusiastic about cache size changes; it was the first thing I thought about, like “this will solve all the problems right now”. That was already half a year ago. The more I explored this, the more wrong it turned out to be.

Yes, it helps a lot in some corner cases, but only when you set it high enough to prevent 100% repetitive reads of common index nodes again and again. There are not a lot of these, however.

To give numbers : in any of my installs, 2 MB per connection (the default) is not enough, and 32 MB per connection is absolutely enough for an installation of just about any size. Any further increase shows only a cosmetic difference, even for 10+M file installs.

That’s my personal takeaways here.

That’s consistent with what I’ve seen up until now. I had raised the value from 2 to 8 MB among other changes, so even though I got positive results I wasn’t sure if the cache_size increase was responsible for part of them.

I don’t have any further optimization in the pipe right now so I’ll probably revert to progressively lower values in the next days to see if I can find a sweet spot that confirms your 32 MiB.

I’m focused on the DBMaintenance as it was clearly the cause of our slowdowns, and I was about to ask how Syncthing allocates connections across the whole application, to find out whether large cache sizes could eat a disproportionate amount of memory (16 connections with a 32 MiB cache each is 512 MiB, which is significant on low-end devices). If other connections are not doing much work and are recycled often, their cache_size might not be a problem, but I would have liked a clear picture.

But thinking about connection life cycle I just realized something.

The blocks and blocklists cleanups use a separate connection. I’ll have to look at it more closely, but the cache_size is completely useless if you reopen connections regularly. By splitting the blocks and blocklists cleanups into smaller steps I have helped the sync process, but I might very well have slowed down these 2 tables’ cleanups significantly. I’ll have to see if there’s a way to modify these cleanups to use the current connection, or at least reuse the same one.

Just reused the current connection without adverse effects. Looking into it, the reason for the separate connection seemed to refer to a previous DB schema : it mentions costly foreign_key checks that have to be disabled temporarily for performance.

But these two tables don’t have foreign keys pointing to them that I can find. I found foreign keys for the links between files and file_names/file_versions (which have been cleaned up perfectly fine without temporarily disabling foreign key checks) but none involving blocklists/blocks. So there should not be any foreign-key-related performance problems when deleting entries from blocks and blocklists.
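That check can be done mechanically : PRAGMA foreign_key_list lists the foreign keys declared by a table, so scanning every table for references to a target table works (Python stdlib sqlite3, with a toy schema that only mimics the shape of the real one) :

```python
import sqlite3

# Toy schema : file_names references files; nothing references blocklists.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blocklists (id INTEGER PRIMARY KEY);
    CREATE TABLE files (id INTEGER PRIMARY KEY, name_id INTEGER);
    CREATE TABLE file_names (id INTEGER PRIMARY KEY,
                             file_id INTEGER REFERENCES files(id));
""")

def tables_referencing(conn, target):
    """Return the tables whose foreign keys point at `target`."""
    refs = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for t in tables:
        for fk in conn.execute(f"PRAGMA foreign_key_list({t})"):
            if fk[2] == target:  # column 2 is the referenced table name
                refs.append(t)
    return refs

print(tables_referencing(conn, "files"))       # ['file_names']
print(tables_referencing(conn, "blocklists"))  # []
```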

@calmh did I miss something ?


Deployed to our server. The blocks cleanup seems marginally faster (the auto-adjusted block size is a bit larger, which means the queries are a bit faster). That’s far from a game changer though : it is still abysmally slow (in the “a complete cleanup would take a year” range of slow).

Note : even with this slow blocks cleanup, this is still more usable for us : as long as sync was blocked there were no files to remove, so no blocks to clean anyway…

I’m a bit stuck for one maintenance related optimization I wanted to implement.

The maintenance DB updates block the folder syncing process, so the whole point of most of the changes in our branch is to limit how long this blocking can last.

But the folder can be in a state where it doesn’t matter if maintenance slows down syncing :

  • idle,
  • sync-waiting,
  • scan-waiting.

So, as a last optimization, I wanted to change the limits imposed on maintenance based on the folder state, which would allow the GC to finish faster when a folder isn’t busy.

I found the FolderState code but I didn’t find a way to reach the appropriate objects from the Service and folderDB objects in db_service.go.

Can any dev give me advice on this ?

There’s something very surprising about cache_size, memory usage and DB connection handling.

After the last restart the memory usage of Syncthing seems to stabilize at around 350MiB instead of 1950MiB. According to my debug logs the instance still allows ~7GiB of cache_size for the largest folder and 2GiB for the other but this time I never saw the process reach even 500MiB.

The only difference is that I removed the separate connection to the database for the blocks and blocklists cleanups.

I assumed that the larger cache_size would easily fill up with a 60+ GiB DB, but it doesn’t. Syncthing even wrote nearly 40 GiB to the WAL since the large folder reached “Preparing to Sync”. So unless it only rewrites the same data, it should have had the opportunity to put quite a bit of data in the cache.

I’m left with 3 theories :

  1. the connections are recycled too fast for the cache to grow (if sqlx has a maximum lifetime before closing a connection and reopening it, for example),
  2. in the Scanning and “Preparing to Sync” states, Syncthing’s queries, even taking the maintenance queries into account, aren’t enough to load much into the cache,
  3. SQLite cache use doesn’t behave like a normal cache.

Honestly I’m not convinced by any of them :

  1. if connections were reopened by sqlx under the hood, most of the PRAGMAs would almost certainly not be reapplied (if that’s the case it’s quite a problem, and it’s unlikely it wouldn’t have been spotted earlier).
  2. the maintenance in Scanning managed to fully process the files and file_versions cleanups (I have debug logs that indicate the pagination fully covered the table when folder.GetDeviceSequence doesn’t change), and file_names was on its way to being fully cleaned too. So in theory the totality of the indexes on deleted, name_idx, version_idx and the files primary key should have been in cache.
  3. ??

That seems quite likely, in fact. Looking at @calmh’s PR about cache_size, it works by modifying the connection URL.

I suspect that most of the PRAGMAs in folderdb_open (the ones that won’t survive a reconnection) should be cleaned up and either :

  • moved to the connection URL,
  • use a ConnectionHook

I’ll have to dive a bit into the ConnectionHook interfaces of both drivers.

For auto-tuning the cache_size, modifying the connection URI is probably not the right place : you must already be connected to find out how much cache is useful.
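The per-connection nature of these PRAGMAs is easy to demonstrate (Python stdlib sqlite3; a pooled driver reopening a connection behaves like the second connection here) :

```python
import os
import sqlite3
import tempfile

# PRAGMAs set imperatively are per-connection state : a fresh connection
# (as a pool would hand out after recycling) is back to the build default
# (-2000, i.e. ~2 MB, on stock SQLite builds).
db = os.path.join(tempfile.mkdtemp(), "t.db")
c1 = sqlite3.connect(db)
c1.execute("PRAGMA cache_size = -32768")  # 32 MiB, this connection only
v1 = c1.execute("PRAGMA cache_size").fetchone()[0]
print(v1)  # -32768

c2 = sqlite3.connect(db)  # a recycled pool connection looks like this
v2 = c2.execute("PRAGMA cache_size").fetchone()[0]
print(v2)  # back to the build default, not -32768
```

Hence the appeal of a connection hook or URL parameters : they are applied to every new connection, not just the first one.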


I guess connection locality might also be a thing? Maybe you warmed up the cache of your connection but get a different one for the next query?
