Is Syncthing v2 with very large folders possible ?

There’s another radical idea I’ve toyed with, for at least some scenarios. The only purpose of the blocks table is to optimise pulls, by being able to look up blocks for reuse. Devices that are mostly intended to send files and only rarely pull, or where you don’t care much about pull efficiency in terms of bandwidth, don’t really need the blocks table to begin with.

It’s also possible to easily recreate it from blocklist data, so a larger hammer for maintenance purposes could be to simply drop and recreate it.

to refresh my findings in respect to the latest context,

I am running both personal and corporate with these altered -

  1. MaxConns limited to 6 instead of 16, to better utilize Page Cache, becuase it is per-connection and not shared (a pity). This runs fine. For folder DBs only.

  2. PageCache increased a lot. To your taste.

  3. To increase PageCache REALLY a lot, I am adding SetConnMaxIdleTime(10 * time.Second), so it closes the connection, together with associated page cache, ASAP.

This also runs fine, but more SQLITE_BUSYs, because the commit is uncontrolled in this situation - when last DB conn is closed, it is doing the commit, and new conns just cant connect. So I’ve just enabled the retry handler.

This all works just fine but it is a series of hacks demonstrating the concept, instead of proper implementation.

Also,

  1. Indexes system altered, but this conflicts with current DB layer API. Another story.

Nice. We should be able to adopt some of that as a tunable. Also, explains the SQLITE_BUSY you see; with default settings we should never close the last database connection, so that shouldn’t happen.

in this my design, I faced BUSY on first very day, and this is the thing 100% clear and not into the issues, for sure.

My repo issue is for vanilla build. The point why it happened catched by @bt90 correctly as it seems - there is still possibility for that.

I keep thinking if readers/writers benefit equally from a larger page cache. A cache miss for a reader has a bit higher latency as you’ll hopefully hit the OS filesystem cache.

But writers get hit by a low cache a lot harder:

  • necessary reads of pages for the query are slower. This is the same as for readers, but with writers being a 100% sequential thing, work piles up
  • dirty pages are kept in the cache before being written to the WAL. If the cache is too small, we get additional IO from cache spilling
  • writers also need to update the affected indices. That is also improved by a larger cache size

My point is that it might be better to give a larger portion of the RAM we have reserved for a folder DB to the writer instead of sharing it equally among all connections.

my intuition is that it does not matter,

but the proposal to issue PRAGMAs before each new connection role in a connection seems a bit strange for me because can introduce undesired side effects from inside SQLite which we will be discussing 10+ more pages.

If you insist it is required, maybe just open folder DB twice? With MaxConns(1) for writes and MaxConns(15) for reads? My quick googling shows it is allowed in SQLite Go wrapper as it seems.

This may introduce additional benefits such as remove the write mutex alltogether, if you do all the writes within the pool with MaxConns 1.

UPD: or if not, any way to identify specific connection from the pool and dedicate it.

UPD2: also, write connection cache will be more relevant for current write load.

That’s a bit more complex. The OS cache isn’t as fast as the DB cache because the DB code :

  • has to parse the data from the OS cache to write the in-memory representation which is different,
  • the OS cache doesn’t know that for example a 4k page (or whatever the OS cache granularity is) only has 1 byte of useful data for the DB and will keep the whole 4k in the cache if this 1 byte of data is requested frequently.

Caching is a very useful tool but it can be a very tricky one. As a general rule unless you know that your particular problem has already be tuned you’ll have to benchmark to find up to which level growing the cache is still beneficial and choose a value at or below this level to make everything fit in memory safely.

I’ve forgotten about one other idea : using PRAGMA temp_store = MEMORY.

What triggered this idea is that I saw the .tmp directory in index-v2 being accessed in production but very briefly. I suppose that SQLite decides to store some temporary data in it that isn’t very large (if it were piles of gigabytes I would see the IO and I don’t) and could be faster to write in memory. Or maybe Syncthing itself writes something in this directory ?

I think we had this at some point? But I agree it’s worth trying. Given the fact that it doesn’t show up as problematic in your setup indicates that it should be safe.

That’s my proposal since the initial sqlite PR :wink:

Reading the sqlite source code seems to confirm that the documentation is OK. But I just found out that the checkPointer uses the same busy_timeout handler mechanism to wait for WAL readers and lock the WAL access (in walCheckpoint in src/wal.c).

I’ve not yet traced the origin callback used as the busy timeout handler but if it is the default one this means that any reader concurrent to the checkpoint would transform a CHECKPOINT_TRUNCATE into a CHECKPOINT_PASSIVE.

This could mean that there’s a very good reason to use busy_timeout on systems that almost never are idle : this would actually increase drastically the chances of truncating which is apparently non functional on very busy instances.

Done : it’s the one set in the Pager, which is the one set through the public API. So my understanding is that there’s only 2 paths to allow truncate :

  • prevent all connections from querying the DB before triggering the checkpoint TRUNCATE and unblock them after,
  • rely on busy_timeout to let SQLite do it for us.

It won’t surprise anyone that I favor the second approach. That said for it to work reliably I believe the busy_timeout value must be greater than the longest query/transaction that can happen concurrently to guarantee success.

The longer the busy_timeout, the longer the checkpoint will block new readers/writers while waiting for the existing ones to finish. So having short queries and transactions is especially important.

To help with the truncate and avoid long freezes, we have to make regular checkpoints (maybe not all truncate or with lower busy_timeout to avoid them blocking new readers/writer for too long).

There might be ways to make this more clever by fetching the amount of data in the WAL already checkpointed and its total size. Checkpoint NOOP can provide such data but that’s another subject.

and why not? seems no any single drawback

Now when we know more about cache, this may sound more solid than before. I dont believe in importance of different sizes for these caches and no idea how to size them. But I do believe that dedication of the caches to read or write context will improve, for sure as of me.

Note about SQLITE_BUSY. My memory was a bit fuzzy but I thought I saw something that could create them for Syncthing even with the updateLock.

The second point was what I half-remembered :

When the last connection to a particular database is closing, that connection will acquire an exclusive lock for a short time while it cleans up the WAL and shared-memory files. If a separate attempt is made to open and query the database while the first connection is still in the middle of its cleanup process, the second connection might get an SQLITE_BUSY error.

That is indeed very unlikely but on moderately busy instances many not be impossible.

I’ve not yet looked at the connection pool management if any but I’d guess that very busy instances always have at least a connection opened. At the opposite point of the spectrum the mostly idle instances probably open connections too infrequently to hit this case.

Except the checkpoint doesn’t block readers?

FULL blocks concurrent writers while it is running, but readers can proceed.

Hum, you are right. Technically it blocks them by blocking each of the WAL slots for the readers but I missed that it unblocks each slot right after having the lock.

If I understand correctly this is actually how it makes sure that new readers use the state at the end of the WAL to be able to truncate it.

And finaly no I was right, I looked at the wrong part of the method. This is only valid up to FULL. At the RESTART and TRUNCATE levels there is another round of walBusyLock later in the process that waits for all readers. Then truncate and then unlock access.

This is actually described in the doc on the RESTART level.

I must admit I didn’t read the code path the readers take.

But I believe it unlikely that they wouldn’t be affected by an explicitely exclusive lock that is used to wait for them to stop a few lines above for the first round of walBusyLock where it is used to wait for each slot individually.

In the first round each slot is unlocked just after the lock being acquired. In the second round all slots are locked at the same time and then unlocked at the same time.

So technically SQLite can write that readers aren’t blocked because they are just a bit delayed until the last one finish its job and then almost immediately all of them can resume. So there’s almost always one reader working.

But in practice this will block some readers.

Edit : now I have a doubt. If a reader can determine that there’s no information in the WAL while all WAL reader slots are locked it may not have to take one of these slots. Damn, I’ll have to find the reader code path to make sure.

Are we talking about a “stop the world” block or would it merely trigger the busy handler if you’re unlucky enough? Because the docs are pretty clear that the first scenario doesn’t happen and we shouldn’t really get spooked by the second as we already concluded that a reasonably sized busy timeout is a good idea.

I’ll get back to you. I should be in bed and I’m trying to keep enough of the code base of sqlite in mind which is new to me to get through…

That is exactly what happens according to walTryBeginRead in wal.c

So in conclusion, you were right the readers are not blocked in practice.

There are probably a few very minor slowdowns given the amount of code a reader must hop through with some contention around the Wal index that resides in shared memory and must be consistent when accessed. But there’s absolutely no waiting for IO from the checkpointer in the path so these are not the kind of slowdown we could be concerned about.

Edit : in fact the readers “take a slot” but use a shared lock on it, not an exclusive one. But it seems the xShmLock I’ve met along the way is something specific to sqlite and not the usual shm lock I remember, I might have misunderstood some parts of the code where I didn’t realize that there were differences yet.

I’m tired and wondering how my brain will process this in the next hours where I should sleep soundly…