Certainly seems like it would be yes. I guess I think most setups spend a fair amount of time idle, but for sure there will be those that don’t.
The truncate is indeed there to make sure all the pages in the WAL are written to the main db, and then to truncate the WAL (to some nonzero size if desired, currently 64 MiB in our code). The truncation does need to happen from the end, so nothing is allowed to hold a read lock on those pages.
In practice this sounds like we need to periodically grab a super lock and prevent all readers and then run a PRAGMA wal_checkpoint(TRUNCATE)… (Currently we do that every 8h but without the super lock thing.)
We could be at least reasonably smart about it and keep a counter so we do it after every like 5000 write transactions or whatever.
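Very roughly, and purely as a sketch: a goroutine that checkpoints either after N write transactions or on the existing 8 h schedule, holding the write side of a lock so no readers remain. The dbLock RWMutex, the writeTxCount counter and the thresholds here are assumptions for illustration, not existing code.

```go
package wal

import (
	"database/sql"
	"sync"
	"sync/atomic"
	"time"
)

// Assumed, not existing code: dbLock is an RWMutex that every reader takes
// with RLock and every writer with Lock, and writeTxCount is bumped after
// each committed write transaction.
var writeTxCount atomic.Int64

const checkpointEveryWrites = 5000 // arbitrary threshold, as discussed

func checkpointLoop(db *sql.DB, dbLock *sync.RWMutex, stop <-chan struct{}) {
	check := time.NewTicker(time.Minute) // how often we consider checkpointing
	defer check.Stop()
	lastRun := time.Now()
	for {
		select {
		case <-stop:
			return
		case <-check.C:
		}
		// Checkpoint when enough writes have accumulated, or at the latest
		// on the current 8 h schedule.
		if writeTxCount.Load() < checkpointEveryWrites && time.Since(lastRun) < 8*time.Hour {
			continue
		}
		writeTxCount.Store(0)
		lastRun = time.Now()

		dbLock.Lock() // the "super lock": no reader may pin pages in the WAL
		_, err := db.Exec(`PRAGMA wal_checkpoint(TRUNCATE);`)
		dbLock.Unlock()
		_ = err // a real implementation would log and retry
	}
}
```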
Of course this will easily turn into deadlock palooza
It’s not just readers. If there is a write after the checkpoint has started, the checkpoint can’t finish, as it doesn’t know how to roll over to a new WAL file. The data from the WAL gets applied to the db, but the file stays as big as it was.
I think some of my real-world setups are like that. Basically, a lot of small folders syncing something all the time. The amount of synced data is usually quite small, but (at least some of) the folders keep working constantly.
While that may well be the case, note that “working constantly” from a human point of view is maybe quite different from the database point of view, the latter being “overlapping write and read transactions at literally every point in time always”. A few files changing every few seconds will very much not be that, for example, even though the disks may be churning all the time.
Yup. I wonder if we could solve this in practice by making our read transactions much shorter. Currently we do the natural thing: loop over file infos, put them in batches, send them over the wire. This essentially means our read transaction is kept open for as long as the other side takes to write the file infos. We could make each batch a separate read transaction, so that instead of selects lasting up to 5 seconds or a little more we’d break them down into selects lasting milliseconds, and do the blocking send outside of any transaction. This might give the checkpointer a better fighting chance…
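As a sketch of what that could look like; fileInfo, the files table and sendBatch are stand-ins for the real types, schema and wire send, the point is only the transaction scoping:

```go
package wal

import "database/sql"

// fileInfo is a placeholder for the real file info type.
type fileInfo struct{ name string }

// sendIndexes gives each batch its own short read transaction and does the
// (potentially slow) network send outside of it.
func sendIndexes(db *sql.DB, sendBatch func([]fileInfo) error) error {
	const batchSize = 1000
	offset := 0
	for {
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		batch, err := loadBatch(tx, offset, batchSize)
		tx.Rollback() // read-only; close the transaction before sending
		if err != nil {
			return err
		}
		if len(batch) == 0 {
			return nil
		}
		// This may block for seconds while the other side persists the
		// batch, but no read transaction is held open meanwhile.
		if err := sendBatch(batch); err != nil {
			return err
		}
		offset += len(batch)
	}
}

func loadBatch(tx *sql.Tx, offset, limit int) ([]fileInfo, error) {
	// Real code would continue from the last seen key instead of OFFSET,
	// since the snapshot can change between batches.
	rows, err := tx.Query(`SELECT name FROM files ORDER BY name LIMIT ? OFFSET ?`, limit, offset)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var out []fileInfo
	for rows.Next() {
		var fi fileInfo
		if err := rows.Scan(&fi.name); err != nil {
			return nil, err
		}
		out = append(out, fi)
	}
	return out, rows.Err()
}
```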
But it’s probably for the better to have this managed by a dedicated goroutine. Do a regular checkpoint and if that fails resort to a RESTART checkpoint.
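Roughly like this sketch; the one-minute interval and the escalate-when-not-everything-was-checkpointed rule are just assumptions:

```go
package wal

import (
	"database/sql"
	"time"
)

// checkpointer tries an ordinary (PASSIVE) checkpoint first and, if it
// could not move all WAL frames into the database, escalates to RESTART,
// which waits for readers before letting the WAL be reused from the start.
func checkpointer(db *sql.DB, stop <-chan struct{}) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
		}
		var busy, logFrames, checkpointed int
		if err := db.QueryRow(`PRAGMA wal_checkpoint(PASSIVE);`).Scan(&busy, &logFrames, &checkpointed); err != nil {
			continue // a real implementation would log this
		}
		if checkpointed < logFrames {
			// Some frames stayed pinned by readers; try harder.
			_ = db.QueryRow(`PRAGMA wal_checkpoint(RESTART);`).Scan(&busy, &logFrames, &checkpointed)
		}
	}
}
```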
What happens if there are multiple devices exchanging indexes at the very same time? For example, I’ve done a test with 3 devices (actually Syncthing instances on the same computer), all starting with the same folder, pre-populated with files but unshared. I then shared the folder between all three at the same time.
Doing so brought the whole system to a crawl, and this is on a Ryzen 9950X with 96 GB of RAM, with the database located on a RAM disk.
The folder itself is the same as I was testing previously today, i.e. on the larger side but nothing too extreme:
The situation only stabilised once I limited the index exchange to just two devices at the same time (by pausing the folder on the third one).
Using shorter read transactions and periodic checkpointing solves this for me. Each of them on its own resulted in a reduction of the WAL size; together the result is altogether reasonable in my setup. With a DB size of 2.5 GB I previously got a WAL size of ~2.2 GB after --debug-reset-delta-idxs; now it stays at ~128 MiB.
This would mean you get conflict resolution and index entries bouncing back and forth for every file and pair of devices, so it’s a bit worse than normal worst case for most setups.
@bt90 BTW I noticed an oddity: I tried PRAGMA wal_checkpoint(TRUNCATE) like we do in the slow periodic, and while it worked, it truncates the WAL back to zero bytes regardless of our PRAGMA journal_size_limit = 67108864. I tried moving that pragma to just before the checkpoint as well, with no difference. From the docs I’d think it should have the effect of leaving the WAL at that size, but it appears not to in practice.
Nope, connection pooling issues were the cause. Need to run the journal size limit pragma in the same connection just before the checkpoint.
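For reference, a minimal sketch of that with database/sql, pinning one connection from the pool for both statements:

```go
package wal

import (
	"context"
	"database/sql"
)

// truncateWAL runs both pragmas on one pinned connection, since
// journal_size_limit is a per-connection setting and a plain db.Exec on a
// pooled *sql.DB may use a different connection for each statement.
func truncateWAL(ctx context.Context, db *sql.DB) error {
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close() // returns the connection to the pool

	// 64 MiB, matching the value above.
	if _, err := conn.ExecContext(ctx, `PRAGMA journal_size_limit = 67108864;`); err != nil {
		return err
	}
	_, err = conn.ExecContext(ctx, `PRAGMA wal_checkpoint(TRUNCATE);`)
	return err
}
```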
I’ve tested the improved handling of WAL. During the index exchange phase, the WAL file kept growing up to ~800 MB, then shrinking to <100 MB. The process kept repeating itself until the index exchange was done. This is with a database sized at 1.5 GB.
Just a completely separate question, but is there any particular reason not to continue using db for database debug logging instead of the current sqlite? Is the old db logging used for anything else?
The wal_checkpoint pragma returns a single row with three integer columns. The first column is usually 0 but will be 1 if a RESTART or FULL or TRUNCATE checkpoint was blocked from completing, for example because another thread or process was actively using the database. In other words, the first column is 0 if the equivalent call to sqlite3_wal_checkpoint_v2() would have returned SQLITE_OK or 1 if the equivalent call would have returned SQLITE_BUSY. The second column is the number of modified pages that have been written to the write-ahead log file. The third column is the number of pages in the write-ahead log file that have been successfully moved back into the database file at the conclusion of the checkpoint. The second and third column are -1 if there is no write-ahead log, for example if this pragma is invoked on a database connection that is not in WAL mode.
So, I ignored the return values because I didn’t fully understand them and they didn’t seem actionable. I can understand the part about auto checkpoints blocking manual checkpoints and neither progressing, except I thought auto checkpoints happened as part of the transaction commit, and that is guaranteed not to be at the same time as our manual checkpoint…
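For the record, the three columns are easy enough to read and at least log; a sketch, assuming database/sql:

```go
package wal

import (
	"database/sql"
	"log"
)

// logCheckpoint runs a TRUNCATE checkpoint and interprets the returned row
// instead of ignoring it, so the "blocked" case from the docs becomes visible.
func logCheckpoint(db *sql.DB) {
	var busy, logFrames, checkpointed int
	if err := db.QueryRow(`PRAGMA wal_checkpoint(TRUNCATE);`).Scan(&busy, &logFrames, &checkpointed); err != nil {
		log.Println("checkpoint:", err)
		return
	}
	switch {
	case logFrames == -1:
		log.Println("checkpoint: database is not in WAL mode")
	case busy == 1:
		// The SQLITE_BUSY case: something else kept the checkpoint from
		// completing, so the WAL was not (fully) reset.
		log.Printf("checkpoint blocked: %d of %d WAL frames moved", checkpointed, logFrames)
	default:
		log.Printf("checkpoint ok: %d of %d WAL frames moved", checkpointed, logFrames)
	}
}
```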
Turning off auto checkpoints seems like a net loss, since then we’re down to only managing it ourselves, and for small setups I suspect we’re guaranteed to let the WAL grow larger than it needs to be.
We can try it though and see what happens, it should be a fairly easy change.
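If we try it, the change is presumably just this pragma at connection setup time; sketch only, and note it’s per-connection, so it would have to run on every pooled connection:

```go
package wal

import "database/sql"

// disableAutoCheckpoint turns off the automatic checkpoint that otherwise
// runs at commit once the WAL exceeds 1000 pages, leaving checkpointing
// entirely to our own routine. Shown once here for brevity.
func disableAutoCheckpoint(db *sql.DB) error {
	_, err := db.Exec(`PRAGMA wal_autocheckpoint = 0;`)
	return err
}
```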