this seems exactly what I said earlier about maintenance
One benefit is that it would be very easy for experienced PostgreSQL DBAs to monitor, tune and provide advice based on what they see. I can pinpoint costly/inefficient accesses to a PostgreSQL DB but I am a complete newbie when it comes to reading Go code.
The drawbacks I see are :
- external dependency,
- maintenance complexity.
Users would have to configure PostgreSQL themselves (at least install PostgreSQL, create a user/database that Syncthing can access). Trivial for any DBA but not “plug and play” so only a small minority of users would go that path and you’ll have to support both SQLite and PostgreSQL.
Note that even though you probably have a Go abstraction layer (something like JDBC or Perl’s DBI) to hide the connection layer you might have to use different SQL syntaxes for SQLite and PostgreSQL to better leverage their individual capabilities. At least I’ve seen some TRIGGER uses, they most likely will have to be adapted.
BTW, I’m looking at the database schema and I might not understand everything because I lack the code context and experience with SQLite but there are two things that I would investigate :
-
blocks reference to blocklists
The blocks table seems to reference the blocklists table through the blocklist_hash value which is a 512 bit hash if I’m not mistaken. There should be far more blocks than blocklists entries (from my understanding of the documentation you seem to target ~1000 blocks per file for large files). So you might benefit from a simple 64-bit ID in the blocklists based on a sequence to reference it from the blocks table. It should reduce the blocks table size by ~30% (total data payload per row is 152 bytes and it would reduce it to 96 bytes) and raise the blocklists size by a very slight amount (I’m not sure what blprotobuf stores in this table but it can be huge). I don’t see even the largest Syncthing installation overflowing a 64-bit integer for blocklists over their lifetime.
-
TRIGGER usage
I’ve seen that files modification updates two tables using triggers. In particular counts is updated with it and it can be wasteful in some cases. If you usually update the files table with very short transactions involving a handful of updates the triggers don’t matter. But if you have long transactions with very large numbers of modifications to the files table (which you might want to limit IO) the triggers may hurt you and would be more efficiently handled by the code (accumulating the size and count modifications in memory until just before the transaction commit where it would apply it). I’m not sure how SQLite deals with them exactly as I’m not familiar with the isolation levels between transactions but I suspect given the WAL sizes I see that it stores each and every modification in the WAL and at commit they have to be applied individually.
Yes, around 60GB since the last restart when I activated the case sensitive fs option.
Currently :
FILE | size | last modified
folder.0002-qioaq2fp.db | 60 539 142 144 | Jan 27 20:31
folder.0002-qioaq2fp.db-shm | 369 754 112 | Feb 4 12:35
folder.0002-qioaq2fp.db-wal | 33 107 244 712 | Feb 4 12:35
Current DB abstraction layer is kind of "worst possible” bottleneck anyway. It has both wrong things like APIs which are impossible to implement efficiently, together with all efficient APIs like batches etc completely missing.
My impression is that this API should be seriously reconsidered to get any real gain here, all we discuss is cosmetic that will not help a lot. Yes we can save a lot improving schema etc and make it at lease fit some caches again. But many of these internal DB API calls just got so many magnitudes worse since v1, that no better DB will solve.
I was also thinking to just add MySQL there, but looking at the API, I settled with impression that will be of small use.
UPD: my point is that for these scalable installations there is no need to do anything what will help like 50%. It just don’t worth time spent. For sync of the phone pics to the main PC, current approach is very solid and no need any of 50% improvements anyway. For millions of files, these 50% will be just as slow as now.
UPD: maybe just add some better batching right now? Ignore all SQLite shortcomings as given const, and go to app level improvements.
Final UPD: I am sorry to being close to impolite that I am just writing all the time “do this, do that”, without real contributing. I understand this can be annoying.
Note that the two subjects I find worth investigating have (for my limited point of view) potential for at least 10x performance improvements for some computations.
The blocklists integer primary key would not only reduce the memory usage but significantly speed up searching for blocks part of a blocklist and block creation. Comparing 512 bit blobs is far slower than comparing 64 bit integers. Even using a B-tree index doesn’t make the difference disappear.
In fact looking at the blocks table I wonder how blocks lookup work in Syncthing. I assumed this would be a regular occurrence and there would be an index usable for this lookup. But the only index on the blocks table seems to be created for its primary key :
PRIMARY KEY(hash, blocklist_hash, idx)
If Syncthing needs to find all blocks for a file it seems it gets the blocklist_hash from the files table and then has to select all entries in the blocks table with this blocklist_hash.
Unless SQLite is doing something very clever behind the scene, the primary key should create a btree index based on the ordered list “hash, blocklist_hash, idx”. This is very efficient when you need to find a block by its hash but completely unusable for finding blocks by their blocklist_hash. What could be looked up in mere milliseconds if there was an index on blocklist_hash will need a full table scan and depending on the storage capabilities and current load on it could take several minutes (Edit : and I need to stress this, only for one file) for a table that could very well be several tens of gigabytes.
That’s 8h of radio silence. As the incremental vacuum run has no log output, it would fit to my assumption.
Comparing 512 bit blobs is far slower than comparing 64 bit integers.
That’s also my proposal and our internal implementation for 10M+ installs - we replaced all indexes with hash indexes (4 bytes). True that current index lookup is a huge waste of page pressure.
We also did that for files table - there is absolutely the same there, even a lot worse.
And yes it helps A LOT - at least in our install this makes “not working” situation into “working again”.
But this is fix that cannot go global without reconsidering DB API layer.
Your proposal is a subset of that where this can be done without breaking any API, so yes a very good point.
when I was disabling maintenance on my side (months ago), I first checked if it was full vacuum maybe. My idea was to shut down a few times between these maintanance cycles, and do a vacuum with SQLite tools. In my situation, vacuum using tools, in any mode, was A LOT faster than I observe folder locked during maintenance. I concluded it is not the vacuum. But this was my situation. Not sure if it is globally valid.
I wanted to test the blocks lookup speed change after creating an index.
Unfortunately that isn’t allowed by SQLite (probably because it relies on file locks and only one writer is allowed at a given time). I’m so used to PostgreSQL that I didn’t expect it ![]()
sqlite> CREATE INDEX blocks_by_blocklist ON blocks(blocklist_hash);
Runtime error: database is locked (5)
As the folder scans take almost a day I would have to wait ~2 days to see if it makes a difference. Can someone guess how often block lookups by blocklist_hash happen ? If in my configuration (29 million files with several thousands added each day on the send-only side) there are more than 10000 lookups each day it is probably worth stopping Syncthing to test this.
PRIMARY KEY(hash, blocklist_hash, idx)
during normal operation, this index seems to be enough for doing that because in the code it seems that you always query for both with AND. This index is indeed huge howerver, (see above on hashing it, wont repeat, we reduced it to 4 bytes)
(not sure about maintenance part however, these “delete where not” parts, we also just deleted the maintenance, but these are are interleaved with “GC was interrupted due to exceeding time limit” messages - so this part is probably under some control and if no messages, than, no this part matter)
UPD: my impression is that maintenance is kind of OK to do for a long time, just for large installs, not every 8 hours. Disable it altogether and do once a month. Will work just ok as well. The patch on that that made it controllable is now mainstream, so not a big deal here. You will want to disable it ANYWAY. So my advice to do so and ignore maintenance for large setups. You will be faster to the finish if you restart it without that right now anyway.
Judging from the logs it very likely is. The only thing happening between GC was interrupted and Skipping unnecessary GC is the invocation of tidy() which in turn consist of incremental_vacuum followed by a truncate checkpoint.
this seems more likely to me by the way
but in any situation, WAL growing comparable to DB - is a wrong DB batch design (missing batches), as pointed out earlier. So maybe the solution to both vacuum and commit is to avoid WALs of such a size, by improving batching?
It’s most likely the combination of both. The incremental vacuum run might blow up the WAL to a huge size which the checkpoint invocation needs to feed back to the DB.
quick googling says incremental vacuum does not grow WAL
UPD: well grows of course, but at least not that size
UPD: but your proposal to pass pages N to it will validate this anyway
my counter-argument to WAL size related to commits / vacuums: it grows within first 8 hours of operation as well, actually, it can easily explode to more than DB size within an hour after the start.
In any situation, the real improvement will be to wider the API and go in details from the app side of things. You just need here all that the solid app will require from the ground up in any new design, - not “*all” calls but "top any”, not update, but start update / end update, etc. API changes.
Can we keep the scope of the discussion a bit more narrow? The whole “the API needs to be rebuilt from scratch” only adds noise to be honest.
I’d like to focus on small and actionable steps. There’s a smoking gun right there in the logs. If that’s addressed, we can move to the next finding.
There might be a lot of optimization potential in the way we’re doing queries and how the current database model is layed out. But throwing a mixture of many of these problems in every second post isn’t helpful (or at least for my brain).
sure, was sorry for that already and I understand this myself,
but consider these two considerations:
-
What exactly can help to speedup the commit of 100GB WAL file? I am just writing the concern that addressing it with “let do it 50 pages at a time” is too late and the cause is earlier, and if it grows that large, nothing on earth will make the commit efficient.
-
Also many times this is kind of meta-discussion on when the sum of efforts to get all the low hanging fruits is just the situation when you just keeping the bad bet and not accepting the efforts loses in time. Faced this many times in my own long living projects like I may be struggling for years and then after at last I implement it right in a week, kind of, “why the hell not did this earlier??”.
But sorry again and I will cease the noise here.
I agrre that we need to find out what’s causing the WAL to grow that large.
The good thing though is that the vacuum change might solve it. And even if it doesn’t we didn’t waste time as it’s still an improvement ![]()