Managed by the runtime : yes (although I’m not sure how much of an advantage this is, I don’t care what is responsible for this as long as it works properly)
zero CPU : just spinning would use CPU. But :
by retrying we will call SQLite which I assume is not just spinning :
writers simply block so almost 0 CPU,
readers fail on conflicting transactions, but only after waiting busy_timeout for the transactions to resolve themselves. The cost in CPU is parsing the query and detecting the conflict. After the conflict detection there should be 0 CPU until SQLite unblocks (timeout expired or conflict resolved).
nothing prevents us from retrying with some basic exponential back-off which uses 0 CPU between retries.
fairness : that’s the job of SQLite when accessed by several clients. I’m not aware that it doesn’t fairly unblock the writers (for the readers there’s no real queue to manage). I assume each writer accesses a shared lock managed by the operating system, which both costs 0 CPU while waiting and is fair (at least Unix provides a plethora of locking mechanisms that should match these needs easily)
Spinlock : strictly speaking I wouldn’t use a spinlock but a retry with back-off.
A retry doesn’t cost anything at all in the main path : it would test the error returned by SQLite like almost all queries already do. So no error → no cost. Technically it would even cost less than a Mutex (arguably not measurably less) in the usual case.
There are downsides to the Mutex too in the current code (that could change somewhat if they were used in the lower levels or modified though) :
they are very intrusive as in most cases they :
wrap the code in a func,
lock,
add a defer for the unlock
they don’t have any context so they can’t be extended and can’t be adapted for priorities if need be. The code is currently blocked without any opportunity to signal a long block to higher layers or log.
Not having code running between retries prevents us from having some control and visibility. You can’t choose to abort, you can’t choose a retry frequency to implement priorities : the fairness here is a limitation.
I’ve read it but I’m trying to validate this by reading the code. SQLite wouldn’t be the first database lacking in the documentation department. I’ve learned the hard way by finding a bug in MySQL maybe 15 years ago on Windows that broke the durability guarantees…
I have preliminary ideas on how the checkpointer could do partial updates to the DB files while the readers are accessing it but I see many problems and I’m not yet familiar enough with the on disk format to reach a conclusion.
I feel a simple single write mutex is OK for writes, but it can also be present in paths with get semantics, such as in GetIndex something (sorry, on the phone), or, indirectly, in any of the paths with SetKV with some last-seen stats, for example.
In the current approach everything seems fine to me, except this weak point.
With respect, I think you overestimate how smart the SQLite busy handling is, and underestimate how smart the Go mutex handling is. The SQLite busy handler is just a loop around short sleeps with retries. Exponential backoff would make this worse as we’d spend even longer just waiting to be able to talk to the database. Go mutexes maintain a queue of waiters with immediate handoff from an unlocker to the next locker.
On isolating readers vs. commit, I vote for very high-level app logic modifications: not catching readers at the SQL layer, but introducing some higher-level backoffs instead. Like stop the world, but from the model perspective. It is fine that this is not a guaranteed procedure, as it will not control readers directly. But it may bring other benefits as well, such as a global backoff knob, which many of us implement anyway via priorities or GOMAXPROCS.
It will hang anyway. Why not do it controlled then?
UPD: facing a locked mutex on a lower layer is the worst possible outcome, it seems to me
The first part is indeed possible, SQLite has evolved quite a bit (I wasn’t even aware it supported TRIGGER a week ago…) from what looked initially like a minimal implementation of SQL on top of a single file. This leaves me with quite a bit to learn about it… which is why I’m reading its code.
For the second part, given what Go was designed to do I don’t have any reason to underestimate it.
But to put things in perspective : we are speaking nanoseconds vs milliseconds. updateLock preventing SQLite from wasting milliseconds has slowed one folder to the point that I’m still waiting for it to finally begin to sync again after 2 months…
I think we are seeing the edge of the woods for this folder. But you’ll understand that with my recent experience I’m more willing to accept wasting some milliseconds to gain more control on how the queries are fed to SQLite.
It seems that there are two problems involved here.
You are optimizing maintenance, which is just GREAT. No kidding. But a practical approach would also be to just do it rarely, and interleave it with restarts/pauses/… to make a commit.
The other part of the problem is that you are getting about 20 file updates per second, as I computed from your graphs (I may be wrong, correct me then). That is just 20 fsyncs per second to the storage, one on each update, on each new WAL record. A very typical TPS for plain RAID fsync. This can be faster or slower, but if you respect write barriers, it will be this TPS no matter what.
It is A LOT slower than anything related to locking etc. And it will stay that way no matter what, until someone introduces batches here. Why do you think that a generic rewrite of the whole contention logic will improve it here?
If that is explained, maybe other parties will have some insights. I think this is the current tension which prevents agreement on the means proposed.
UPD: I am in the same boat here. Not complaining about “your storage is slow” - I am against this kind of thinking.
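As an illustration of the batching point above: if the storage sustains a fixed number of sync commits per second, grouping several updates into each commit is what raises the update rate. The Go sketch below is a generic batching helper, not Syncthing's code. `commitInBatches` is an illustrative name and the `commit` callback is a hypothetical stand-in for "write this batch in one transaction".

```go
package main

import "fmt"

// commitInBatches groups updates into batches of at most size items and calls
// commit once per batch, so a single sync write covers many updates.
// It returns the number of commits that succeeded.
func commitInBatches(updates []string, size int, commit func([]string) error) (int, error) {
	if size < 1 {
		size = 1
	}
	commits := 0
	for start := 0; start < len(updates); start += size {
		end := start + size
		if end > len(updates) {
			end = len(updates)
		}
		if err := commit(updates[start:end]); err != nil {
			return commits, err
		}
		commits++
	}
	return commits, nil
}

func main() {
	updates := make([]string, 1000)
	commits, _ := commitInBatches(updates, 50, func([]string) error { return nil })
	fmt.Println(commits) // 1000 updates in batches of 50 need 20 commits
}
```

At roughly 20 sync commits per second, those 1000 updates would take about one second instead of about fifty, assuming the commit cost is dominated by the fsync and not by the batch size.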
OK, I hear you, both, so let me be the one to hurt your feelings – your storage is too damn slow. There is really no excuse in 2026 to have the database on mechanical disks when you’re clearly building a high performance system and throwing a lot of hardware on it. Spreading it over multiple disks won’t help at all with sync write IOPS. Get a couple of NVMe:s in there.
Then we should still do all the other optimisations because performance will still suck, as there are plenty of posts and threads to illustrate, but you’re never going to get nice performance with a large database on mechanical disks.
I can confirm that the current implementation is a busy wait. That said this is far from the worst I’ve seen (and the worst is the actual “busy” wait where the client doesn’t even sleep between checks).
It uses short sleeps increasing up to 100ms. Clearly not the best… but that said I didn’t know how much it cost to detect that the current query is still in conflict, because that’s what would determine the CPU cost of using PRAGMA busy_timeout.
For the recommended 5 seconds in the worst case the default callback makes 49 tries so that’s 49 times checking for conflict.
After following the code (the busy_handler is assigned to Pager.xBusyHandler and the only call to the busy_handler is in pager.c), it seems that, ignoring very simple code doing comparisons and binary operations, we end up with only a single system call, in sqlite3OsLock, which is a file-based lock.
So inelegant, not ideal and a waste of CPU cycles, but in the end this is relatively benign. When I said that the CPU use was probably measured in milliseconds, that was almost spot on. On my laptop, a flock system call takes about 2ms (measured with a loop over the “flock” command in the shell, so that is a very, very inefficient test), so in the worst case (of the recommended 5s timeout) one busy wait :
costs ~100ms
adds a mean latency of less than 50ms (depending on how long it must wait).
Now I’ll continue digging through the checkpoint code paths…
I am on NVME-only and I am not complaining about TPS at all. It is reasonable on my side, >1K anyway, and I never face any weeks-long scans. Nothing like that.
My complaint is mostly about IO pressure for nothing. So basically everything currently discussed here is out of my scope in general, except for the few great findings that maintenance is doing the global scans, but I tend to avoid it completely,
and the commits in my situation are perfectly handled by periodic idle time and db.SetConnMaxIdleTime(10 * time.Second)
But I am not sure how mainstream my ideas are here.
I agree with your concerns, there’s clearly a problem and regression that needs to be addressed regardless for the really large setups. Unfortunately, I don’t know how to do it in a good way.
However, there will always be a tradeoff for hardware. Syncthing is intended to work fine on slow hardware, and it does, for reasonable workloads. What won’t work is taking a 99th percentile sized workload and putting it on an I/O system orders of magnitude slower than a random laptop, regardless of how we optimise.
That would be a perfectly reasonable recommendation when building a new system.
But… this system has its roots in a system put in place in 2014… So it isn’t a new build; it is one that worked perfectly well with Syncthing v1 for years.
In the early days it used simple rsync calls, then the folders grew and we needed to use file access notifications to avoid the pain of scanning the whole tree repeatedly which led us to lsyncd (and this is after tuning the kernel to keep almost all dentries in cache). It began to show its limits some years ago and we ended up with Syncthing which did the job.
So I know the hardware can do the job, it did it 2 months ago with v1. The software is the limitation. There’s no reason for me to purchase a whole new system that supports NVMEs in addition to a RAID array (did you see the prices of 8TB NVME ? HDDs aren’t going anywhere) and transfer 20+TB to it which will probably take weeks when I can fix the software in days. Which I just did… So the proof is there for everyone to test.
I also have all the solutions needed on my side, but I am having trouble with “my weather is a lot better here” , and I also miss Go knowledge etc. Exactly like Gyver (my great respect)
(not like the weather is actually better here, it is snowstorm disaster, but anyway)
Awesome, happy to incorporate and merge any fixes!
Regardless, the way Syncthing is designed it will always be limited by the database speed, both in terms of queries and write transactions. (There are also other limitations, of course.) So all other things being equal, putting the database on a couple of SSDs will definitely move things along a lot faster, before or after your fixes.
(And just to be clear, I’m not saying buy a rack of 8TB NVMes, I’m saying you’d probably be well served to have a couple of small SSDs for the OS and databases.)
I’ll look into it. I have to test my idea and the busy_timeout, but I’ve other things to try, off the top of my head :
PRAGMA cache_size that could be helpful but it should be a tunable (on my system I’ve no problem allocating gigabytes to it but on a Raspberry or a phone the default value is probably the most appropriate),
Implementing a log of block hashes linked to recently deleted files, to avoid having to find the ones that aren’t in files in order to clean up blocks and blocklists. This is not trivial but doable and the blocks table is a real pain to clean…
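For the cache_size tunable mentioned in the list above, a minimal Go sketch could look like this. The function name and the "0 means keep the default" convention are assumptions for illustration, not an existing Syncthing option. It relies on the documented SQLite behaviour that a negative cache_size value is interpreted as a budget in KiB instead of a number of pages.

```go
package main

import "fmt"

// cacheSizePragma returns the statement for a configurable page-cache budget.
// kib <= 0 means "keep SQLite's default" and yields an empty string, which is
// probably the right choice on a Raspberry Pi or a phone.
// The negative sign tells SQLite the value is in KiB, not in pages.
func cacheSizePragma(kib int) string {
	if kib <= 0 {
		return ""
	}
	return fmt.Sprintf("PRAGMA cache_size = -%d", kib)
}

func main() {
	fmt.Println(cacheSizePragma(2 * 1024 * 1024)) // a budget of about 2 GiB
	fmt.Printf("%q\n", cacheSizePragma(0))        // empty: leave the default alone
}
```

The returned statement would be executed once per connection, since cache_size is a per-connection setting.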