Database block sharding, increased performance for large clusters

I hacked up a thing this weekend for sharding the blocks database for large setups. Basically, when the number of blocks in the folder database exceeds a threshold, two new databases are created and blocks are written to those instead. When they reach the same size threshold, four new databases are created, and so on, doubling at each level.
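Schematically, the routing works roughly like this (a simplified sketch, not the actual code from the PR; the 5M threshold, the names, and the hash-routing detail are just for illustration): the level to write to follows from the total block count, and the shard within a level from the block hash, so writes and later lookups agree on which database to touch.

// Simplified sketch of the level/shard routing idea; names and the
// hash-routing detail are illustrative, not the actual PR code.
package main

import (
	"encoding/binary"
	"fmt"
)

const threshold = 5_000_000 // blocks per shard before the next level is added

// levelFor returns the level to write to, given the total number of blocks
// stored so far. Level 0 is the main folder database; level n has 2^n shard
// databases. A new level starts once every shard at the current level holds
// ~threshold blocks, i.e. at threshold*(2^(n+1)-1) total blocks.
func levelFor(totalBlocks int64) int {
	level := 0
	for totalBlocks >= threshold*((int64(1)<<(level+1))-1) {
		level++
	}
	return level
}

// shardFor picks the shard within a level for a given block hash, so that
// writes and later lookups land on the same single database per level.
func shardFor(hash []byte, level int) int {
	n := uint64(1) << level // number of shards at this level
	return int(binary.BigEndian.Uint64(hash[:8]) % n)
}

func main() {
	hash := make([]byte, 32) // a SHA-256 block hash in practice
	hash[7] = 42
	for _, total := range []int64{1e6, 6e6, 20e6, 40e6} {
		lvl := levelFor(total)
		fmt.Printf("total=%d -> write level %d, shard %d of %d\n",
			total, lvl, shardFor(hash, lvl), 1<<lvl)
	}
}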

The precise threshold at which it’s optimal to switch up to the next level depends on system performance. I’ve run benchmarks on my middle-of-the-road Intel server with NVMe storage and my Raspberry Pi 5 with MMC storage and, unsurprisingly, they are quite different. Nevertheless, they both get significantly improved performance from sharding.

Here are a number of benchmark runs on my Intel server. The benchmark simulates inserting large files (1234 blocks each) in batches of 250 until we reach 200 million blocks or a maximum time of two hours. 200M blocks is about 24 TiB of data assuming the smallest block size.
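(For reference: 200 million blocks × 128 KiB, the minimum block size, comes out to roughly 23.8 TiB.)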

The test does not reach 200M blocks with the current code, stopping at 150M blocks after two hours. With sharding at 5M, 10M and 20M blocks we see increased performance. In this case, 20M blocks per database turns out to be best, due to the increased overhead of doing inserts into 64, 128, etc. databases. The code does concurrent inserts/commits to several databases, up to the limit of the number of CPU cores in the system. This means we get nice gains when writing to 2, 4, 8 databases, but when we need to write blocks to 128 databases we’re going to be suffering a bit from multiple commit/checkpoint cycles per batch.
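The batching is conceptually something like this (a minimal sketch, not the PR code; shardDB, writeBatch and friends are made-up names):

// Minimal sketch of the concurrency described above, not the actual PR code:
// fan a batch out per shard database, but cap the number of concurrent
// insert+commit cycles at the number of CPU cores.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

type shardDB struct{ id int } // stand-in for a per-shard SQLite database

func (s *shardDB) insertAndCommit(blocks []string) error {
	// In reality: begin a transaction, insert this shard's blocks, commit
	// (and eventually pay for a WAL checkpoint). Here we just pretend.
	fmt.Printf("shard %d: committed %d blocks\n", s.id, len(blocks))
	return nil
}

// writeBatch routes each block to a shard and commits the shards
// concurrently, at most NumCPU at a time. With more shards than cores, a
// single batch needs several sequential commit rounds.
func writeBatch(shards []*shardDB, blocks []string) {
	perShard := make([][]string, len(shards))
	for i, b := range blocks {
		idx := i % len(shards) // by block hash in reality
		perShard[idx] = append(perShard[idx], b)
	}

	sem := make(chan struct{}, runtime.NumCPU())
	var wg sync.WaitGroup
	for i, s := range shards {
		if len(perShard[i]) == 0 {
			continue
		}
		wg.Add(1)
		sem <- struct{}{} // acquire a concurrency slot
		go func(s *shardDB, blocks []string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			_ = s.insertAndCommit(blocks)
		}(s, perShard[i])
	}
	wg.Wait()
}

func main() {
	shards := make([]*shardDB, 16)
	for i := range shards {
		shards[i] = &shardDB{id: i}
	}
	writeBatch(shards, make([]string, 1234)) // one "file" worth of blocks
}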

Looking at write performance as a function of the number of blocks in the database, we can see the spikes when new databases are brought into use:

The Raspberry has no chance of reaching 200M blocks in a reasonable time, so the benchmark tries for 20M blocks instead:

Here the smallest sharding threshold at 1M blocks turns out to be most efficient, indicating that this will need to be a user-tunable thing.

That all said, if you look at the topmost graph again you’ll see I highlighted a span with a red rectangle in the lower left. That is where 95% of users reside according to usage reporting.

There’s some code at wip: use sharding for block database by calmh · Pull Request #10454 · syncthing/syncthing · GitHub if anyone wants to play with this.

10 Likes

This is cooler than heck sir. Thank you.

I’ve been looking at https://mon.syncthing.net and I’m not immediately finding stats for how many blocks are in databases. My goal was to find a rough estimate of when folks need to be thinking about this. My personal Syncthing cluster is not going to be that big :-).

I’m guessing this would be optional if it were to be incorporated. Most of my nodes sit on spinning rust, so splitting the db into segments could lead to fragmentation pretty quickly.

I still think the best option for people like me is to simply have zero hashing / ignore the file in the db for files of, say, 50 MB and under, and rely solely on modified attributes: when a file has changed, download the entire file. It may sound counter-intuitive, but connection speeds are fast enough that small files are downloaded very quickly. The vast bulk of files are never modified once received, yet they are scanned for any changes.

Hashing, to me, only makes sense on really big files

I’d be surprised if the filename used had much to do with actual on-disk fragmentation. If anything, you may be more likely to have multiple contiguous database chunks rather than one massive one spread out everywhere.

In any case, I’m leaning towards a default of 10M, so if you have fewer blocks you’d never see a difference. If you felt it was a problem, you could increase the threshold towards infinity.

Hashing enables block reuse and reduces transfers, yes, but it’s also non-negotiably essential for data correctness. That said, one could imagine simply not storing the blocks themselves for lookup purposes. There are in fact scenarios where it could make sense :thinking:

1 Like

I’m still trying to wrap my head around why more/smaller shards have the opposite effect on your two test setups.

I think it’s a question of a quicker performance drop with more shards vs getting concurrency early. Looking at the insertion rate vs block count for the 5/10/20M cases on the faster computer, individually since it gets a bit messy otherwise:

we can see write perf dropping fast down to the baseline in 10M vs 20M and again in 5M vs 10M. I think this is because when we hit 16 databases (fourth spike) we exceed the number of CPU cores and the concurrency in the code, and end up waiting first for insert and commit into 8 databases and then insert and commit into the next 8 databases, which is a lot slower than what happens for the earlier levels.
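To put rough numbers on it, taking eight cores as an example: with up to 8 active shards a batch is one round of concurrent insert+commit, with 16 shards it becomes two sequential rounds, and with 128 shards sixteen rounds, i.e. that many more commit/checkpoint cycles per batch.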

(The 5M run is also quite noisy; could be the computer was doing something else as well at the time, this was an overnight run…)

In the Raspberry case we never get to that point but I/O is really slow to begin with, so whichever case starts using more than one database first wins.

Maybe we should just cap the number of shards to the number of CPU cores or something…

2 Likes

BTW it’s easy to run yourself on your own computer. Check out my branch above, then,

% cd internal/db/sqlite
% SHARDING_THRESHOLD=1000000 go test -run TestBenchmarkLocal -timeout 2h \
    -tags slow | tee ~/results.csv
TIME,FILES,BLOCKS,FILES/S,BLOCKS/S,SHARDS
1.58,250,308500,161.5,199248.9,1
3.77,500,617000,115.8,142923.9,1
...

I use gnuplot for the graphs, but it produces a CSV that can be massaged in Excel or whatever.

1 Like

Nice - thanks for the new/additional design improvements, data and code/benchmarks to play with. I definitely want to pick this up and play with it to report some data from other systems. I did quite a bit of benchmarking a while ago, but never got either consistent/reliable results or found changes with a big enough impact that it mattered.

One thing that had a big impact was checkpointing of the WAL, and I wonder if this is also influencing the results here a lot: the write performance suddenly jumps up and then decays gradually, and that (or any other) pattern isn’t consistently happening every time sharding happens (which is at 2**n * threshold, right?). It even looks like the peaks aren’t aligned with sharding thresholds at all, e.g. the latest 5M graph seems to have peaks at roughly 5, 15, 35, 75, 160. I need to check the code though to see how it really works in practice; I am probably missing or misunderstanding a bunch of relevant aspects here. Regardless, I’d like to add some visibility into checkpointing when running the test myself.

Also, just having data from my machine will be nice, because it’s going to be a very different system again to yours. And maybe some data with read workloads, just to see that it doesn’t degrade (though it really shouldn’t). Last time I did benchmarks with DB in RAM and on lvm-mirrored ext4 on relatively old SSDs (also tried btrfs mirror on spinning rust, but that was too slow to even give any meaningful numbers overnight, also not really a relevant scenario xD). Hopefully on the weekend I’ll find time for this.

Given there are early benefits with low-threshold sharding on both systems, and then diminishing returns with more shards and eventually more overhead than benefit, a cap on shards seems like the right setup/default here, together with a low threshold. That should give performance improvements in all scenarios tested so far (permutations of slow/fast IO and low/high block numbers).

I am not seeing that? The effects look the same to me, we are just looking at different regimes on the two setups: the slow Raspberry setup is basically showing the same behaviour as the initial part of the fast Intel server setup. Initially the low threshold is better, as it has sharding while the higher thresholds don’t, in both setups. On the Intel server graph it can scale enough to eventually see the overhead of shards pushing the lower-threshold/higher-shard-count performance below the others, while the Raspberry never reaches those levels (with some biased interpretation one can already see signs of 1M getting slower and reaching 2M levels there too).
Anyway, to me the results look aligned between the two systems, or I am not looking at the right thing - pointers appreciated :slight_smile:

1*5, (1+2)*5, (1+2+4)*5, etc
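That is, with the 5M threshold a new level starts each time the shards at the current level fill up to ~5M blocks each, so at 5M, 15M, 35M, 75M, 155M, … total blocks - which lines up with the peak positions you listed.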

1 Like

Ah yes, practicalities. Do I get it right that whenever the limit is reached, new shards are created and only those are written to, while the old ones become basically read-only?

Yes, exactly. The earlier-level shards will over time reduce in size (as blocks are removed & garbage collected) but always remain read-only. In principle one could imagine a process to repack data from older shards into the latest level, but I think it would be a possible future nice-to-have rather than something essential.

Similarly, there would be no migration associated with this. It’s all about insertion performance, so if you happen to have an existing database with 100M blocks in the main folder db, there is no advantage in moving them out to shards. However, the next blocks to be written would cause the sharding level to be bumped.

In addition, it’s possible to downgrade away from this without too serious a problem. The shards will not be recognized (in fact they’ll be removed at startup by the database cleanup) and the blocks in them will be “gone”, but the only effect is that we won’t find them when looking for blocks to reuse, so sync will be slightly more bandwidth-intensive until the block database is rebuilt.

1 Like

I still have to take a deeper look at the sharding logic, but how does this affect read performance? Do we know the shard to query or do we query multiple (active + read-only from previous levels)?

Multiple, so one read per shard level.

I guess that’s the downside of dynamic sharding instead of using static partitions. Assuming that the keys are somewhat uniformly distributed, we’d benefit from full parallel reads and writes with the latter.

So, to be clear, we don’t need to look in all the shards, just one per level. That’s the price of not “getting rid of” the intermediate levels when we switch to a further level of sharding. (The advantage is that we can immediately start writing to n number of empty databases, instead of having to wait for an expensive resharding operation when writing, which would also fill all the shards to 50%.)
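Schematically the read path is something like this (a simplified sketch assuming the same hash routing for reads as for writes; names are illustrative, not the actual code):

// Simplified sketch of the read path: exactly one candidate shard per level,
// picked by the block hash, rather than a scan over every shard database.
// Names are illustrative, not the actual PR code.
package main

import (
	"encoding/binary"
	"fmt"
)

type shardDB struct {
	blocks map[string]bool // stand-in for a real SQLite shard database
}

func (s *shardDB) hasBlock(hash []byte) bool { return s.blocks[string(hash)] }

// shardFor picks the single candidate shard within a level; level n has 2^n shards.
func shardFor(hash []byte, level int) int {
	n := uint64(1) << level
	return int(binary.BigEndian.Uint64(hash[:8]) % n)
}

// lookupBlock checks one shard per level, newest level first.
func lookupBlock(levels [][]*shardDB, hash []byte) (*shardDB, bool) {
	for level := len(levels) - 1; level >= 0; level-- {
		if s := levels[level][shardFor(hash, level)]; s.hasBlock(hash) {
			return s, true
		}
	}
	return nil, false
}

func main() {
	hash := make([]byte, 32)
	hash[7] = 3
	// Level 0 has one database, level 1 has two; put the block at level 1.
	levels := [][]*shardDB{
		{{blocks: map[string]bool{}}},
		{{blocks: map[string]bool{}}, {blocks: map[string]bool{}}},
	}
	levels[1][shardFor(hash, 1)].blocks[string(hash)] = true
	_, found := lookupBlock(levels, hash)
	fmt.Println("found:", found)
}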

1 Like

Ah, thanks. I was just mentally comparing it to something like PostgreSQL’s declarative partitioning. But the drawback would of course be that we always need to roll with a fixed size of X partitions. That would add some overhead for small installations.

Feels like we’ve reinvented lsm trees (leveldb)?

There is no relational aspect to blocks, presumably (given we can shard across dbs), so perhaps relational storage is not optimal here?

I understand that we dislike Go’s leveldb implementation given issues we’ve seen, but given the “no cgo” ship has sailed, perhaps use the real thing written in C, or something more suitable for the purpose that doesn’t need the overhead of a relational db? Badger, RocksDB, Pebble, etc.

1 Like

Most of these are not usable on 32-bit ARM or other platforms we’re currently supporting, AFAIR.

Having a reliable database backend also trumps most of the performance benefits offered by the listed alternatives.

This is also more of a sqlite scalability thing and not something inherently limited by the fact that it’s a relational model. Table partitioning is basic stuff for any DBA worth his salt and not some arcane trickery :person_shrugging:

Meh, spent time on the android situation instead of this. I do by now have two 2h test runs (default and very high threshold/disabled); nothing surprising in the data at first glance. I have a hypothesis that the dynamic sharding isn’t necessary, or rather that the benefits for insert performance of concurrent writes and of sharding are independent. Though obviously the two aren’t independent - no concurrent writes without shards. Anyway, to check the hypothesis I want to analyze the performance in the different “sharding regimes”, basically just fit them to get comparable numbers for performance compared to shard size and concurrency. I’ll get there eventually, hopefully soon.

And yeah, I was thinking about lsm trees too when I was looking into sqlite with random primary keys/indexes. Didn’t feel like uttering that heresy :slight_smile:
More seriously imo the main reason to avoid it is just that a second DB is another significant dependency to juggle/keep up with.

The other solution could be a different database, but an external one, e.g. PostgreSQL.

There are a lot of projects out there using sqlite as default while allowing powerusers to switch to Postgres/MySQL.