ChatGPT's take on our schema

I asked ChatGPT:

Using these settings: “journal_mode = OFF”, “default_temp_store = MEMORY”, “temp_store = MEMORY”, “foreign_keys = 0”, “synchronous = 0”, “locking_mode = EXCLUSIVE”, we are migrating data from leveldb into sqlite using the following schema, followed by the DELETEs/INSERTs/SELECTs/UPDATEs used by syncthing. Users with large datasets are complaining the process is very slow. What can be done to speed things up?

Included after that question is our DDL in ./internal/db/sqlite/sql/schema/ and most of the SQL statements in our code. It suggested:

Top wins (in order)

  1. Load into a “quiet” schema, then rebuild derived data
  • Temporarily drop the triggers that maintain indexids and counts, and drop secondary indexes on the hot tables (keep only the primary keys).

  • Bulk-insert everything.

  • Recompute indexids and counts set-wise in one or two statements.

  • Recreate the indexes and triggers at the end.

This alone is commonly a 5–50× speedup on big datasets.
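The "quiet load" pattern can be sketched as below. This is an illustration in Python (its stdlib bundles SQLite, so it runs as-is), not our actual schema; the table and index names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (name TEXT PRIMARY KEY, size INTEGER);
    CREATE INDEX idx_files_size ON files(size);
""")

rows = [(f"file{i}", i) for i in range(10_000)]

# 1. Drop secondary indexes (keep only the primary key) before the bulk load.
conn.execute("DROP INDEX idx_files_size")

# 2. Bulk-insert everything inside one transaction.
with conn:
    conn.executemany("INSERT INTO files (name, size) VALUES (?, ?)", rows)

# 3. Recreate the indexes (and any triggers) once, at the end.
conn.execute("CREATE INDEX idx_files_size ON files(size)")
conn.execute("ANALYZE")

print(conn.execute("SELECT COUNT(*) FROM files").fetchone()[0])  # 10000
```

The win comes from maintaining each index once over the final data instead of incrementally on every row.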

  2. Replace OR REPLACE with UPSERT
  • REPLACE does DELETE + INSERT, firing AFTER DELETE/INSERT triggers and touching every index twice.

  • Use INSERT … ON CONFLICT (…) DO UPDATE … so existing rows are updated in-place.

  3. Remove AUTOINCREMENT where you don’t truly need it
  • INTEGER PRIMARY KEY already auto-increments; AUTOINCREMENT forces extra work and contention on sqlite_sequence.

  • For files(sequence … PRIMARY KEY AUTOINCREMENT), switch to INTEGER PRIMARY KEY unless you must forbid rowid reuse across the entire lifetime.

  4. Batch and pipeline
  • One big transaction (or a handful of large chunks), prepared statements, and avoid per-row RETURNING round-trips by chaining the RETURNING into the subsequent insert (example below).

  5. Right-size the engine for bulk load
  • Before creating any tables in a new DB:
    PRAGMA page_size=32768; (or 65536 if your build supports it)
    PRAGMA cache_size=-1048576; (≈1 GB cache; tune to free RAM)
    PRAGMA mmap_size=2147483648; (2 GiB, or more if available)
    You already have temp_store=MEMORY, synchronous=OFF, locking_mode=EXCLUSIVE.

  • After the load: ANALYZE; PRAGMA optimize; and flip to your normal runtime pragmas.

  6. Add targeted partial indexes for your read queries
  • Many of your selects filter on flag bits; a few small partial indexes will avoid scanning the large general ones.
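To make the upsert point concrete, here is a minimal runnable sketch (Python, since its stdlib bundles SQLite; the table, trigger, and column names are hypothetical). Note that REPLACE's implicit delete only fires delete triggers when recursive triggers are enabled, which is why the pragma appears below:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Without this pragma, the implicit DELETE done by INSERT OR REPLACE
# does not fire delete triggers at all (SQLite default behaviour).
conn.execute("PRAGMA recursive_triggers = ON")
conn.executescript("""
    CREATE TABLE files (name TEXT PRIMARY KEY, size INTEGER);
    CREATE TABLE log (event TEXT);
    CREATE TRIGGER files_del AFTER DELETE ON files
        BEGIN INSERT INTO log VALUES ('delete'); END;
""")
conn.execute("INSERT INTO files VALUES ('a', 1)")

# INSERT OR REPLACE deletes the conflicting row first: the delete trigger fires.
conn.execute("INSERT OR REPLACE INTO files VALUES ('a', 2)")
print(conn.execute("SELECT COUNT(*) FROM log").fetchone()[0])  # 1

# An upsert updates the row in place: no delete, no delete trigger,
# and each index is touched once instead of twice.
conn.execute("""
    INSERT INTO files VALUES ('a', 3)
    ON CONFLICT (name) DO UPDATE SET size = excluded.size
""")
print(conn.execute("SELECT COUNT(*) FROM log").fetchone()[0])  # still 1
```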

Thoughts? I know ChatGPT makes mistakes, and I’m not conversant enough with our schema, or sqlite in general, to form an informed opinion on its feedback.

1 Like

Frustration, mostly. But why don’t you see how, and to what extent, the suggestions apply to us, and what improvement they would yield?

1 Like

Just a side note, but I always wonder how many “programmers” just follow AI suggestions blindly and apply them to their code without any additional consideration or even testing :shaking_face:.

4 Likes

The hallucination machine managed to cough up some interesting bits. My two cents:

  • the whole “cleanroom” import is nice for the migration phase but doesn’t work afterwards.
  • upserts → good one. That should avoid quite a few index updates. Updates are also a lot faster than DELETE+INSERT.
  • autoincrement → good one. Explained in the docs: SQLite Autoincrement
  • do we even use the RETURNING clause?
  • page_size → benchmark worthy; but we can’t change that for existing v2 deployments :confused:
  • we’re already using a large cache and going bigger only ends up with memory issues on devices with limited RAM
  • mmap comes with some caveats AFAIK. This might be troublesome in OOM scenarios
  • I think we’re already running ANALYZE and PRAGMA optimize?
  • the partial indexes came already up during the initial PR review
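On the partial-index point, a quick way to check whether SQLite would actually use one for a given query is EXPLAIN QUERY PLAN. A small sketch with hypothetical table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, deleted INTEGER, size INTEGER)")

# A small partial index covering only the rows a flag-filtering query cares about.
conn.execute("CREATE INDEX idx_deleted ON files(name) WHERE deleted = 1")

# The query's WHERE clause must imply the index's WHERE clause for the
# planner to consider the partial index; the plan shows whether it did.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT size FROM files WHERE deleted = 1 AND name = ?
""", ("x",)).fetchall()
print(plan[0][-1])  # e.g. SEARCH files USING INDEX idx_deleted (name=?)
```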
3 Likes

LLMs are useful if you use them as inspiration. Using them for anything else is a recipe for disaster.

4 Likes

I maintain that we use autoincrement where we need it and nowhere else.

Not really, I believe. We use it in a couple of places where performance absolutely won’t matter, and in the one place that could matter (updating file entries) it’s correct imho, because we need a new sequence entry (a required use of AUTOINCREMENT…). The mtime update could in fact be rewritten as an update, but there are no triggers involved there and it only kicks in under specific circumstances.

Yeah. This one seemed the most promising to me, but it benchmarks no difference on at least my computer. :person_shrugging:

2 Likes

From the docs:

If the AUTOINCREMENT keyword appears after INTEGER PRIMARY KEY, that changes the automatic ROWID assignment algorithm to prevent the reuse of ROWIDs over the lifetime of the database. In other words, the purpose of AUTOINCREMENT is to prevent the reuse of ROWIDs from previously deleted rows.

Do we really need to prevent reuse?

Yes, precisely that is an essential property. Sequence numbers must be strictly monotonically increasing.

(It is not as essential for folder and device IDs, but it makes it nicer for humans and there are typically <10 inserts in those tables over the lifetime of a Syncthing install…)
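The reuse behaviour the docs describe is easy to demonstrate. A minimal sketch (Python with stdlib SQLite; throwaway table names), inserting three rows, deleting the highest, and inserting again:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE plain (id INTEGER PRIMARY KEY, v TEXT);
    CREATE TABLE seq   (id INTEGER PRIMARY KEY AUTOINCREMENT, v TEXT);
""")

for table in ("plain", "seq"):
    conn.executemany(f"INSERT INTO {table} (v) VALUES (?)", [("a",), ("b",), ("c",)])
    conn.execute(f"DELETE FROM {table} WHERE id = 3")
    conn.execute(f"INSERT INTO {table} (v) VALUES ('d')")
    print(table, conn.execute(f"SELECT max(id) FROM {table}").fetchone()[0])

# plain 3 -> without AUTOINCREMENT the deleted max rowid is handed out again
# seq   4 -> AUTOINCREMENT consults sqlite_sequence and never reuses it
```

So without AUTOINCREMENT, deleting the newest file entry and then inserting another could hand out the same sequence number twice, breaking strict monotonicity.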

2 Likes

My apologies for any frustration caused. I wasn’t suggesting there were issues with our schema or code. I asked to elicit a discussion, to gain a better understanding of our schema and design decisions, and maybe squeeze out a smidge more performance, if possible.

But based on everyone’s feedback, it seems the only real takeaway is we can avoid extra roundtrips using WITH. For example:

In folderdb_update.go, this:

```go
insertFileStmt, err := txp.Preparex(`
	INSERT OR REPLACE INTO files (...)
	VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
	RETURNING sequence
`)
if err != nil {
	return wrap(err, "prepare insert file")
}
insertFileInfoStmt, err := txp.Preparex(`
	INSERT INTO fileinfos (sequence, fiprotobuf)
	VALUES (?, ?)
`)
```

could become:

```sql
WITH ins AS (
	INSERT OR REPLACE INTO files (...)
	VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
	RETURNING sequence
)
INSERT INTO fileinfos (sequence, fiprotobuf)
SELECT sequence, ? FROM ins;
```

I will submit a PR, if that’s of interest. That appears to be the only place we use RETURNING.

3 Likes