Pretty drastic amount of writes when doing initial scan


(Markus Stenberg) #1

I am not sure if I am doing something wrong, but for the duration of the initial scan the database was writing to disk … a lot: about 1/10 to 1/20 of the scanned volume. So when I scanned ~800GB of files, my poor SSD got trashed with 40-80GB of writes over a relatively short period of time. And knowing how limited SSD write lifetimes are, I find it slightly disturbing.

Is it a feature or a bug? It seems like a bug to me: the database writes should be within an order of magnitude of the final database size, and committing every small change to disk seems like overkill.
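
For reference, the arithmetic behind those figures can be sketched in Go. The function name and the choice to express the ratio as a divisor are illustrative assumptions; the 1/10 and 1/20 ratios and the ~800GB volume are the ones reported above, and linear scaling is assumed.

```go
package main

import "fmt"

// projectedWritesGB estimates database writes for a scan, assuming writes
// scale linearly with the scanned volume. divisor is the denominator of the
// observed ratio (10 means "1/10 of the scanned volume").
func projectedWritesGB(scannedGB, divisor float64) float64 {
	return scannedGB / divisor
}

func main() {
	const scannedGB = 800 // approximate size of the initial scan
	fmt.Printf("low estimate:  %.0f GB\n", projectedWritesGB(scannedGB, 20))
	fmt.Printf("high estimate: %.0f GB\n", projectedWritesGB(scannedGB, 10))
}
```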


(Audrius Butkevicius) #2

Sadly, that is not how LSM-tree-based databases work. Because your workload during the initial scan is mostly inserts, my money is on write amplification: the database has to merge and rewrite levels again and again as more writes come in.

If you don’t change all of your data, I don’t think you’ll see that after the initial scan.

We do not commit every change to the database, but batch some number of index records. I don’t know how the database underneath decides to flush that to disk.
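
To make the write amplification argument concrete, here is a toy model in Go. It is not how leveldb actually compacts (real leveled compaction moves only part of a level at a time), and every constant in it (four L0 files per compaction, a 10 MiB first level, a fanout of 10) is an assumption chosen for illustration. It only shows why an insert-only workload ends up writing the same data several times, and why a larger write buffer can reduce that.

```go
package main

import "fmt"

// simulateWrites returns the total bytes written to disk by a toy LSM tree
// after inserting totalBytes of new keys through a write buffer of
// bufferBytes (which must be positive). All constants are illustrative
// assumptions, not leveldb's real behaviour.
func simulateWrites(totalBytes, bufferBytes int64) int64 {
	const (
		l0Files = 4        // buffer flushes collected before L0 is compacted
		l1Limit = 10 << 20 // size limit of the first sorted level
		fanout  = 10       // each deeper level may be 10x larger
	)
	var written, l0 int64
	l0Count := 0
	levels := []int64{0} // levels[0] is L1, levels[1] is L2, ...

	for done := int64(0); done < totalBytes; done += bufferBytes {
		n := bufferBytes
		if totalBytes-done < n {
			n = totalBytes - done
		}
		// Flushing the buffer writes the data once, as a new L0 file.
		written += n
		l0 += n
		l0Count++
		if l0Count < l0Files {
			continue
		}
		// Compacting L0 into L1 rewrites both the L0 files and all of L1.
		written += l0 + levels[0]
		levels[0] += l0
		l0, l0Count = 0, 0
		// A level over its limit is merged into the next one, which is
		// rewritten as well.
		limit := int64(l1Limit)
		for i := 0; levels[i] > limit; i++ {
			if i+1 == len(levels) {
				levels = append(levels, 0)
			}
			written += levels[i] + levels[i+1]
			levels[i+1] += levels[i]
			levels[i] = 0
			limit *= fanout
		}
	}
	return written
}

func main() {
	const total = 64 << 20
	for _, buf := range []int64{1 << 20, 4 << 20} {
		w := simulateWrites(total, buf)
		fmt.Printf("buffer %d MiB: wrote %d MiB, amplification %.2fx\n",
			buf>>20, w>>20, float64(w)/float64(total))
	}
}
```

In this model the smaller buffer triggers more L0 compactions for the same amount of data, and each one rewrites whatever already sits in the next level.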


(Markus Stenberg) #3

Yeah, I roughly know how LSM stuff works. But given larger write logs, the write amplification can be much less. It seems leveldb is not doing that well in this case though, as I see syncthing writing 10-50MB/s for a long while, both during the initial scan and then during the ‘sync’ step (during which almost nothing happens, but I guess it reconciles state with what is on the other nodes).

I wonder how much of syncthing CPU and/or I/O is spent on leveldb doing useless things.

( Not sure how many configuration knobs leveldb has, but e.g. https://github.com/dgraph-io/badger has plenty which can be used to ameliorate this issue to some extent )


(Audrius Butkevicius) #4

Perhaps we ask leveldb to flush too often with small batches, so very little gets buffered and the levels churn a lot, making write amplification worse.

I guess it’s a tradeoff for memory usage.

I don’t think moving to badger will win anything for pure write workloads, as the principle of operation is the same (LSM), both have adjustable buffers, and the decision of when to flush them is mostly in our code.


(Markus Stenberg) #5

Yeah, I doubt the implementations differ much. As syncthing in my case already uses 1GB+ of RAM and the leveldb itself is actually smaller than that, I would recommend flushing less often or tuning leveldb to use a longer log before flushing to the LSM tree.


(Audrius Butkevicius) #6

Well, there is no one-size-fits-all setting; that would screw RPis badly.


(Bt90) #7

(Markus Stenberg) #8

Scaling it based on dataset size would probably be reasonable.

I guess scanning my 10(+)TB NAS would involve writing 1TB(+) of data to build a 10GB leveldb, assuming it scales linearly (my current numbers are about 1/10 of that for the 1TB SSDs I am testing with).

(I have had some 128 / 256 GB SSDs turn read-only (and die shortly thereafter), so I am slightly leery of applications that write excessively.)
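
The idea of scaling by dataset size from the top of this post could look roughly like the Go sketch below. Nothing here exists in syncthing: the function name, the floor and ceiling values, and the 1:65536 ratio are all made-up assumptions. The clamp is there so that small devices keep a small buffer while large datasets get a larger one.

```go
package main

import "fmt"

// batchLimitFor returns a hypothetical flush threshold in bytes for a given
// dataset size. All constants are illustrative assumptions.
func batchLimitFor(datasetBytes int64) int64 {
	const (
		minLimit = 1 << 20  // 1 MiB floor, to stay friendly to small devices
		maxLimit = 64 << 20 // 64 MiB ceiling, to bound memory usage
		divisor  = 1 << 16  // one byte of batch per 64 KiB of data
	)
	limit := datasetBytes / divisor
	if limit < minLimit {
		return minLimit
	}
	if limit > maxLimit {
		return maxLimit
	}
	return limit
}

func main() {
	for _, size := range []int64{1 << 30, 1 << 40, 10 << 40} {
		fmt.Printf("dataset %d GiB -> batch limit %d MiB\n",
			size>>30, batchLimitFor(size)>>20)
	}
}
```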


(Simon) #9

This problem has most likely been made worse by https://github.com/syncthing/syncthing/pull/5441 (i.e. by me). Before that, flushes happened every 64 files and every 1000 blockmap entries (independently). Now a flush happens every 64 files or blockmap entries, but at most once per file, as both are written in the same batch. So essentially a flush now happens for every file with more than 64 blocks (i.e. bigger than 8MB). I just saw that it’s possible to check the raw byte length of a leveldb.Batch, so deciding based on that seems like the right way to go. I opened an issue about this (not about scaling it based on something; that could be a further step): https://github.com/syncthing/syncthing/issues/5531
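
A rough Go sketch of the difference between the two policies follows. It does not use syncthing's or goleveldb's real types: the `file` struct, the per-record byte sizes, and the 1 MiB threshold are assumptions for illustration only. In real code the byte length would presumably come from the batch itself (for example the length of what goleveldb's `Batch.Dump()` returns, if that is the right call) rather than from a running estimate.

```go
package main

import "fmt"

// file is a hypothetical stand-in for a scanned file: it contributes one
// index record plus one blockmap entry per block to the batch.
type file struct{ blocks int }

const (
	fileRecordBytes = 200 // assumed size of one index record
	blockEntryBytes = 40  // assumed size of one blockmap entry
)

// flushesByEntries counts flushes when the batch is flushed once it holds at
// least maxEntries entries, checked once per file.
func flushesByEntries(files []file, maxEntries int) int {
	flushes, entries := 0, 0
	for _, f := range files {
		entries += 1 + f.blocks
		if entries >= maxEntries {
			flushes++
			entries = 0
		}
	}
	if entries > 0 {
		flushes++ // final flush of the partially filled batch
	}
	return flushes
}

// flushesByBytes counts flushes when the decision is based on the
// accumulated byte size of the batch instead.
func flushesByBytes(files []file, maxBytes int) int {
	flushes, size := 0, 0
	for _, f := range files {
		size += fileRecordBytes + f.blocks*blockEntryBytes
		if size >= maxBytes {
			flushes++
			size = 0
		}
	}
	if size > 0 {
		flushes++ // final flush of the partially filled batch
	}
	return flushes
}

func main() {
	// 100 files of 100 blocks each, i.e. every file is above 64 blocks.
	files := make([]file, 100)
	for i := range files {
		files[i] = file{blocks: 100}
	}
	fmt.Println("flushes by entry count:", flushesByEntries(files, 64))
	fmt.Println("flushes by byte size:  ", flushesByBytes(files, 1<<20))
}
```

With files above the 64-block mark, the entry-count policy flushes once per file, while the byte-based policy lets many such files share one batch.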

