Why 2000 blocks per file?

Hi, I’m working on a prototype for a BEP-compatible client (still very much a work in progress: GitHub - mseravalli/grizol) and I would like to learn more about the reasons why Syncthing wants to split every file into ~2000 blocks (syncthing/lib/protocol/protocol.go at aea7fa5f22924fed75571cd07564f9e057c7f64f · syncthing/syncthing · GitHub; it’s also in the BEP specification).

I have run some synchronization tests with ~40 GB of music with my client: with the default 2000 blocks per file, Syncthing was using around ~1.8 GB of memory and the transfer rates were around 5-10MB/s; when I tried using a custom Syncthing client with 20 blocks per file, the Syncthing memory usage dropped to ~300MB and the transfer rate increased to 50-100MB/s.

The initial transfer rate might also be limited by the early implementation of my prototype, but the ~5x memory reduction on the Syncthing side seems to be a fairly direct effect of the reduced number of blocks per file.

The main question is: why was 2000 blocks per file selected? Was this based on specific data, and if so, is that data available? Or was it based on the intuition that some blocks in a file (maybe the header?) might be the same for different files and thus be reused? Is this something that could be parameterized and changed by advanced users? I might be able to carve out some time to work on this in the future, but I would first like to see if this is something interesting for the community.

@calmh is probably the only one who knows all the details, but as a starter:

Initially Syncthing only supported a single fixed block size of 128 KiB. This was changed in 2018 to support variably sized blocks, which is where the “2000 blocks per file” target was introduced. The original issue has some discussion as to why excessively large block sizes were not the goal (note: the wiki referenced in the issue no longer exists, but here is an archived version).
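
For reference, the rule boils down to picking the smallest power-of-two block size between 128 KiB and 16 MiB that keeps the file at roughly 2000 blocks or fewer. A minimal sketch of that rule (my own constant and function names, not the actual protocol.BlockSize implementation):

```go
package main

import "fmt"

const (
	kib = 1 << 10
	mib = 1 << 20

	minBlockSize   = 128 * kib // smallest block size
	maxBlockSize   = 16 * mib  // largest block size
	desiredPerFile = 2000      // desired number of blocks per file
)

// blockSizeFor picks the smallest power-of-two block size in the allowed
// range that splits the file into at most ~2000 blocks.
func blockSizeFor(fileSize int64) int64 {
	var bs int64 = minBlockSize
	for bs < maxBlockSize && fileSize/bs > desiredPerFile {
		bs *= 2
	}
	return bs
}

func main() {
	for _, size := range []int64{4 * mib, 300 * mib, 4 << 30, 40 << 30} {
		bs := blockSizeFor(size)
		fmt.Printf("%5d MiB file -> %5d KiB blocks, ~%d blocks\n",
			size/mib, bs/kib, size/bs)
	}
}
```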

Note that changing the block size of already scanned/synced files is undesirable, as it causes the block hashes to change and can cause effects such as not reusing existing data, or cause false conflicts in case of changes with mismatching block sizes. You always want all sides to agree on a matching block size.

The problem with gigantic block sizes will always be that you can’t effectively reuse local data - any change will cause huge retransfers, and you have more problems downloading data from multiple devices at once.

Thanks for the quick answer and sorry in advance for all the questions but I think this is an interesting space and maybe there is some optimization possible.

changing the block size of already scanned/synced files is undesirable

Agree; I was thinking more of new installations, where this could be set before the first scan is performed, or of advanced users with specific use cases.

You always want all sides to agree on a matching block size.

Would it be possible to expand on the problem? I thought this information is in the FileInfo message (Block Exchange Protocol v1 — Syncthing documentation), so that once a client receives a non-standard block size for a file it would use that afterwards. I agree this is not specified in the protocol, so it’s probably left to the implementor to decide what to do; does Syncthing ignore this field for further rescans?

gigantic block sizes

I’m not necessarily arguing about gigantic block sizes; I think 16 MB is a reasonable upper bound, also given general network constraints. However, I’m not sure about the gains of always splitting a file into 2000 blocks; more on this in the next point.

The problem with gigantic block sizes will always be that you can’t effectively reuse local data - any change will cause huge retransfers

I’m actually interested in exploring this point more. When a file is changed, I’d assume it’s likely that all the blocks after the change will need to be re-hashed and re-transmitted, since it seems unlikely that a change only affects the current block and not the following ones: e.g. if I add a comma in the middle of a text file, this shifts all the subsequent text, so every block from the one containing the change to the end of the file will be different. In such a case, with plain text files, if we assume that on average 50% of blocks need to be re-transmitted for every change (due to insert/remove edits at a random place in the file), would a much larger block (10x or even 100x larger) make a sizable difference in retransfer? Are other formats less susceptible to this problem? Is the assumption about the kind of change valid, or are people usually changing something in place like a date, e.g. '2020-02-20' -> '2020-02-21', so that only a single block is affected? Is there any data available so that this problem can be investigated further?

you have more problems downloading data from multiple devices at once.

I think this is a fair point; is there any data available to explore this further? I think there is a gain here only if there is a large number (>100) of clients sharing the same block, but if a block is shared only among a small number of clients (e.g. 10), does a smaller block actually make a positive difference? E.g. if I have a file available on 10 instances and the file is split into 2000 blocks, I still need to issue 200 requests to each of the 10 instances, which incurs a higher overhead, versus 2 requests per instance with larger blocks. Is my understanding correct?

The per-request overhead isn’t particularly large. However, the block size becomes the minimum overhead for any smaller change, i.e., change one byte and the smallest amount that will be transferred is one block.

And there is also the whole rolling hash business, which does allow you to change a block in the middle, download that one block, detect that all other blocks shifted down, and avoid redownloading them.
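
To illustrate the idea (not Syncthing’s actual weak-hash code, which I believe lives in lib/weakhash and uses a different hash): compute weak hashes of the blocks you already have, slide a window over the changed file, and blocks whose content merely shifted still match and don’t need to be downloaded. A toy Rabin-Karp-style sketch:

```go
package main

import "fmt"

const (
	hashBase = 257       // polynomial base for the toy rolling hash
	hashMod  = 1_000_003 // modulus keeping intermediate values small
)

// weakHash computes a Rabin-Karp style hash of one block.
func weakHash(block []byte) int {
	h := 0
	for _, b := range block {
		h = (h*hashBase + int(b)) % hashMod
	}
	return h
}

// findShiftedBlocks slides a window of blockSize over data, rolling the weak
// hash one byte at a time, and reports the offsets whose hash matches one of
// the blocks we already have locally. A real implementation would confirm
// each candidate with the strong (SHA-256) hash before reusing the data.
func findShiftedBlocks(data []byte, blockSize int, known map[int]bool) []int {
	if len(data) < blockSize {
		return nil
	}
	// pow = hashBase^(blockSize-1) mod hashMod, used to drop the outgoing byte
	pow := 1
	for i := 0; i < blockSize-1; i++ {
		pow = pow * hashBase % hashMod
	}
	var offsets []int
	h := weakHash(data[:blockSize])
	if known[h] {
		offsets = append(offsets, 0)
	}
	for i := blockSize; i < len(data); i++ {
		out, in := int(data[i-blockSize]), int(data[i])
		h = ((h-out*pow%hashMod+hashMod)%hashMod*hashBase + in) % hashMod
		if known[h] {
			offsets = append(offsets, i-blockSize+1)
		}
	}
	return offsets
}

func main() {
	const blockSize = 8
	oldFile := []byte("AAAAAAAABBBBBBBBCCCCCCCC") // three 8-byte blocks
	newFile := append([]byte("x"), oldFile...)    // one byte inserted at the front

	known := map[int]bool{}
	for i := 0; i+blockSize <= len(oldFile); i += blockSize {
		known[weakHash(oldFile[i:i+blockSize])] = true
	}
	// All three old blocks are still present in the new file, just shifted
	// by one byte, so none of them would need to be downloaded again.
	fmt.Println(findShiftedBlocks(newFile, blockSize, known)) // [1 9 17]
}
```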

We have stats on data.syncthing.net, but I’m not certain they accurately reflect things: I think that stat was implemented before variable block sizes, so it might not take them into account and be wrong. Hence, how well this works in practice is hard to say.

But I agree with the sentiment that a 300 MB block size means you probably need multiples of that in RAM to receive it, and if you are running on some silly RPi Zero, that might become a problem.

I think a 16 MB block should be enough to saturate the link.

Why many smaller blocks have such a large RAM overhead, I can’t really say; there is probably something inefficient going on, or it might just be more GC churn that the GC doesn’t bother to collect, so it needs more investigation to answer.

Also note that you can control “max requests in flight”, which might be your limiting factor; it was put in place precisely because RPis were falling over and running out of memory when pulling 10 x 16 MB blocks.

Is there a minimum block size and a maximum block size? I mean, if you have a small file I presume the block size isn’t reduced to 3 bytes; that would be silly. What’s the smallest block size? I presume some small files are a single block. Also, what’s the largest block size? For a multi-gigabyte file it may make sense for the file to be more than 2000 blocks.

Min is 128KiB, max is 16MiB
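
Back of the envelope, assuming “smallest block size that gives at most ~2000 blocks”: 128 KiB covers files up to 2000 × 128 KiB = 250 MiB, 256 KiB up to 500 MiB, and so on doubling each step, until 16 MiB covers everything above roughly 16 GiB; past 2000 × 16 MiB ≈ 31 GiB a file simply ends up with more than 2000 blocks.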

Interesting. Makes me wonder if 2000 is the right number. I mean you’d need a 32GB file to reach the maximum block size.

I think the block size is reasonable but maybe typical files should be divided into fewer than 2000 blocks.

Anyway I don’t have a strong opinion either way.

Thanks for the answers and the links!

Regarding large block sizes: I’m not arguing for having blocks larger than 16 MB; I think the current range is reasonable for disk reads, memory management and networking. I also agree that it might make sense to split a file into multiple blocks, in particular larger files.

The discussion I’m mostly interested in is whether it makes sense to allow users to tune the default split of 2000 blocks per file. I don’t want to change the default (not at this point in time at least :slight_smile:); I think changing the default would require additional analysis. What I am interested in is seeing whether this is something I should spend time working on, or whether it’s a total no-go because there are specific reasons that led to 2000 blocks per file and changing it would break some invariant somewhere.

From my perspective, I think it might allow for some improvements:

  • Less overhead for the requests: IIUC, replying to a request involves opening a file, jumping to the right position, reading a block, closing the file, and possibly checking the hash of the content (syncthing/lib/model/model.go at aea7fa5f22924fed75571cd07564f9e057c7f64f · syncthing/syncthing · GitHub). I’m not sure if additional caching comes into play, but performing fewer system calls and fewer but larger file reads might be helpful (see the sketch after this list).

  • Smaller indices: all the information about the blocks needs to be stored in the index. I’m not sure whether the whole index is constantly kept in memory; if it is, a smaller index would reduce memory usage, and if it is not, a smaller index would mean reading less data from disk. I think there are possibly additional benefits from a smaller index, but here I’d be speculating too much.
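
To make the first bullet concrete, here’s a rough sketch of what serving one block request involves (names, flow and error handling simplified; this is not the actual code in lib/model):

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"os"
)

// readBlock roughly sketches answering a single block request: open the
// file, read blockSize bytes at the block's offset, and optionally verify
// the SHA-256 hash before returning the data.
func readBlock(path string, offset int64, size int, wantHash []byte) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, size)
	if _, err := f.ReadAt(buf, offset); err != nil {
		return nil, err
	}
	if wantHash != nil {
		sum := sha256.Sum256(buf)
		if !bytes.Equal(sum[:], wantHash) {
			return nil, fmt.Errorf("hash mismatch for block at offset %d", offset)
		}
	}
	return buf, nil
}

func main() {
	// With ~2000 blocks per file, pulling a whole file means ~2000 of these
	// round trips; with larger blocks, correspondingly fewer.
	// "example.dat" is a placeholder file name for this sketch.
	if data, err := readBlock("example.dat", 0, 128<<10, nil); err != nil {
		fmt.Println(err)
	} else {
		fmt.Println("read", len(data), "bytes")
	}
}
```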

Scenarios where this might be a benefit:

  • Better backup experience on mobile: my guess is that on mobile devices a common use case is to use Syncthing to back up media. In this case the data leaves the device and probably won’t come back any time soon, so keeping things in sync after file edits might not affect the experience as much. With smaller indices (i.e. a smaller resource footprint) and less overhead per request, users might be able to speed up their uploads.

  • Media synchronization: the assumption is that media like songs and videos won’t be edited very frequently, if at all, so partial block re-sync after edits should not matter much here. As discussed before, larger blocks might actually allow faster syncs with a smaller footprint, in particular for larger media collections. Also, from the stats on data.syncthing.net it seems that the p95 of users have clusters with 10 instances or fewer, so more blocks might not lead to greater parallelism for downloading/uploading from various devices.

I agree that these improvements don’t come for free: they are a tradeoff, and there might well be some overhead added in other areas, e.g. when only a small portion of a file is changed. I think, however, that allowing users to tune this parameter might provide benefits for some usage scenarios.

What are the thoughts of the maintainers on this?

Another interesting thing to notice is that the p50 for Files Managed per Device is 3.7k and the p50 for Data Managed per Device is 13.6 GiB. If we assume the average file is ~3.6 MiB (13.6 GiB / 3.7k), we see that on average files use the smallest block size (128 KiB) and are divided into ~29 blocks. I know this is a stretch and a simplification, but what I am trying to point out is that at the moment the selection of 2000 blocks might not affect an average user that much, which is why the problem did not pop up sooner. However, I think some use cases, e.g. media collection syncs, might not be as performant as they could be given the current split. The assumption here is that the 2000-block split mostly affects larger files, which are most likely media files, which are most likely immutable; and for immutable files a smaller block seems to mostly create overhead.

A bit tangential to the topic of whether the desired number of blocks per file should be parameterized: I think it’s interesting that the ratio of total size to total files seems to be basically the same also for the p95 (859100/275700 ~= 3.1 MiB). Maybe this could be interpreted as users mostly storing pictures/MP3s? Most likely this needs to be investigated better, but I think it might open the door for some specific optimizations. Sorry for steering away from the main topic; maybe this could be further discussed in a different thread.

I think making it configurable is tricky, as files are no longer comparable, which causes weird side-effects if it’s configured differently on different devices.

I’d even say that it mustn’t be configurable. Besides weird side-effects, a definite effect of the files not being comparable is that the full file will be retransferred on every edit (even a mod-time-only change). We had reports from people with files >250MB which were edited a long time after variable block sizes were introduced, and the resulting full retransfer triggered the confusion/report. So we’d need some communication to ensure the same block sizes across all devices, raising complexity.

And as for changing block sizes/numbers generally: that’s a lot of work and risk, so it needs strong motivation. That means probably digging up the info from when this was introduced (likely there’s some reasoning around the sizes/numbers chosen there), and then, if you believe you have reason to change it, providing some actual numbers for performance improvements (profiling, benchmarks) and/or estimates for other effects (index sizes, block re-use effectiveness, …).

Some quick thoughts on the points you brought up:

Index disk usage (clearly not all in memory all the time :slight_smile: ): the worst-case ratio of disk usage by files to size of hashes in the index is for 250 MiB files (the smallest files that get 2k hashes). In that case 1 TB of such files results in 4000 * 2000 * 32 B ~= 250 MB of hashes in the index. Seems reasonable to me.

Doing 2000 disk operations per file on requests instead of fewer: I could see this improving things, but it could be tackled less intrusively by allowing multiple consecutive blocks to be requested at once.
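
Purely as a hypothetical sketch of that idea (BEP has no such message today; all names here are invented):

```go
package main

import "fmt"

// RangeRequest is a hypothetical message sketched only to illustrate the idea
// above. It names the first block and a count, letting the responder serve
// several consecutive blocks with a single open/seek/read.
type RangeRequest struct {
	Folder     string
	Name       string
	BlockSize  int32 // block size used for this file
	FirstBlock int32 // index of the first requested block
	Count      int32 // number of consecutive blocks wanted
}

// byteRange translates the request into a single (offset, length) disk read.
func (r RangeRequest) byteRange() (offset, length int64) {
	return int64(r.FirstBlock) * int64(r.BlockSize), int64(r.Count) * int64(r.BlockSize)
}

func main() {
	req := RangeRequest{
		Folder:     "music",
		Name:       "album/track.flac",
		BlockSize:  128 << 10,
		FirstBlock: 100,
		Count:      16,
	}
	off, n := req.byteRange()
	fmt.Printf("one read of %d bytes at offset %d instead of %d separate requests\n", n, off, req.Count)
}
```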

I’d even say that it mustn’t be configurable.

Just to be on the same page: is such a change completely out of the question, or could the opinion be changed with enough data?

So we’d need some communication to ensure the same block sizes across all devices, raising complexity.

I thought that that would be covered by block_size of the FileInfo message (Block Exchange Protocol v1 — Syncthing documentation). My understanding was that the latest version of the file info specifies the block size that should be used across the cluster.

dig up info from when this was introduced, likely there’s some reasoning around the sizes/numbers chosen there.

I was trying to use this thread to find more info about that, but it seems additional archaeology is needed :slight_smile:

have some actual numbers for performance improvements (profiling, benchs) and/or estimates for other effects (index sizes, block re-use effectiveness, …)

Sorry for asking, but before someone reinvents the wheel: is there already a framework for running performance tests, so that after a change is made, a command can be run and the data are collected and aggregated automatically? I know that some organizations have such infrastructure in place to track e.g. performance regressions.

For the RAM utilization, I think we are in the ballpark of 250 MB, but 250 MB would be a lower bound: IIUC what is actually tracked is BlockInfo, which is slightly larger (48 B if the data are aligned in memory), and the blocks will probably be stored in a dynamic structure that requires more capacity than the data actually stored. Anyway, I admit that some profiling and hard data should be provided before moving forward with this specific discussion.

I agree that the 2000 disk operations might be reduced through some improvements, but I don’t know the additional complexity and resource usage this would entail, in particular for larger files. I’d assume the added complexity would be non-trivial, but I’m probably not the most appropriate person to provide that estimate :slight_smile:

Really, I think the key is not the 2000-blocks-per-file bit. While the min and max block sizes weren’t brought up until later in the discussion, the key point is the minimum block size. 128 KiB is probably reasonable. And given that 128 KiB is reasonable, only files larger than 250 MB are affected by the block COUNT… and I guess the percentage of files managed by Syncthing which are >250 MB (and less than 32 GB) is very small for most people… so this is probably a non-issue for almost everyone.

And certainly, going back to the very original post about a music library, the discussion started by talking about the count of blocks. But music files range from a few MB each to maybe 20-40 MB in the case of lossless, so the block count was never 2000 for those files. Of course, when the block count was reduced to 20 the block size increased, so for sure there was a change…

But perhaps this should be as much about whether the minimum block size should be 128 KiB as about whether the block count should be 2000. Because the vast majority of files, including probably 100% of the OP’s music library, consists of files with substantially fewer than 2000 blocks (probably 30-300 blocks, actually).

For completeness, in my music library I have several files larger than 100 MB (I’d estimate in the lower hundreds), as I store many tracks that are multiple hours long. In this case I think I’d still be impacted by a large number of blocks per file, as there should be 1000 to 2000 blocks per file for ~20-30% (I guess) of the total data. I’m on mobile at the moment; if needed I can provide more data.

Very nearly out of the question. Whatever data you would want to collect, you also need to consider the bigger picture. There are about a million Syncthing users, with small files and large files; files that change all the time or never; files that change in their entirety every time, in small sections in the middle, or are appended to continuously; with very fast disk I/O but expensive or slow network; with very fast and free network but limited disk I/O; with very fast CPUs and lots of RAM and the opposite; etc.

I’m certain that whatever block size we pick isn’t optimal in all cases, so I’d settle for something that is functional and not enormously detrimental in common usage.

Having this as a tunable, with quite complex behavior to reason about and multiple tradeoffs in different dimensions, and which needs to be set identically for all devices in a cluster, seems like a bad idea to me.

Regarding configurable block sizes or not: I think the discrepancy in whether that’s a good idea/doable or not comes from a misunderstanding:

That’s not how it works. For a given version of a file, remotes do accept any block size according to that BEP field. However, when a file changes, it gets re-hashed with the block size fitting its (potentially changed) size. Which is good; otherwise an initially small file could grow and stay at the smallest block size (or just be replaced by a huge one; after all, it’s just a name).

If we change the block size, it needs to be universal (well, for future versions of files, of course). And as Jakob wrote above, that’s a capital IF.

There’s no such infrastructure. There are some benchmarks (I wouldn’t trust all of them), and they are hardly relevant here. The basic effects of a change are simple to reason about; the complexities here stem from an abundance of different use cases and systems. I also don’t think the first step should be benchmarks/testing, but rather some motivation accompanied by back-of-the-envelope calculations wherever that’s helpful, simply because that’s likely to be much more time-effective (just my opinion though, and I am not the one doing it :slight_smile: ).

Where are those 250 MB from?
I assume not from my calcs, as those weren’t about memory usage (the index is not in memory).
You did mention high memory usage in your tests before. That’s actually something where you can quickly get more info on what it’s made up of, by using the built-in profiler to get a heap profile: Profiling — Syncthing documentation

It’s probably not super simple, but it’s much easier to reason about (and change/revert/… later if necessary) than changing block sizes, which affects state everywhere and causes transitions (spread over time as files are changed, unless you want to reindex everything at once, which you probably don’t). It is a new or changed protocol request, and the puller needs adapting (there is already logic for splitting block lists into contiguous sublists based on available devices). As far as I can see this would address all of the performance motivation you have brought up so far. I am still not really convinced by the index size argument (obviously it would be smaller, I just don’t see that as very impactful).

I think the discrepancy in whether that’s a good idea/doable or not comes from a misunderstanding

Agree, it seems that’s the case.

when a file changes, it gets hashed with the block size fitting the (potentially changed) size. Which is good, otherwise an initially small file can grow and stay at the smallest block size

I think I had a different understanding of this point because of the spec:

An implementation MAY deviate from the block size rule when there is good reason to do so. For example, if a file has been indexed at a certain block size and grows beyond 2000 blocks it may be retained at the current block size for practical reasons.

However, yes, this is just an example and not the documentation of the implementation.

All in all, it seems this is working as intended: Syncthing provides an implementation of the protocol, that implementation went for some trade-offs, and although the protocol is supported there are still some edge cases that could create friction. So it’s probably better not to parameterize the desired block count in Syncthing, at least for now :slight_smile:

It seems that for the time being I’ll keep using my fork to sync the bulk of my collections and leave the main Syncthing implementation as is :slight_smile:

That part is not saying that the block size will stay the same indefinitely. It’s just that someone dealing with a remote file cannot depend on that file adhering to that block size table (it must adhere to the valid block sizes though). What Syncthing does in practice (I think that’s even documented somewhere) is add hysteresis: it requires the size to go X% above or below the threshold before the block size changes. This is to avoid a file hovering around a threshold changing block size back and forth (e.g. a huge log file with auto-pruning).
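
To make the hysteresis idea concrete, a rough sketch (the factor of two and the helper are my own assumptions, not the actual scanner logic):

```go
package main

import "fmt"

const (
	kib            = 1 << 10
	mib            = 1 << 20
	minBlockSize   = 128 * kib
	maxBlockSize   = 16 * mib
	desiredPerFile = 2000
)

// naturalBlockSize: smallest power-of-two size in [128 KiB, 16 MiB] giving at
// most ~2000 blocks (same rule as sketched earlier in the thread).
func naturalBlockSize(fileSize int64) int64 {
	var bs int64 = minBlockSize
	for bs < maxBlockSize && fileSize/bs > desiredPerFile {
		bs *= 2
	}
	return bs
}

// rehashBlockSize keeps a changed file's previous block size as long as the
// freshly computed "natural" size is within a factor of two of it, so a file
// hovering around a threshold doesn't flip block sizes back and forth. The
// factor of two is illustrative, not necessarily what Syncthing uses.
func rehashBlockSize(prev, fileSize int64) int64 {
	natural := naturalBlockSize(fileSize)
	switch {
	case prev == 0:
		return natural // new file, no previous block size to stick to
	case natural > prev && natural <= 2*prev:
		return prev
	case natural < prev && prev <= 2*natural:
		return prev
	default:
		return natural
	}
}

func main() {
	// A log file oscillating just above the 250 MiB threshold keeps its
	// 128 KiB blocks; a much larger change does move to the natural size.
	fmt.Println(rehashBlockSize(128*kib, 260*mib)) // 131072 (kept)
	fmt.Println(rehashBlockSize(128*kib, 600*mib)) // 524288 (changed)
}
```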

Generally it’s more about avoiding bad situations. The above is one, and staying at the same block size regardless obviously also creates a bad situation. We don’t want to keep a 16MB block size for a 1MB file or vice-versa.