Thanks for the quick answer, and sorry in advance for all the questions, but I think this is an interesting space and maybe there is some optimization possible.
changing the block size of already scanned/synced files is undesirable
Agreed, I was thinking more of new installations, where the block size could be set before the first scan, or of more advanced users with specific use cases.
You always want all sides to agree on a matching block size.
Would it be possible to expand on the problem? I thought this information is carried in the FileInfo message (Block Exchange Protocol v1 — Syncthing documentation), so that once a client receives a non-standard block size for a file it would keep using it afterwards. I agree this is not specified in the protocol, so it’s probably left to the implementor to decide what to do; does Syncthing ignore this field on further rescans?
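For illustration, here’s a minimal Go sketch of the behaviour I was imagining (this is not Syncthing’s actual code; the type, field names and size thresholds are my own assumptions): if a block size was already announced for the file, keep using it on rescans, otherwise derive one from the file size.

```go
package main

import "fmt"

// fileInfo is a hypothetical stand-in for the BEP FileInfo message;
// only the fields relevant to the question are included.
type fileInfo struct {
	Name      string
	Size      int64
	BlockSize int32 // block size announced by the remote, 0 if unset
}

// blockSizeFor picks the block size to use when (re)scanning a file:
// prefer whatever was already agreed on, otherwise fall back to a
// size-based default (the thresholds here are made up for illustration).
func blockSizeFor(fi fileInfo) int32 {
	if fi.BlockSize != 0 {
		return fi.BlockSize // stick with the previously announced value
	}
	if fi.Size < 256<<20 { // smaller files get small blocks
		return 128 << 10 // 128 KiB
	}
	return 16 << 20 // 16 MiB upper bound
}

func main() {
	// A file for which a 1 MiB block size was already announced...
	fmt.Println(blockSizeFor(fileInfo{Name: "a.bin", Size: 1 << 30, BlockSize: 1 << 20}))
	// ...and a new file with no announced block size.
	fmt.Println(blockSizeFor(fileInfo{Name: "b.bin", Size: 1 << 30}))
}
```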
gigantic block sizes
I’m not necessarily arguing for gigantic block sizes; I think 16 MB is a reasonable upper bound, also given general network constraints. However, I’m not sure about the gains of always having a file split into 2000 blocks; more on that in the next point.
The problem with gigantic block sizes will always be that you can’t effectively reuse local data - any change will cause huge retransfers
I’m interested in exploring this point further. When a file is changed, I’d assume it’s likely that all the blocks after the change need to be re-hashed and re-transmitted, since it seems quite unlikely that a change only affects the current block and not the following ones: e.g. if I add a comma in the middle of a text file, this shifts all the text, so every block from the one containing the change to the end of the file will be different. In such a case, with plain text files, if we assume that on average 50% of the blocks (due to insert/remove edits at a random place in the file) need to be re-transmitted for every change, would a much larger block (10x or even 100x larger) make a sizable difference in retransfer? Are other formats less susceptible to this problem? Is the assumption about the kind of change valid, or are people usually changing something in place, like a date, e.g. '2020-02-20' -> '2020-02-21', so that only a single block is affected? Is there any data available so that this problem can be better investigated?
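To make the shifting effect concrete, here’s a small self-contained Go experiment (my own sketch, not Syncthing code) that inserts a single byte in the middle of a buffer and counts how many fixed-offset blocks change:

```go
package main

import (
	"bytes"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// blockHashes splits data into fixed-size blocks and hashes each one.
func blockHashes(data []byte, blockSize int) [][32]byte {
	var hashes [][32]byte
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		hashes = append(hashes, sha256.Sum256(data[off:end]))
	}
	return hashes
}

func main() {
	const size = 1 << 20       // 1 MiB test file
	const blockSize = 128 << 10 // 128 KiB blocks

	orig := make([]byte, size)
	rand.Read(orig)

	// Insert a single byte in the middle, like adding a comma to a text file.
	edited := append(append(append([]byte{}, orig[:size/2]...), ','), orig[size/2:]...)

	a, b := blockHashes(orig, blockSize), blockHashes(edited, blockSize)
	changed := 0
	for i := range a {
		if i >= len(b) || !bytes.Equal(a[i][:], b[i][:]) {
			changed++
		}
	}
	fmt.Printf("%d of %d blocks differ after a 1-byte insert\n", changed, len(a))
}
```

With these numbers the insert lands in block 4 of 8, so that block and every block after it differ, i.e. roughly the second half of the file needs re-transfer regardless of how small the edit was.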
you have more problems downloading data from multiple devices at once.
I think this is a fair point; is there any data available to explore this further? I think there is a gain here only if a large number (>100) of clients have the same block, but if a block is shared only among a small number of clients (e.g. 10), does a smaller block actually make a positive difference? E.g. if I have a file replicated across 10 devices and the file is split into 2000 blocks, I still need to issue 200 requests to each of the 10 devices, which incurs a higher per-request overhead, vs. 2 requests per device with larger blocks. Is my understanding correct?
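Here’s that back-of-the-envelope calculation as a tiny Go program (the file size, block sizes and device count are just the example numbers from above, purely illustrative):

```go
package main

import "fmt"

func main() {
	const fileSize int64 = 2000 * 128 << 10 // ~250 MiB example file
	const devices = 10

	for _, blockSize := range []int64{128 << 10, 16 << 20} {
		// Total blocks in the file, rounded up.
		blocks := (fileSize + blockSize - 1) / blockSize
		// Requests each device sees if the blocks are spread evenly.
		perDevice := (blocks + devices - 1) / devices
		fmt.Printf("block size %8d B: %5d blocks, ~%d requests per device\n",
			blockSize, blocks, perDevice)
	}
}
```

That gives roughly 200 requests per device with 128 KiB blocks versus 2 per device with 16 MiB blocks, which is the overhead difference I was trying to get at.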