Idea: Use a special new ignore pattern for sharding/partitioning the data. This makes it possible to achieve a replication factor of, e.g., 3 across three or more connected Syncthing nodes. The idea is to use the hash of the filename to distribute the data evenly between the nodes. Each node gets its own partition/shard, assigned by specifying an ignore range for the filename hashes.
A quick search turned up these rather old discussions, but I think the idea of using hashed filenames is new. What do you think about this?
2014: sharding on multiple servers, guaranteeing a specified redundancy.
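To make the idea a bit more concrete, here is a rough sketch of how a filename hash could be turned into a per-node keep/ignore decision. This is not Syncthing code: the function names, the use of SHA-256, and the "node i keeps shards i, i+1, ..." layout are all assumptions made for illustration.

```python
import hashlib


def shard_of(filename: str, num_shards: int) -> int:
    """Map a filename to a shard index (assumption: SHA-256 of the name)."""
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and reduce it
    # to the number of shards.
    return int.from_bytes(digest[:8], "big") % num_shards


def node_keeps(filename: str, node_index: int, num_nodes: int, replicas: int) -> bool:
    """Return True if this node keeps the file, False if it would ignore it.

    Hypothetical layout: there are as many shards as nodes, and node i keeps
    shards i, i+1, ..., i+replicas-1 (wrapping around).
    """
    shard = shard_of(filename, num_nodes)
    return (shard - node_index) % num_nodes < replicas


# With 5 nodes and a replication factor of 3, list the nodes holding one file.
holders = [i for i in range(5) if node_keeps("photos/IMG_0001.jpg", i, 5, 3)]
print(holders)
```

With this layout every file ends up on exactly `replicas` nodes, as long as `replicas` is not larger than the number of nodes.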
2016: sharding to allow shared data-backend (network-filesystem) on multiple nodes
Is the point to speed up transfers by splitting traffic?
Maybe I’m not understanding correctly but it seems each machine won’t have the whole file. Then if one machine goes down you can’t receive the file?
Can’t Syncthing already request different parts of a file from different machines? Doesn’t this provide the same bandwidth benefit? Maybe this can be improved, but it seems a better strategy than basically striping the data across multiple machines.
Ultimate redundancy is already achieved in the existing implementation because all machines have the whole file.
So, I’m personally not particularly excited about sharding and such, however we get a lot of requests for more advanced ignore patterns. This is sort of a special case of that.
I’ve been thinking of adding something like CEL as an expression language for ignores. It’s often used for data validation and quite well suited to process a bunch of attributes and return a true/false state for an ignore pattern. For your specific case, if we did this, we’d have to provide a hash function for the filename (this is not a CEL builtin afaik) and then you could set a “pattern” like
hash(item.name).matches("^[0-8]")
(We could then also provide the capability for the classic footguns people are asking for like ignoring based on size etc. Many of those use cases are a bad idea for various reasons, but perhaps that could just be explained in a FAQ entry… Ignoring based on name or hash-of-name doesn’t have any of those problems though.)
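For anyone who wants to play with what such an expression would select, here is a small Python stand-in for the hypothetical `hash(item.name).matches("^[0-8]")` pattern. The `hash` function does not exist in CEL, so the choice of SHA-256 rendered as lowercase hex is purely an assumption, as is treating the regular expression as an unanchored search.

```python
import hashlib
import re


def name_hash(name: str) -> str:
    """Stand-in for the hypothetical hash(): SHA-256 as a lowercase hex string."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def is_ignored(name: str, pattern: str = "^[0-8]") -> bool:
    """Emulate hash(item.name).matches(pattern) on the hex digest of the name."""
    return re.search(pattern, name_hash(name)) is not None


for name in ["notes.txt", "photos/IMG_0001.jpg", "backup.tar"]:
    print(name, name_hash(name)[:8], is_ignored(name))
```

Since the first hex digit takes one of 16 values, `^[0-8]` covers 9 of them, so a node with this pattern would ignore roughly 9/16 of all names and keep the rest.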
For clarification:
With this proposal, a single file is not split. Data is distributed based on filename. Each file will either be completely there or not there at all.
Also, a main point of sharding is to allow a folder to hold more data than would fit on a single node of the cluster.
Copy-Paste from PR description for even more details:
“Sharding” is meant to describe a technique that distributes a large set of data onto multiple nodes of a cluster.
Benefits are
that the total amount of data in the dataset can be larger than the storage capacity of each individual node.
that write and read operations on the dataset can be distributed and thus be faster than with a single node.
that, if desired, a redundancy level can be achieved such that all data can still be read even if multiple nodes fail. This comes at the cost of write speed and a reduced maximum total size of the dataset.
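The trade-off between redundancy and total size can be put into numbers. The following is only back-of-the-envelope arithmetic, assuming equally sized nodes, a perfectly even distribution, and that every file is stored on exactly `replicas` nodes; the function names are made up for this example.

```python
def usable_capacity(num_nodes: int, node_capacity: float, replicas: int) -> float:
    """Maximum dataset size when every file is stored on `replicas` nodes."""
    if not 1 <= replicas <= num_nodes:
        raise ValueError("replicas must be between 1 and num_nodes")
    # Total raw capacity divided by the number of copies of each file.
    return num_nodes * node_capacity / replicas


def tolerated_failures(replicas: int) -> int:
    """Number of nodes that can fail while every file still has one copy left."""
    return replicas - 1


# Example: 6 nodes of 1 TB each with a replication factor of 3.
print(usable_capacity(6, 1.0, 3), "TB usable,", tolerated_failures(3), "failures tolerated")
```

With a replication factor of 1 the same six nodes could hold 6 TB but would not survive the loss of any node, which is the cost mentioned in the last point.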
This PR makes it possible to use special new ignore patterns to define sharded data distribution based on the hash of the filename.
This gives a more random and even distribution of the data than purely filename-based sharding.
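A quick experiment illustrates why hashing tends to spread files more evenly than looking at the name directly. This is an illustrative sketch only: the first-character scheme is a deliberately naive example of filename-based sharding, and SHA-256 is again an assumption rather than what the PR uses.

```python
import hashlib
from collections import Counter


def shard_by_hash(name: str, num_shards: int) -> int:
    """Shard index derived from a SHA-256 hash of the name (assumption)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_by_first_char(name: str, num_shards: int) -> int:
    """Naive filename-based sharding on the first character of the name."""
    return ord(name[0]) % num_shards


# Typical camera output: many files sharing the same prefix.
names = [f"IMG_{i:04d}.jpg" for i in range(1000)]
by_hash = Counter(shard_by_hash(n, 4) for n in names)
by_name = Counter(shard_by_first_char(n, 4) for n in names)

print("hash-based:", sorted(by_hash.items()))
print("name-based:", sorted(by_name.items()))
```

Because all names start with the same letter, the name-based scheme puts every file into a single shard, while the hash-based scheme spreads them over all four.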
CEL sounds like a good idea.
It’s certainly more readable than some of the existing patterns.
Using regular expressions to specify the range would probably also work. I never thought about this…
If we can assume that the range a-f will always be lower case, then this would be quite easy. Even combining ranges as e.g.