I’m using Syncthing for read-only content replication to mirrors. I know this isn’t quite the intended use case, but I thought it might be worth writing up what we’ve found so far.
First up, the data. There are four different data sets that I’d like to use with Syncthing, and so far I’m using it on two of them to get some experience. Until recently we were using rsync exclusively, and we had a brief encounter with btsync.
The first set is around 70GB; most files are on the order of 100MB to 1.5GB. The second set is smaller (4GB) with small files - a few hundred KB through tens of megabytes.
The data is being replicated all around the world (East Asia, North America, the EU, West Asia, etc), so we see the worst of the internet’s misbehavior. Things were getting out of hand with rsync: we had to benchmark transfer rates periodically and move the rsync mesh endpoints around to find better traffic characteristics.
I haven’t looked too closely yet (I know we can see the pending transfer list via the REST API), but when a new bulk directory of thousands of files is scanned, it looks like all the clients try to fetch the same files, in the same order, from the one master that has them, rather than producing some sort of p2p flood behavior. Am I imagining this? If not, is there room for adjusting it?
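For anyone wanting to verify this, something like the following would let you watch the queue on each client and compare the pull orders. A minimal sketch in Python, assuming the /rest/need endpoint (the path and response shape vary between versions) and a made-up folder ID:

```python
# Peek at what a node still needs, via Syncthing's REST API.
# ASSUMPTIONS: the /rest/need endpoint (it has moved in later
# versions), a folder ID of "mirror-data", and the default GUI address.
import json
import urllib.request

API_KEY = "your-api-key-here"  # from the GUI settings page
BASE = "http://localhost:8384"

req = urllib.request.Request(
    BASE + "/rest/need?folder=mirror-data",
    headers={"X-API-Key": API_KEY},
)
with urllib.request.urlopen(req) as resp:
    needed = json.load(resp)

# Dump the raw response; field names differ by version, but the list
# order should reflect the order the node intends to pull in.
print(json.dumps(needed, indent=2))
```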
Our experiments with btsync a while ago did show that behavior (in between crashes) - the master with the new blob of data would distribute different fragments to the different peers, rather than the same data to every peer.
The second thing that caused a great deal of surprise was that master mode definitely did not do what I expected. What I was looking for was something like the btsync read-only vs read-write semantics. But right now, if a (supposedly) read-only mirror damages a file, the others will replicate the damage. The master merely activates an optional overwrite button that somebody has to go and click.
I think we worked around this in the interim by creating a second staging master, putting all the Syncthing mirrors into read/write/non-master mode, and rsyncing over any changes that appear (ie: something coredumps into a mirror, it gets replicated, then the rsync removes it and the removal gets replicated - no button click needed). It does seem sub-optimal, though, when what I think I want is some sort of auto-override behavior.
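For concreteness, the workaround loop amounts to something like this. A sketch in Python with made-up paths and an arbitrary interval; note that without rsync’s --checksum flag, a corrupted file with unchanged size and mtime would slip through (and enabling it means reading everything each pass):

```python
# Staging-master cleanup loop: the staging copy is the source of truth.
# Anything that appears or changes under the syncthing master gets
# reverted here, and the revert then replicates out to the mirrors.
# ASSUMPTIONS: made-up paths and a made-up interval; adjust to taste.
import subprocess
import time

STAGING = "/srv/staging-master/"  # pristine copy, never touched by syncthing
LIVE = "/srv/syncthing-master/"   # the read/write folder shared with mirrors

while True:
    # -a restores damaged files (when size or mtime changed);
    # --delete removes anything extraneous, e.g. a stray core dump.
    # Add --checksum to also catch same-size, same-mtime corruption.
    subprocess.run(["rsync", "-a", "--delete", STAGING, LIVE], check=True)
    time.sleep(300)
```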
The third discrete data set is around 2TB, about a million files. We’re replacing about 200GB of these files per day and waiting for the (rsync-based) mirrors to catch up before moving symlinks around. New files land in drops of around 50GB at a time. I suspect this will be too much abuse for Syncthing and too far outside its design intent. btsync wouldn’t have worked either - it would have spent all its time rehashing. This volume is highly dynamic; think of OS package build farms feeding it.
The fourth set is around 1.5TB and 2.7 million files and is mostly static, with another 3TB / 6 million files on the side that are almost completely static.
I’m pretty sure the data profile of set #4 would be just fine. It would presumably take a while to index and converge, but I haven’t actually tried it yet. I’m really worried about the read-only replica problem on #4, though. I don’t have the space to keep a non-Syncthing source of truth online to continuously rsync away any changes that appear in the remote replicas.
The reason I’m looking for something better than an rsync mesh is that I want something that automagically adapts to internet bottlenecks, so I can stop babysitting it. When switching rsync endpoints around is the difference between 10KB/s and 40MB/s of throughput, we have no choice but to keep doing it.
So, that’s what I’m trying to do. I realize this isn’t quite Syncthing’s goal (ie: make the replicas look EXACTLY like the masters, no matter what, and immediately undo local changes). Is this something I can expect to have to fight Syncthing over, or are there ways it could be tweaked to get behavior closer to what I’m after?
Looking over the REST docs, it looks like /rest/completion (or wherever that endpoint has moved to) might give me the status info I’d need to drive some of the state changes for dataset #3.
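If that pans out, the state machine for dataset #3 could be as simple as the sketch below. Python again, with made-up device IDs, folder ID, and paths; the endpoint path, its device/folder parameters, and the "completion" field in the response are assumptions about whatever version ends up running:

```python
# Wait until every mirror reports 100% completion for a folder, then
# atomically flip the "current" symlink to the new drop.
# ASSUMPTIONS: /rest/completion with device/folder parameters, a
# "completion" percentage field, and made-up IDs and paths throughout.
import json
import os
import time
import urllib.request

API_KEY = "your-api-key-here"
BASE = "http://localhost:8384"
MIRRORS = ["MIRROR-DEVICE-ID-1", "MIRROR-DEVICE-ID-2"]

def completion(device: str, folder: str) -> float:
    req = urllib.request.Request(
        BASE + "/rest/completion?device=%s&folder=%s" % (device, folder),
        headers={"X-API-Key": API_KEY},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["completion"]  # a percentage, 0-100

# Block until the drop has fully replicated everywhere...
while any(completion(d, "dataset3") < 100 for d in MIRRORS):
    time.sleep(60)

# ...then repoint the symlink atomically: create it under a temporary
# name and rename over the old one, so readers never see a broken link.
os.symlink("/srv/drops/new-drop", "/srv/current.tmp")
os.replace("/srv/current.tmp", "/srv/current")
```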
TL;DR - I’m looking for more randomness in the clients’ pull order to maximize p2p throughput, and for read-only slaves that can never push changes into the cluster (or some other way of simulating that).
Any other thoughts? Am I even looking in the right place?