P2P Backup via Syncthing?

I wonder if it would be possible to implement a P2P backup solution using what we already have with syncthing. Of course you could just search for someone among your friends whose hard disk space you could use as an off-site backup, but it would be much easier if off-site backup were handled by an automated tool.

How it could work

A folder you add to syncthing could be set to either just sync (as it works now) or to sync and backup. This folder wouldn't necessarily be synced at first; it would just remain on the host. (Of course you could still sync it, but that's not the main purpose of this folder.) When you add files to this folder, they get automatically encrypted, parity gets calculated, and the files get distributed to others using the backup feature. At the same time, you reserve some hard disk space for other people using the backup feature, growing linearly with the space you use in your backup folder.

Now when your computer has a fatal hard disk crash and you want to retrieve your data, you could just fire up syncthing and it would recover those files. Thanks to the parity, even if someone who held part of your (encrypted, of course) data had a crash too, you could still recover everything.

There are some tools that accomplish said task, Symform for example, but I personally don't trust the claims they make about security (mainly because they offer a web interface for accessing files that should be end-to-end encrypted, and crypto in JS is not a good idea), and I'd really like to use an open source solution for this.

How hard is it?

Since syncthing already implements encryption, there are (as far as I can see) only three things left:

  • Splitting the folder into small chunks (not sure about that, but I think the data is already handled as small chunks by the protocol).
  • Building parity information for said chunks (see the sketch after this list).
  • Distributing the data over the backup network.
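
To make the chunk-plus-parity part concrete, here is a minimal sketch in plain Go. It uses a single XOR parity chunk (RAID-5 style) rather than the p+q scheme discussed below, and nothing in it is syncthing-specific; it just shows that splitting data and rebuilding one lost chunk is not much code:

```go
package main

import "fmt"

// split cuts data into fixed-size chunks, zero-padding the last one.
func split(data []byte, chunkSize int) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := chunkSize
		if len(data) < n {
			n = len(data)
		}
		chunk := make([]byte, chunkSize) // zero-padded
		copy(chunk, data[:n])
		chunks = append(chunks, chunk)
		data = data[n:]
	}
	return chunks
}

// xorParity returns one parity chunk: the XOR of all given chunks.
// With it, any single lost chunk can be rebuilt by XORing the survivors.
func xorParity(chunks [][]byte) []byte {
	parity := make([]byte, len(chunks[0]))
	for _, c := range chunks {
		for i, b := range c {
			parity[i] ^= b
		}
	}
	return parity
}

func main() {
	chunks := split([]byte("some very important docs"), 8) // 3 chunks
	parity := xorParity(chunks)

	// Simulate losing chunk 1 and rebuilding it from the rest plus parity.
	lost := chunks[1]
	rebuilt := xorParity(append([][]byte{chunks[0], chunks[2]}, parity))
	fmt.Println(string(lost) == string(rebuilt)) // true
}
```

A real implementation would want proper erasure coding (Reed-Solomon or similar) so that more than one chunk can be lost, which is exactly the p+q idea in the edit below.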

Edit: Of course you would need more storage available than is currently used by the users of the backup feature, so in the "early days" of the feature you should use a higher disk-reserving-to-backup ratio, so that there is always some storage available on the network. Using p+q parity for every 3 blocks, you could lose 40% (2/5) of the backup and still be fine; at the same time you would be forced to offer 66.6% (2/3) more storage than you use (in an optimal scenario, without any file movement), and about 75% more in a working scenario. Going even further, you could use p+q parity for every 2 blocks, which would increase the additional need for space to 100%, and to make that realistically work, to what, 110%? In short: it would need a lot of space, but I wouldn't mind giving away 5 GB of space to others for securely backing up 1 GB of very important docs.
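
For the record, those ratios are just k data blocks plus m parity blocks: you can lose any m of the k+m stored blocks, and you pay m/k in extra storage. A throwaway check (the function names are mine, nothing syncthing-specific):

```go
package main

import "fmt"

// overhead is the extra storage ratio: parity blocks per data block.
func overhead(data, parity int) float64 {
	return float64(parity) / float64(data)
}

// tolerableLoss is the fraction of stored blocks that may vanish
// while the data stays recoverable.
func tolerableLoss(data, parity int) float64 {
	return float64(parity) / float64(data+parity)
}

func main() {
	// p+q parity for every 3 blocks: lose 2/5 = 40%, pay 2/3 = 66.7% extra.
	fmt.Printf("3+2: lose %.0f%%, overhead %.1f%%\n",
		100*tolerableLoss(3, 2), 100*overhead(3, 2))
	// p+q parity for every 2 blocks: lose 2/4 = 50%, pay 2/2 = 100% extra.
	fmt.Printf("2+2: lose %.0f%%, overhead %.1f%%\n",
		100*tolerableLoss(2, 2), 100*overhead(2, 2))
}
```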

Other benefits:

  • It would be a new use case, possibly attracting more users and/or developers.
  • It (c|w)ould speed up general file transfers, since the data is distributed over many hosts, which should spread the workload when loading data.

So… I just wanted to get that idea/feature request/whatever you might call it out there, and (once I've learned some Go; I only know Java and some C++ at the moment) help implement it, if other people are interested too.

I’m not completely sure, because I haven’t used it, but this sounds like Tahoe-LAFS… https://tahoe-lafs.org/trac/tahoe-lafs

As far as I have seen, Tahoe-LAFS is something else; see this quote from their description: "Users do rely on storage servers for availability."

Indeed. To me this says “out of scope”, i.e., something better implemented as a different utility that is not Syncthing. It could well speak the same protocol though.

Yeah, you’re probably right :smiley: different utility it’ll be then

Syncthing doesn’t encrypt the data at rest, so your users would have to handle that themselves. Or, this new utility you are planning to write could add the implementation :wink:
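
For reference, encrypting the chunks before they ever leave the machine would not be much code with Go's standard library. A minimal sketch (AES-256-GCM; key handling is deliberately hand-waved, and everything here is illustrative, not a finished design):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"io"
)

// seal encrypts plaintext with AES-256-GCM; the random nonce is
// prepended to the ciphertext so open can find it again.
func seal(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // 32-byte key selects AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal: split off the nonce, then decrypt and verify.
func open(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32)
	io.ReadFull(rand.Reader, key) // in reality: derive from a passphrase
	sealed, _ := seal(key, []byte("backup chunk"))
	plain, _ := open(key, sealed)
	fmt.Println(string(plain))
}
```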

I am not certain whether I misunderstand something, but syncthing does save previous versions of files when they change. If I have a syncthing server which is used just for sync and nothing else, then for all files which are changed, I can find previous versions in the .stversions folder (using the "staggered" versioning setting).

Compared to a backup program like rdiff, I see only one thing lacking: the ability to restore the state for a specific point in time. But it cannot be terribly difficult to write a small program that reads the filenames from .stversions and collects, for each file, the version valid at a desired time (something like the sketch below).
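
A rough sketch of what that could look like. I'm assuming the ~YYYYMMDD-HHMMSS tag that the versioner inserts into versioned file names; adjust the regex if your names look different:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"time"
)

// Versioned copies are assumed to look like "report~20240131-120301.txt",
// i.e. a ~YYYYMMDD-HHMMSS tag inserted before the extension.
var tag = regexp.MustCompile(`^(.*)~(\d{8}-\d{6})(\.[^.]*)?$`)

// versionAt walks the .stversions tree and returns, per original file
// name, the newest version that is not newer than "when".
func versionAt(dir string, when time.Time) (map[string]string, error) {
	best := map[string]string{} // original name -> chosen version path
	bestTime := map[string]time.Time{}
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		m := tag.FindStringSubmatch(info.Name())
		if m == nil {
			return nil // not a versioned copy
		}
		ts, err := time.Parse("20060102-150405", m[2])
		if err != nil || ts.After(when) {
			return nil // unparsable or newer than the cutoff
		}
		orig := m[1] + m[3] // file name without the version tag
		if ts.After(bestTime[orig]) {
			bestTime[orig], best[orig] = ts, path
		}
		return nil
	})
	return best, err
}

func main() {
	cutoff := time.Date(2024, 1, 31, 12, 0, 0, 0, time.Local)
	files, err := versionAt(".stversions", cutoff)
	if err != nil {
		panic(err)
	}
	for orig, path := range files {
		fmt.Printf("%s -> %s\n", orig, path)
	}
}
```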

Such a tool would be most welcome for all cases where one of those nasty pieces of crypto malware has encrypted your disk (and the encrypted files were synced…). Then one could just collect the state from before the attack, restore the files from .stversions, and sync!

Any hints on how this could be achieved?

Could a combination of tools accomplish the task: duplicity (or a similar tool) to handle on-site incremental backups and encryption, plus syncthing? Restoring a specific point in time would be the task of the backup tool; syncthing would be used for off-site backup propagation.

I am used to syncthing + borg. My server node is backed up once an hour via borg, which deduplicates nicely to save disk space.

Syncthing + automatic ZFS snapshots for me.

Ditto; replace ZFS with Btrfs. Managed via snap.

What the OP is describing is basically a Distributed File System across many (or maybe just several) nodes, which among other features may have erasure coding, self-healing, and auto-balancing (if a node goes down).

This could probably be done with the block exchange protocol, but I'm not sure why one would want to. There are already dozens of other FOSS (and commercial) products out there that do this. Of course, most of them are B.Y.O. (clients, or an entire network of clients). But there are a few that kinda do stuff like that with strangers across the internet; Maidsafe and Storj come to mind, and IIRC one of them uses the Bitcoin blockchain to help it.

Bottom line: it will be the 27th time this wheel has been reinvented. There are already mature products out there that do this. Ceph is probably the best placed to win out in the end for enterprises, with all the corporate backing it has.

I personally use MooseFS at home among 5 clients (2 of which are Raspberry Pis with hard drives attached). Not the fastest in the world with only 100 Mbps NICs, but perfectly acceptable performance for backups.

More info on DFSs:

I made a similar proposal on the Restic forum.

Restic is a deduplicating backup solution that encrypts all the data so the destination server does not need to be trusted. Perfect for backing up to friends. It also stores the data in a much more compact form than a duplicate of the file system.

What syncthing provides is all the plumbing for this sort of arrangement. It has a web API, P2P discovery, and the ability to open a data channel between two servers.
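
To illustrate that plumbing: here is a minimal sketch of talking to syncthing's web API from Go. It assumes the default GUI address 127.0.0.1:8384 and an API key copied from the GUI settings (the placeholder below is hypothetical), and simply dumps the current connections; a backup coordinator could build on this:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// The REST API authenticates via the X-API-Key header.
	req, err := http.NewRequest("GET",
		"http://127.0.0.1:8384/rest/system/connections", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-API-Key", "YOUR-API-KEY") // placeholder, set your own

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON with per-device connection state
}
```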

Unlike file sync, for backups we can connect two users who don't trust each other and let them share disk space.

Anyway, this could be a separate project, but it seems like it might not be too bad integrated with syncthing's other features either.
