Huge storage sync

I work for the National Library of Florence. We are building geographically distributed storage for electronic printed material. At the moment we have 3 nodes of 32 TB each, and we are considering the best way to sync the 3 nodes with each other. We need an affordable syncing solution, and we think a torrent-style protocol is the right choice. In our storage, speed is not the main concern, and a file will never be deleted.

I am just wondering whether your system can manage a large storage with a huge number of ~200 MB files in a way that duplicates every file on every node.

I would like to talk about it if someone is interested. Our goal is to build a “self-healing” environment where every node is an exact copy of the other nodes; where, if a node fails, it can be rebuilt automatically; where, if we need to add a new node, it will be populated automatically; and where, if a file is corrupted or lost, it will be restored automatically.

What do you think? You can contact me at ccorsani@gmail.com or in this discussion.

Best regards (sorry for my English)

Sounds like it would work. That ends up being something on the order of 170k files, which is currently above the limit but will be fine in v0.8.6. It will require some RAM to keep the hashes for the files and things will probably go a bit smoother if you can split your 32 TB into more than one repository.
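
As a rough back-of-the-envelope sketch of those numbers, assuming 128 KiB blocks and a 32-byte SHA-256 hash per block (assumed values for illustration, not exact figures for any particular version):

```go
package main

import "fmt"

func main() {
	const (
		nodeBytes  = 32e12     // 32 TB per node
		fileBytes  = 200e6     // ~200 MB average file size (from this thread)
		blockBytes = 128 << 10 // 128 KiB block size (assumed)
		hashBytes  = 32        // one SHA-256 hash per block (assumed)
	)
	files := nodeBytes / fileBytes          // ~160,000 files
	hashesPerFile := fileBytes / blockBytes // ~1,500 block hashes per file
	hashMem := files * hashesPerFile * hashBytes
	fmt.Printf("files per node: ~%.0f\n", files)
	fmt.Printf("block hashes kept in RAM: ~%.1f GB\n", hashMem/1e9)
}
```

That order-of-magnitude memory footprint is why splitting the 32 TB into several smaller repositories makes things go more smoothly.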

Calmh, thank you for your kind answer.

Unfortunately, our storage will keep growing. At the moment we use GlusterFS on each node.

For us speed is not important (this is a kind of “static preservation storage”, not a dynamic one), so why not consider keeping the hashes on the filesystem rather than in RAM? We do not need to know instantly that a new file has arrived; we can sync it hours later. The most important thing is that it does get synced, and that not even one file is lost in case of a partial or total node failure.
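
Something like this minimal sketch is what we have in mind, assuming one small index file of block hashes per tracked file. The names here (HashStore, Put, Get) are hypothetical, just to illustrate the idea, and are not Syncthing’s actual index code:

```go
// Hypothetical sketch of the "hashes on the filesystem" idea: block hashes
// live in small index files on disk and are read back only when a file must
// be compared or repaired, instead of being held in RAM.
package hashstore

import (
	"crypto/sha256"
	"fmt"
	"os"
	"path/filepath"
)

type HashStore struct{ dir string }

func New(dir string) (*HashStore, error) {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return nil, err
	}
	return &HashStore{dir: dir}, nil
}

// key derives a fixed-length index file name from the tracked file's path.
func key(file string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(file)))
}

// Put persists the concatenated block hashes for one file path.
func (s *HashStore) Put(file string, hashes []byte) error {
	return os.WriteFile(filepath.Join(s.dir, key(file)), hashes, 0o644)
}

// Get reads the hashes back from disk on demand.
func (s *HashStore) Get(file string) ([]byte, error) {
	return os.ReadFile(filepath.Join(s.dir, key(file)))
}
```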

We will probably take the source code to understand how it works and, if possible, modify it for our needs. We were just wondering whether it would work as it is.

Yep, it’s considered. Only so much free time, etc.

If you don’t care about performance, couldn’t you just set up a giant swap file?