syncthing for replication of 40TB

n0name · October 9, 2019, 12:33pm

Hi,

I would like to use syncthing to replicate 40TB of data across regions.

Around 100GB of data changes daily across all nodes.

Moreover, I have an HTTP service that allows downloads and uploads. When an upload happens, I thought of calling the Syncthing API with the path of the file that changed.

I have not set up scan intervals or filesystem watching (inotify), since I know which files change and on which node.
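A minimal sketch of that idea, assuming Syncthing's REST endpoint `POST /rest/db/scan` with its `folder` and `sub` query parameters and the `X-API-Key` header. The base URL, API key, and folder ID shown in the usage note are placeholders, not values from this thread.

```python
import urllib.parse
import urllib.request


def build_scan_request(base_url, api_key, folder_id, rel_path):
    """Build a POST /rest/db/scan request for one path inside a folder.

    rel_path is relative to the folder root, as the `sub` parameter expects.
    """
    query = urllib.parse.urlencode({"folder": folder_id, "sub": rel_path})
    return urllib.request.Request(
        f"{base_url}/rest/db/scan?{query}",
        method="POST",
        headers={"X-API-Key": api_key},
    )


def request_scan(base_url, api_key, folder_id, rel_path, timeout=30):
    """Send the scan request and return the HTTP status code."""
    req = build_scan_request(base_url, api_key, folder_id, rel_path)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

The upload handler could then call something like `request_scan("http://127.0.0.1:8384", "<api key>", "<folder id>", "uploads/file.bin")` once the file is fully written.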

The problems I have seen so far:

- When I start, upgrade, or restart Syncthing, the initial scan takes so long that I cannot scan for new files in the meantime. The only workaround is to record the changed files in a separate file and sync them once the initial scan is done.
- When I upgrade, the index database grows huge (300+GB). I am not sure whether this is because I restarted all nodes at the same time. Before the restart, the database is around 60GB.
- While Syncthing is syncing files on a node, I cannot scan for new files on that node.
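The workaround from the first problem (record changed paths in a separate file, submit them once the folder is free) could look roughly like the sketch below. `PendingScans`, `get_state`, and `submit` are hypothetical names. `get_state` is assumed to return the folder state string, for example the `state` field of `GET /rest/db/status`, and `submit` is assumed to trigger a scan for one path.

```python
from pathlib import Path


class PendingScans:
    """Journal of changed paths that still need a scan request.

    Assumptions: paths contain no newlines, and add() is not called
    concurrently with flush() (no locking is done here).
    """

    def __init__(self, journal):
        self.journal = Path(journal)

    def add(self, rel_path):
        # Append, so entries survive a restart of the upload service.
        with self.journal.open("a", encoding="utf-8") as fh:
            fh.write(rel_path + "\n")

    def flush(self, get_state, submit):
        """Submit all recorded paths if the folder is idle; return the count."""
        if not self.journal.exists() or get_state() != "idle":
            return 0
        lines = self.journal.read_text(encoding="utf-8").splitlines()
        paths = list(dict.fromkeys(lines))  # drop duplicates, keep order
        for path in paths:
            submit(path)
        # Only reached if every submit succeeded; otherwise the journal is
        # kept and the paths are simply submitted again on the next flush.
        self.journal.unlink()
        return len(paths)
```

Calling `flush` from a periodic timer keeps the upload path itself independent of whatever Syncthing is doing at that moment.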

I’m writing to get your opinions. I’m thinking that syncthing might not be the best solution.

Kind regards,

calmh · October 9, 2019, 12:48pm

Scanning and syncing are indeed mutually exclusive, per folder. Depending on your layout you might be better off splitting the 40 TB into multiple folders to get some concurrency.
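If the data is split into several folders, the upload service has to work out which folder a changed file belongs to before it can request a scan. A small sketch of that lookup, assuming POSIX paths; the folder IDs and roots in the usage note are made up for illustration:

```python
from pathlib import PurePosixPath


def route_to_folder(folder_roots, abs_path):
    """Return (folder_id, path relative to that folder's root).

    folder_roots maps a folder ID to its root directory. If roots are
    nested, the deepest matching root wins.
    """
    path = PurePosixPath(abs_path)
    best = None
    for folder_id, root in folder_roots.items():
        root_path = PurePosixPath(root)
        if root_path in path.parents:  # path lies strictly below this root
            if best is None or len(root_path.parts) > len(best[1].parts):
                best = (folder_id, root_path)
    if best is None:
        raise ValueError(f"{abs_path} is not inside any configured folder")
    return best[0], str(path.relative_to(best[1]))
```

For example, with `{"data-a": "/srv/data/a", "data-b": "/srv/data/b"}`, the path `/srv/data/b/x/y.bin` maps to folder `data-b` and relative path `x/y.bin`.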

Upgrades mean full index transmissions, which is painful for a large amount of data. It’s mostly done out of bug paranoia, and might be something we can improve. But it does cause (temporary) index / database bloat, until the database gets compacted again. Given that this sounds large scale / commercial, you might also be better off “certifying” a version that works for you and sticking to it for longer than our usual release schedule.

AudriusButkevicius · October 9, 2019, 1:27pm

If you know when the uploads happen, why don’t you just send the new file to everyone yourself? I don’t see much value add in syncthing. It will probably be slower than just pushing that file out via rsync.
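A rough sketch of that push approach, assuming every peer is reachable over SSH under the same data root. The host names and paths are hypothetical. It relies on rsync's `--relative` option with a `/./` marker in the source path, so that only the part after the marker is recreated on the destination.

```python
import subprocess


def rsync_command(root, rel_path, host):
    """Build the rsync argv that pushes one file to one peer."""
    root = root.rstrip("/")
    src = f"{root}/./{rel_path}"  # "/./" marks where the relative path starts
    dest = f"{host}:{root}/"
    return ["rsync", "-a", "--partial", "--relative", src, dest]


def push_to_peers(root, rel_path, hosts):
    """Push one uploaded file to every peer; return the hosts that failed."""
    failed = []
    for host in hosts:
        result = subprocess.run(rsync_command(root, rel_path, host))
        if result.returncode != 0:
            failed.append(host)
    return failed
```

The returned list of failed hosts can be retried later, which covers peers that are temporarily unreachable.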

fragtion · October 15, 2019, 7:00pm

40TB does seem pretty extreme even by today’s standards xD Unless it’s all on SSD?

I know rsync supports some command-line parameters to speed up filesystem scans, such as comparing only file modification times and sizes and skipping checksum scans.

Perhaps syncthing could eventually have a similar config option to do rapid scans of “already indexed files”: if the path, size, and modified time are unchanged from the last index, skip the file. That would significantly speed up subsequent scans and make massive repo syncs like this one more practical. I don’t suppose it does this already? I would imagine not, as most users would prefer a full checksum scan each time by default?

calmh · October 16, 2019, 6:32am

We always do precisely that, as does rsync by default. Anything else would be unworkable.
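The general technique, a metadata quick check that only rehashes a file when its size or modification time differs from the stored index entry, can be illustrated as follows. This is a generic sketch and not Syncthing's or rsync's actual code; the index layout (a plain dict) and the use of a single SHA-256 over the whole file are assumptions made for the example.

```python
import hashlib
import os


def scan_file(index, path):
    """Rehash `path` only if its size or mtime differs from the index.

    index maps path -> {"size", "mtime_ns", "sha256"}.
    Returns True if the file was (re)hashed, False if it was skipped.
    """
    st = os.stat(path)
    entry = index.get(path)
    if (
        entry is not None
        and entry["size"] == st.st_size
        and entry["mtime_ns"] == st.st_mtime_ns
    ):
        return False  # metadata unchanged: skip reading the file contents

    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    index[path] = {
        "size": st.st_size,
        "mtime_ns": st.st_mtime_ns,
        "sha256": digest.hexdigest(),
    }
    return True
```

Even with this shortcut, a scan still has to stat every file, so directory and file counts continue to dominate scan time on very large trees.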

Matthias · October 16, 2019, 6:47am

If you know what changes, when, and where, I would go for rsync or similar. Just for info, we’re currently syncing ~23TB and about 8.000.000 files, and it works pretty well and fast.

andrew2 · October 21, 2019, 5:47am

40TB will work fine, so long as your files aren’t all in a single directory or spread across millions of directories. See my experience with 250TB and tens of millions of directories, optimizing large data sync. The storage subsystem and then the filesystem are likely to be your first bottlenecks.