I’ve been experimenting with cloud storage for large files, in particular volume containers and VM images, and have been looking for a workable solution. FWIW I have been reasonably pleased with Dropbox for this task – I experimented with a moderately large volume (128GB), and although the initial upload took a long time, incremental updates to the file were transferred fairly quickly. Dropbox’s delta synchronization seems to work well.
Of course, I really want to use my own storage rather than a third party provider, so I’ve been looking for an open-source platform that does delta synchronization. Syncthing appears ideal overall, and although it’s still a bit rough I’m really impressed with it for the most part.
However, last night I was feeling confident that my setup was working correctly, so I decided to try a genuinely large file: 500GB, as a stress test to see how quickly it would synchronize across a gigabit LAN connection. It is the only file in its shared folder.
The GUI showed a folder status of “Scanning” for several hours, which is a bit disappointing compared to my earlier experience (Dropbox starts actively transferring/updating data almost immediately). It then suddenly changed to “Up to Date”, with a state of 0 items, 0 bytes rather than the one file I would expect. There was no significant network traffic during this time, and I could find no relevant information in the console log. I “touched” the file to update its timestamp and waited a couple of minutes to see if it would try again, but it did not return to the “Scanning” status.
I’ve moved the file out of the directory and back in again and it is now once again “Scanning”, though I don’t expect different results when it is finished.
Not really, not without any information at all in the logs. There are various debug options that can be enabled, but before that: it didn’t run out of memory and crash or something? Several hours for 500 GB doesn’t sound unreasonable, but it’s a tad on the long side, so I’m guessing this may not be the most powerful machine around?
I was actually wondering about memory in particular. This is running on CentOS 7 (the second run is still ongoing, btw), and I’ve had its basic resource monitor application running the whole time. Memory usage is flat – it doesn’t seem to be allocating any additional memory while it runs.
Using ‘iotop -oP’ I can see that syncthing is definitely reading the file, but it isn’t writing anything. So I’m a bit puzzled – what is it doing?
At any rate, you’re right – this isn’t a super fast machine. It’s a few years old, a 6-core AMD Phenom II with 16GB of DDR2. The OS is running on an 850 Pro SSD, but unfortunately the file in question is on a larger spinning disk.
One additional question – if it succeeds, is this scanning process expected to happen again after every modification to the file?
I am monitoring the process remotely, so I can’t see the disk activity directly.
What I can gather from here is that, a couple of hours in, memory allocation to the process has been about 950MB and CPU utilization about 12% or so, pretty much continuously. I recall seeing disk reads at about 50-70MB/sec.
But I can’t verify at this moment, because …
What’s interesting is that in the past few minutes, as I’ve been writing this, memory usage has jumped from about 1GB to over 4GB, and CPU utilization has risen to 22%. (How many threads does the process use?)
And … now it’s at 5GB and 0% CPU. The GUI now says Up to Date. This might be the failure sequence that I didn’t get to see the last time it ran…?
A couple of minutes later, memory allocation has stepped down to 2GB.
OK, a third run, this time a little more careful with the details. After an hour it’s using about 14-16% CPU, about 930MB of memory, and disk reads are fluctuating between 50 and 70 MB/sec, staying mostly below 60.
My guess is that ~16% CPU means nearly 100% of one core out of the six available (100% ÷ 6 ≈ 16.7%), if this part of the process runs in a single thread. It would make sense for this stage to be CPU bound.
Thanks for the suggestion! I set it to 4 and restarted.
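In case it helps anyone else reading along: what I changed (if I’m understanding the suggestion correctly) is the per-folder hashers option in config.xml, which sets how many parallel hashing routines the scanner uses. The folder id and path here are just placeholders:

```xml
<folder id="big-files" path="/data/big-files">
    <!-- assumed setting: number of parallel hashing routines for this folder -->
    <hashers>4</hashers>
</folder>
```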
It’s obviously making a difference, as both CPU and disk IO are higher, but only by a few percent: CPU is around 15-16%, with brief excursions to 18-19%, and disk IO has increased to around 60-80 MB/sec.
My hunch is that it’s IO bound at this point. The manufacturer’s sustained transfer rate spec for the disk is 110MB/sec, but of course there may be a lot more going on here. I might see some benefit from trying a newer/faster drive or testing on an SSD.
Anyway, performance issues aside, I’m still curious about why it’s failing at the end, and what I can do to diagnose the problem.
Definitely more output, but unfortunately I still can’t find anything relevant. The directory and the particular file are mentioned at the beginning of the scan, but there is no further mention of them, either during the scan or after the process fails.
Ah. Figured it out. We currently only handle files up to 128 GiB, due to a limit on the length of the block list. Ideally, to handle larger files we should use a variable (larger) block size, but in the meantime we can increase that limit. Although, as you’ve already noticed, the work required to figure out what actually changed when a file like this is updated makes it somewhat impractical even once we fix this…
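For a rough feel for where that number comes from, here’s the arithmetic with illustrative values (a fixed 128 KiB block size and a block list capped at 2^20 entries; the exact internals may differ):

```go
package main

import "fmt"

func main() {
	// Illustrative values only: a fixed 128 KiB block size and a block
	// list capped at 2^20 entries.
	const blockSize = 128 << 10 // bytes per block
	const maxBlocks = 1 << 20   // entries in the block list

	maxFile := int64(blockSize) * int64(maxBlocks)
	fmt.Printf("largest representable file: %d bytes = %d GiB\n", maxFile, maxFile>>30)
	// prints: largest representable file: 137438953472 bytes = 128 GiB
}
```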
Is there any possibility of eventually pipelining these operations, so that data transfer could begin while the hash is still being calculated? I’m wondering how Dropbox manages to start transferring so quickly after a change to a large file.
I might be able to live with the 128GB limit for now, though there are a few use cases I can think of where smooth handling of very large file synchronization might be useful. (4K nonlinear video editing projects come to mind.)
At any rate, thank you for looking into the problem. I will continue to experiment with it.
Theoretically yes, at least in the case where the other devices don’t have the file at all. Then they don’t really need to know about all of the file before beginning to download it.
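Very roughly, something like this sketch (not our actual code; the names, block size, and path are made up for illustration): hash the file block by block and hand each finished block off for sending as soon as it’s ready, instead of waiting until the complete block list exists.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// hashedBlock pairs a block's offset and length with its strong hash.
type hashedBlock struct {
	offset int64
	size   int
	hash   [sha256.Size]byte
}

// hashBlocks reads the file sequentially, hashing fixed-size blocks and
// sending each one on the channel as soon as it has been computed.
func hashBlocks(path string, blockSize int, out chan<- hashedBlock) error {
	defer close(out)
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	buf := make([]byte, blockSize)
	var offset int64
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			out <- hashedBlock{offset: offset, size: n, hash: sha256.Sum256(buf[:n])}
			offset += int64(n)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	blocks := make(chan hashedBlock, 16)
	go hashBlocks("bigfile.img", 128<<10, blocks) // error handling omitted in this sketch

	// Stand-in for "start sending to a device that has none of the file yet":
	// each block can be acted on while later parts of the file are still being hashed.
	for b := range blocks {
		fmt.Printf("block at offset %d (%d bytes): %x...\n", b.offset, b.size, b.hash[:4])
	}
}
```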
As for detecting which parts of the file changed more quickly, I don’t know. We could use a weaker (and thus quicker) hash over smaller blocks, and then just be limited by the disk speed while homing in on the changed regions.
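To make that idea concrete, here’s a rough sketch (again not actual code; the chunk size and names are just for illustration) that computes a cheap Adler-32 checksum per small chunk and compares it against a previously stored list, so only the chunks that differ would need the expensive strong hash and transfer:

```go
package main

import (
	"fmt"
	"hash/adler32"
	"io"
	"os"
)

// weakHashes computes one Adler-32 checksum per chunk of the file.
// Smaller chunks localize changes better but make the checksum list longer.
func weakHashes(path string, chunkSize int) ([]uint32, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var sums []uint32
	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			sums = append(sums, adler32.Checksum(buf[:n]))
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return sums, nil
		}
		if err != nil {
			return nil, err
		}
	}
}

// changedChunks returns the indices where the fresh checksums differ from the
// stored ones; only those regions would get the expensive strong hash.
func changedChunks(old, cur []uint32) []int {
	var changed []int
	for i := range cur {
		if i >= len(old) || old[i] != cur[i] {
			changed = append(changed, i)
		}
	}
	return changed
}

func main() {
	const chunkSize = 16 << 10 // 16 KiB, illustrative only
	old, _ := weakHashes("bigfile.img", chunkSize)
	// ... the file is modified somewhere in the middle ...
	cur, _ := weakHashes("bigfile.img", chunkSize)
	fmt.Println("chunks to re-hash and transfer:", changedChunks(old, cur))
}
```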