Increase the transfer speed for small files

Hi,

Transferring one large file is much faster than transferring lots of small files. It would help to pack the small files together to increase the transfer rate and unpack them on arrival. This doesn't necessarily have to use compression, which can be slow on some devices.
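
For illustration only, here is a minimal sketch of the packing idea using Go's standard archive/tar, without compression. This is not how Syncthing actually transfers data; the receiver would unpack the same stream with tar.NewReader:

```go
package packsketch

import (
	"archive/tar"
	"io"
	"os"
	"path/filepath"
)

// packDir streams every regular file under root into a single uncompressed
// tar archive, so many small files travel as one sequential stream.
func packDir(root string, w io.Writer) error {
	tw := tar.NewWriter(w)
	defer tw.Close()

	return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || !info.Mode().IsRegular() {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = rel
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
}
```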

You can adjust the number of copiers in config.xml to parallelize this better for a given folder. You can also raise the number of pullers, but make sure the copiers do not exceed the pullers and that there are always spare pullers available.

Thanks for your reply. What do I have to add to config.xml to increase the number of copiers and pullers?

There are settings for pullers and copiers for each folder; you just need to increase them.

Hi! I’m new to the forum, but I’ve been a Syncthing user for more than a year now. We use it in our small company.

I also experience this slow sync of tiny files (I know it’s a topic that has been discussed in the forum several times). I’m talking about folders which hold around 1.7M files in 217k directories, approx. 170 GB (a mix of small and big files, but mostly small). Our main issue is the ‘first sync’ of a new device, which may take literally days to complete.

I want to put a couple of questions on the table:

  • If increasing the number of copiers increases the overall sync speed for small files, why not handle it dynamically when Syncthing detects that it is copying a lot of small files?
  • Is there any more room to improve the slow sync of small files nowadays? Are there any plans on the roadmap for it?

PS: Thanks for the work you do. Even though I’m talking here about an issue, the general experience with Syncthing is excellent.

I doubt the number of copiers has any effect. Likely it’s just the overhead of handling a file, including fsyncing it. At times we’ve had an option to disable that, and I don’t honestly remember the current state, but it’s something you could search for.

There is still an option to disable fsync in the advanced settings. Beware though that it might lead to data corruption if you abruptly lose power, etc.

Hi, thanks for the answer.

I don’t want to sacrifice data safety in any case, so that door is closed.

My point is more to see (or find) whether there are new avenues to explore.

Let me start by explaining my use case: we have a kind of ‘client-server’ setup, with a main server where all the shared folders are set to “Receive Only”, and each client has only one shared folder on its side.

From other posts, I read that one of the problems of delaying/disabling the per-file fsync is that if the file’s record is written to the DB but there is a power outage while the file is still being written to disk, the DB has the file reference but the file is not in the FS, so Syncthing may think that the file has been deleted and spread the deletion to the clients.

However, if this is correct, what if the DB record is written after the file has been written to disk? And what if this is done in batches when handling small files? I.e., if a Syncthing node sees that it has to receive 100 files of 1 KB, it would do these steps (see the sketch after the list):

1.- receive all of them and write them to disk (no fsync yet)

2.- fsync the files

3.- write to the db

4.- if everything is correct, go back to step 1.
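
As a rough illustration of that ordering (not Syncthing’s actual code), a batched version could write everything first and only touch the database at the end; commitToDB here is a hypothetical placeholder for the real DB commit:

```go
package batchsketch

import "os"

// syncBatch writes a batch of small files, fsyncs them all afterwards, and
// only then records them in the database.
func syncBatch(files map[string][]byte, commitToDB func(names []string) error) error {
	names := make([]string, 0, len(files))

	// Step 1: write all files, no fsync yet.
	for name, data := range files {
		if err := os.WriteFile(name, data, 0o644); err != nil {
			return err
		}
		names = append(names, name)
	}

	// Step 2: fsync each file (still one fsync per file, but done only after
	// the whole batch has been received and written).
	for _, name := range names {
		f, err := os.Open(name)
		if err != nil {
			return err
		}
		err = f.Sync()
		f.Close()
		if err != nil {
			return err
		}
	}

	// Step 3: only now write the records to the DB (hypothetical helper).
	return commitToDB(names)
}
```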

If a power outage happens during this process, in the worst case we’ll have a bunch of (maybe) corrupted files from step 2, right?

An option would be: if we keep track while a bulk transfer is happening, we know which files were involved in that transfer and can consider them untrusted after the node is restarted, i.e. delete them (or their remains) and request them again from the other nodes.

Somehow we need to make the whole bulk operation transactional: we need to record when it starts, know what needs to be transmitted, and know when everything has finished OK. If any of those steps fails, on each Syncthing start we should check whether any transaction failed to finish and clean up the mess.

Apologies in advance if I’m saying something totally wrong or something you have already discussed. I’ve read a lot in the forum, but maybe I missed a specific post where this was already covered.

Is it clear that this will help significantly, or even at all? I don’t think there’s anything fundamental blocking us from batching up fsyncing. However in the end it will still be one fsync operation per file, which is likely still slow.

Isn’t there an alternative that fsyncs all the files at once instead of file by file? I mean, on Linux you can simply run ‘sync’ on the command line and that will flush all pending writes to the disk.
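
For what it’s worth, on Linux such calls do exist as syscalls, and golang.org/x/sys/unix exposes them; a Linux-only sketch (not portable, and not Syncthing code) could look like this:

```go
package syncsketch

import (
	"os"

	"golang.org/x/sys/unix"
)

// syncWholeFilesystem flushes pending writes for the filesystem that
// contains dir, instead of fsyncing individual files one by one.
func syncWholeFilesystem(dir string) error {
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return unix.Syncfs(int(d.Fd())) // or unix.Sync() to flush every filesystem
}
```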

Is it clear that this will help significantly, or even at all?

No, I’m not sure about it, and I don’t have the skills to implement a test for it quickly. However, if the current option of disabling fsync increases performance, and it’s possible to do one sync for all pending writes instead of one per file, it would be a performance improvement for sure.

However in the end it will still be one fsync operation per file, which is likely still slow.

Another thought about this: even if there is one fsync operation per file, and this takes time, all the data has already been received on the receiver’s side. Unless there is a power outage or another situation where the transaction is broken and the data needs to be considered untrusted, the sender doesn’t need to send the data again.

I am not questioning fsync being slow, just asking if batching fsyncs helps. Calling Sync on a file is supported by Go’s os package, but there’s no global sync. We’d have to implement that for all platforms, if it even exists outside of Linux. Plus it has the disadvantage that it will flush everything, i.e. potentially writes not related to Syncthing.

Also I think it’s not really about data loss per se, it’s that our state written to disk should match the state recorded in the database, or we’ll make incorrect decisions when starting up and potentially undo recent file changes.

We try to ensure this by fsyncing changes to disk in close proximity to committing the change to the database. We can delay the fsync, but then there’s a potentially long time when the database is ahead of reality if there’s a crash.

Also, none of this is compelling to me for a “first sync with lots of tiny files” use case. Yeah, it takes a while, but it’s only one time for the lifetime of the setup. It’ll be done probably before we’ve finished talking about it, certainly before any patch is deployed, and then it’s a non-problem.

If you still feel it’s an actual problem, you can disable fsync for the duration of the initial sync and then enable it again. That also makes it a non-problem.

The problem is that fsync on different filesystems does different things.

I think fsync(file) on ext4 effectively is sync (sync whole drive).

There is fdatasync, which you can call in different modes (async after each block write, sync on file close), but you still need to fsync the directory in which the file is created, leading to the same problem.
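
To make that pattern concrete, here is a small Linux-oriented sketch (again, not Syncthing code) of fdatasync on the file followed by an fsync of its parent directory, which is the extra step described above; directory fsync semantics vary by platform and filesystem:

```go
package dursketch

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// writeDurably writes a file, flushes its data, and then fsyncs the parent
// directory so the new directory entry itself survives a crash.
func writeDurably(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	// Flush the file data (some metadata updates may be deferred).
	if err := unix.Fdatasync(int(f.Fd())); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// Also fsync the directory containing the file.
	dir, err := os.Open(filepath.Dir(path))
	if err != nil {
		return err
	}
	defer dir.Close()
	return dir.Sync()
}
```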

Answer to imsodin:

Calling Sync on a file is supported by Go’s os package, but there’s no global sync. We’d have to implement that for all platforms, if it even exists outside of Linux.

As I said before, even if the worst case is running all the fsyncs after a batch of files has been received, that would already be an improvement.

In the worst case (power outage, data corruption, etc.) the action would be to re-request the data. And that’s not that bad, IMO.

Plus it has the disadvantage that it will flush everything, i.e. potentially writes not related to Syncthing.

Is that a disadvantage? Why?

Answer to calmh:

We try to ensure this by fsyncing changes to disk in close proximity to committing the change to the database. We can delay the fsync, but then there’s a potentially long time when the database is ahead of reality if there’s a crash.

I agree with that approach for single files; I’m just suggesting it could be handled in a different way if a transactional “batch of files” sync is done, as explained in one of my comments above.

The objective is the same: that the change in the DB and the write to the disk are consistent. You just need to track which files in the bulk transaction have been both written and committed; if the whole transaction completes OK, fine. If not, re-request the data.

Also, none of this is compelling to me for a “first sync with lots of tiny files” use case. Yeah, it takes a while, but it’s only one time for the lifetime of the setup. It’ll be done probably before we’ve finished talking about it, certainly before any patch is deployed, and then it’s a non-problem.

“A while” is literally days, and not just two: the initial sync of my files (1.7M files, 163 GB) took around two weeks to fully complete. Consider that we’re talking about an office server and people are only connected during working hours. And I don’t consider my file repository big; most files are less than 64 KB.

It’s also not only the first sync: if I add a bunch of files (imagine a project repository, which can contain thousands of files), it will also take a long while until everything is there.

If you still feel it’s an actual problem, you can disable fsync for the duration of the initial sync and then enable it again. That also makes it a non-problem.

Don’t misunderstand me. I’m not opening this discussion to be told that I can handle it with workarounds. My intention is rather to contribute what I can from my experience as a user. I think it’s an excellent tool, and specifically this matter of slow syncing of tiny files is not an uncommon issue in the community.

Answer to AudriusButkevicius:

I think fsync(file) on ext4 effectively is sync (sync whole drive).

My use cases are on both ext4 and Btrfs.

There is fdatasync, which you can call in different modes (async after each block write, sync on file close), but you still need to fsync the directory in which the file is created, leading to the same problem.

In the end, there is one point we all agree on: fsync, fdatasync or sync is required. Then maybe the option is to try to minimize the number of times we call it, in the safest way possible.

In the end, there is one point we all agree on: fsync, fdatasync or sync is required. Then maybe the option is to try to minimize the number of times we call it, in the safest way possible.

We call it once per file now; how can you minimise it even more? The only way to minimise it further is to not call it at all.

My initial proposal is (at a high level):

1.- Start a “transaction”: create the transaction record in the DB, together with the list of files involved in that transaction.

2.- Receive n files (say 100, 1,000, whatever; a fixed value, or dynamically calculated depending on the number of files plus their size).

Note: Once all the files are received, the sender is not needed anymore

3.- Run fsync for all the files (this will take time, but the sender can be gone by now; we don’t need it).

Note: if step 3 can be improved with something other than one fsync per file, even better.

4.- Write the required records for all the received files to the DB.

5.- Mark the transaction as done

If, during this process, there is an issue (power outage, process killed, whatever), then when Syncthing is restarted it will detect an unfinished transaction, clean up the garbage (all the files related to the transaction, or the pieces of them, DB records if any, etc.) and re-request the data. A rough sketch of this flow follows below.
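
Here is a rough, self-contained sketch of just the journaling and recovery part of this proposal, assuming steps 2–4 work like the batched write/fsync/commit sketch earlier in the thread; the journal file name and helper names are made up for illustration and have nothing to do with Syncthing’s internals:

```go
package txnsketch

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// batchJournal marks a batch as "in flight". If it still exists at startup,
// the listed files are treated as untrusted, removed, and re-requested.
type batchJournal struct {
	Files []string `json:"files"`
}

func journalPath(dir string) string { return filepath.Join(dir, ".batch-journal") }

// beginBatch is step 1: record the transaction and its file list before any
// file data is written. (A real implementation would also fsync the journal.)
func beginBatch(dir string, files []string) error {
	data, err := json.Marshal(batchJournal{Files: files})
	if err != nil {
		return err
	}
	return os.WriteFile(journalPath(dir), data, 0o644)
}

// finishBatch is step 5: mark the transaction as done once the files have
// been written, fsynced, and recorded in the DB (steps 2-4).
func finishBatch(dir string) error {
	return os.Remove(journalPath(dir))
}

// recoverUnfinished runs at startup and returns the files of a batch that
// never completed, so the caller can delete the leftovers and re-request them.
func recoverUnfinished(dir string) ([]string, error) {
	data, err := os.ReadFile(journalPath(dir))
	if os.IsNotExist(err) {
		return nil, nil // no unfinished transaction
	}
	if err != nil {
		return nil, err
	}
	var j batchJournal
	if err := json.Unmarshal(data, &j); err != nil {
		return nil, err
	}
	return j.Files, nil
}
```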

It’s just an idea; if it’s not technically feasible then we can set it aside. Edit: However, I really think this is an area where any new ideas about how to improve it are welcome.

Running fsync for all files will take the same amount of time.

Well, that’s the second part of this; as I said, the preferable option would be a single ‘sync’ to disk instead of one per file.

But that can come later, and the bulk transaction mechanism will already be there.

And even assuming the only option is one fsync per file, you still have the advantage that you can receive big sets of data from the sender in less time and keep it occupied for less time while you do the syncing on the receiver side.

I.e., if you calculate the number of files to receive dynamically based on file size, you can receive a batch of tens of thousands of small files in a moment.

I just ran a test in my setup with disableFsync and 1k small files (1 KB each). I received all of them in 2 seconds, and then the manual sync (from the command line) took almost nothing, not even a second.

If I do it the normal way, it takes at least 100 times longer (I didn’t measure the exact time, but I can do it if you want).
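
For anyone who wants to reproduce a similar comparison outside of Syncthing, here is a standalone Go sketch in the same spirit (Linux-only because of the final unix.Sync call; the exact numbers will obviously depend on the disk and filesystem):

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"time"

	"golang.org/x/sys/unix"
)

// writeFiles writes n files of 1 KiB each, optionally fsyncing each one.
func writeFiles(dir string, n int, perFileFsync bool) error {
	data := bytes.Repeat([]byte("x"), 1024)
	for i := 0; i < n; i++ {
		f, err := os.Create(filepath.Join(dir, fmt.Sprintf("file-%04d", i)))
		if err != nil {
			return err
		}
		if _, err := f.Write(data); err != nil {
			f.Close()
			return err
		}
		if perFileFsync {
			if err := f.Sync(); err != nil {
				f.Close()
				return err
			}
		}
		if err := f.Close(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	for _, perFile := range []bool{true, false} {
		dir, err := os.MkdirTemp("", "fsync-test")
		if err != nil {
			panic(err)
		}
		start := time.Now()
		if err := writeFiles(dir, 1000, perFile); err != nil {
			panic(err)
		}
		if !perFile {
			unix.Sync() // one global flush instead of 1000 per-file fsyncs
		}
		fmt.Printf("perFileFsync=%v took %v\n", perFile, time.Since(start))
		os.RemoveAll(dir)
	}
}
```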

The bulk transaction part is already there. The bulk sync is not POSIX, does not exist on some platforms, and does not work on some filesystems.

Is it there in the latest version? How can I give it a try?