Syncthing retransmits large files on rename

OK sorry, I didn’t understand at first reading. So it seems it is something like a message ordering problem, right?

Do you mean you’d be able to include an improvement for this in a later version? You’ll probably find this very naive, but in my understanding, you could always know that you have a rename operation at the source (e.g. by doing some kind of fingerprinting of the files, which you probably already do, or even through the filesystem watcher). Then the solution should be to first inform the other peers of this operation, even if you send the data in several blocks afterwards?

Please don’t think this remark is meant to be pretentious - I suppose you have probably already thought many times about all this :wink: I really just would like to understand the process.

Thanks


Everything is a trade off.

When we detect a new file we notify the remote peers of it pretty much immediately, sometimes even before we detect that a file with the same content has disappeared, which means the remote side effectively starts copying the file (which is not a rename, but still not a redownload, as you claim). This is what the watcher tries to optimise for in the worst case, by deliberately delaying notifications for deletes.
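The delete-delaying idea can be sketched very roughly like this. The `Event` type and `reorderBatch` function are hypothetical illustrations, not Syncthing’s actual watcher code, which aggregates events over time rather than per batch:

```go
package main

import "fmt"

// Event is a simplified filesystem notification (hypothetical type,
// not Syncthing's actual watcher API).
type Event struct {
	Path   string
	Delete bool
}

// reorderBatch sketches the "delay deletes" idea: within one batch of
// notifications, creations and modifications are announced first and
// deletes last, so a remote peer learns about the new name of a
// renamed file before it learns that the old name is gone.
func reorderBatch(events []Event) []Event {
	out := make([]Event, 0, len(events))
	for _, ev := range events {
		if !ev.Delete {
			out = append(out, ev)
		}
	}
	for _, ev := range events {
		if ev.Delete {
			out = append(out, ev)
		}
	}
	return out
}

func main() {
	batch := []Event{{"old.iso", true}, {"new.iso", false}}
	for _, ev := range reorderBatch(batch) {
		fmt.Println(ev.Path, "delete:", ev.Delete)
	}
}
```

With the creation announced first, the remote side can copy the data from the still-present old file instead of fetching it over the network.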

We could hold off sending anything until we’ve scanned everything, check whether creations match deletions, and conclude they are renames, but then for large folders you’d have to wait minutes for this to complete before remote peers could start doing any work. Also, you have to store the result of the scan somewhere to do the matching, and we have folders with terabytes of data and millions of files, so you can’t just keep it in memory.

We could have some sort of fingerprint as you say, and if we see a file with the same fingerprint already exists, hijack the scan, check that file first, and if that disappeared conclude it’s a rename.

But then you have to store the fingerprint for every file and check for that fingerprint on every new file during a scan, and for people with a lot of files that will not come for free.
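The fingerprint-matching idea described above could look something like this. Everything here is hypothetical (Syncthing does not keep such an index); a real fingerprint would be computed incrementally per block, not over the whole file in memory:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint returns a content hash. Purely illustrative: hashing a
// whole file in memory would not scale to large files.
func fingerprint(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// index maps fingerprint -> path for files seen in a previous scan
// (hypothetical structure; this is the per-file storage cost the text
// above mentions).
type index map[string]string

// classify decides whether a newly appeared file is a rename of a
// previously known file with identical content: same fingerprint,
// different path, and the old path has disappeared.
func classify(idx index, newPath string, data []byte, stillExists func(string) bool) string {
	fp := fingerprint(data)
	oldPath, ok := idx[fp]
	if ok && oldPath != newPath && !stillExists(oldPath) {
		return "rename of " + oldPath
	}
	return "new file"
}

func main() {
	idx := index{fingerprint([]byte("big payload")): "old.iso"}
	gone := func(path string) bool { return false } // old.iso has disappeared
	fmt.Println(classify(idx, "new.iso", []byte("big payload"), gone))
	// prints: rename of old.iso
}
```

The lookup itself is cheap; the cost is maintaining the index and hashing every new file before it can be announced, which is exactly the trade-off described above.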

Theoretically there could be improvements in a later version, but I am not aware of anyone working on this.

Thank you for your analysis. Just to be clear, I don’t claim there is a redownload every time, but it seems to me that the process lacks a bit of transparency, so it is hard to tell what is really happening with the files.

I understand your point; it is clear to me that sync processes are very complex, and covering all cases while providing a high-quality service is difficult. Syncthing seems to be among the best ones as far as I can see!

It would be great if someone had time to work on this point anyway, as there still seems to be some margin for optimisation: especially with large and/or numerous files, the local copy can become a big drawback, as you need twice the space on your server for it to succeed.

Cheers

I don’t really see a margin of optimisation that wouldn’t cause a sacrifice somewhere else.

We have people who are not happy that the sync process takes a few seconds to start and is not instant, while equally we have people not happy that renames are not always detected, and people not happy that scans take long or use a lot of memory.

We’d just shift from one crowd to another telling us the world could be a better place.

You can’t win in every situation, and I wouldn’t hold your breath on improvements here.

Hi

Is there a mechanism which checks, potentially even when a download has already started, whether the same blocks are already available in other files (potentially even in other files currently being downloaded, potentially in other folders)?

Of course it checks for existing blocks, that’s kind of the “core business” of Syncthing: https://docs.syncthing.net/users/syncing.html. It does so before downloading (because what would be the point afterwards, the block is downloaded anyway).
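The pre-download check the docs page describes can be sketched as a simple partitioning of the needed block hashes. The `blockSource` type and `plan` function are made up for illustration and mirror only the idea, not Syncthing’s actual code:

```go
package main

import "fmt"

// blockSource says where a block with a given hash can be copied from
// locally (hypothetical structure, for illustration only).
type blockSource struct {
	path   string
	offset int64
}

// plan splits the needed block hashes into those that can be copied
// from local files and those that must be requested from remote peers.
func plan(needed []string, local map[string]blockSource) (copyLocal, fetch []string) {
	for _, h := range needed {
		if _, ok := local[h]; ok {
			copyLocal = append(copyLocal, h)
		} else {
			fetch = append(fetch, h)
		}
	}
	return copyLocal, fetch
}

func main() {
	// Two of the three needed blocks already exist in a local file.
	local := map[string]blockSource{
		"h1": {"old.iso", 0},
		"h2": {"old.iso", 131072},
	}
	copyLocal, fetch := plan([]string{"h1", "h2", "h3"}, local)
	fmt.Println("copy locally:", len(copyLocal), "fetch from peers:", len(fetch))
}
```

Only the blocks in `fetch` ever touch the network; everything in `copyLocal` is satisfied from disk, which is what shows up as “reused” in the GUI.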

@cosas: as far as I understood what @AudriusButkevicius explained, it would rather be a copy and not a download. However, the idea of interrupting the copy process and renaming the file instead is really interesting! It seems like it could bring an improvement without having to reorder the messages and thus create new delays, which could indeed be unacceptable.

But I still don’t understand how those messages are triggered. In my very simple setup it is clear, and also 100% reproducible, that the rename is always correctly handled from the Linux server to the Windows machine, and always seen as a copy in the other direction.

Even if @AudriusButkevicius gave me a very good explanation of the general process and how it could end up with a copy rather than a rename (thanks again to him!), I’m actually still wondering why I systematically get this particular behaviour :thinking:

Cheers


That’s almost certainly due to the different underlying filesystem watcher implementations (i.e. the base system, nothing in Syncthing or even Go). There is an active thread showing that Windows notifies on the parent directory, not the changed files themselves. If you want to see what exactly is going on, enable the scanner and model debug facilities.

Thank you for your reply. I will give the debug a try.

Maybe if ST were able to search for blocks in the local folder’s trashcan when it exists, this would improve “Saved by reusing from ~somewhere~” and alleviate the network’s burden. First step: its own trashcan, then the trashes of the other local folders.

This would be a workaround for scenarios where it was not possible to postpone deletion long enough.


That’s been discussed many times, please have a look at existing topics (it has advantages as you point out, but also drawbacks and difficulties) and as far as I see, has nothing to do with what’s discussed in this thread.

You can always add the trashcans as “unshared folders” to get this “feature”.


What about treating large files differently? Say anything over 500 MB (or a configurable threshold) gets a fingerprint, whereas the quickest sync method is used for smaller files, such as documents. The fingerprinting could even be a lower-priority process, or optional?

Just some thoughts…

Some code landed in 1.6.0 which might make this better.

What about the archive? I just had a huge file-rename operation going on, and Syncthing threw all those files into the archive before starting to redownload them. Will it check whether any file in the archive matches the checksum of an added file and move it from there rather than redownloading it?

Unfortunately in my case I deleted the folder in the archive because disk space ran out, suggesting that this is not done…

No, the versions directory is invisible to us.

Even if it’s in the versions directory, we could not “rename it”, because we’d lose the versioned file.

We would end up having to copy it.

You can however achieve this by adding a syncthing folder pointing at the .stversions directory, and not sharing it with anyone. This will end up reusing blocks from this folder due to cross folder block sharing.
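For illustration, the workaround above would look roughly like this in `config.xml`. This is a hand-written, hypothetical fragment: the folder `id`, `label` and `path` are made up, and adding the folder through the GUI is the normal way to do this:

```xml
<!-- Hypothetical fragment: a folder pointing at the versions
     directory, shared with no remote device, so its blocks become
     available locally for cross-folder block reuse. -->
<folder id="stversions-reuse" label="Versions (local only)"
        path="/data/myfolder/.stversions" type="sendreceive">
</folder>
```

Because the folder is shared with no one, nothing is transferred for it; it only makes the versioned data visible to the local block index.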


oof, this is a basic feature I would expect out of the box :confused:

also, renaming does make sense here: the file would not have been archived in the first place if the rename had been detected

Renames within a folder are optimised, but if you have versioning enabled a rename needs to become a copy, because we now need the file in two places.

Renames between folders are not optimised because folders are separate entities with their own schedules, policies and (often) file systems.
