I’ve checked a lot of the posts with the exact same title, but I can’t seem to understand (or fix) why I’m seeing slow transfer speeds between two machines on a LAN.
This is what the UI shows for the two devices that are set up to sync:
To try to speed things up, I stopped syncthing and did an rsync over ssh of the directory between machines. I noticed that it was ~6x faster (comparing what syncthing had done in an hour versus rsync/ssh). I honestly don’t know what accounts for the difference, but rsync must be taking a different approach.
Are you sending files in parallel, or one at a time?
What about batching (tar up a bunch, leave the other side to unpack and fsync each file, and continue without waiting)?
Anyway, I don’t want to be a PITA. Just a few ideas.
I explained what the difference is.
rsync does not fsync files, we do.
rsync can get away without fsync, because it’s unidirectional.
Imagine your device is downloading a file. The download finishes and the file is in its final place, but not fsynced. Now your device crashes before the file is flushed from the disk buffer, and the file ends up corrupted.
Your machine restarts, syncthing comes up, does a general folder scan on startup, notices that the file is a different size than it expected, assumes it was modified, scans it, and spreads the corrupted version to every other device.
Tarring and whatnot doesn’t help: you still need to fsync each individual file after untarring, which will still be the bottleneck before we can claim “we’re done”. So you will have a spike in throughput, and then the folder will sit “syncing” with zero throughput as it fsyncs everything.
For the initial sync, doing it with rsync or something along those lines probably makes sense (or disabling fsync if there are tons of small files for the initial sync), but it’s advisable to keep it on afterwards, unless you can tolerate corruption.
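To illustrate the point, here is a rough Go sketch of the general pattern, not syncthing’s actual code; `finishDownload` and the paths are made up. The important part is that the fsync happens before the file is moved into place and recorded as done:

```go
package main

import (
	"os"
	"path/filepath"
)

// Sketch of a crash-safe "download complete" path: write to a temp file,
// fsync it, rename it into place, and only then record the file as synced.
// Without the fsync, a crash after the rename can leave a truncated or
// corrupted file that later gets treated as a legitimate local change.
func finishDownload(dir, name string, data []byte) error {
	tmp := filepath.Join(dir, "."+name+".tmp")

	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	// Force the data to stable storage *before* claiming the file is done.
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// Move into the final location, then record completion (database write,
	// index announcement, ...) -- that last step is only a placeholder here.
	return os.Rename(tmp, filepath.Join(dir, name))
}

func main() {
	if err := finishDownload(os.TempDir(), "example.bin", []byte("hello")); err != nil {
		panic(err)
	}
}
```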
This has probably been discussed before, but I presume that over time the caches will flush and delayed calls to fsync should return immediately? Is that how it works? Has anyone tested whether calling fsync 1000 times, each immediately after one of 1000 files completes, takes the same amount of time as calling fsync 1000 times after all files have completed and been closed? I.e. if some of the files are already flushed while others are still transferring, then those fsync calls may return faster.
If so, is it possible to batch fsync calls and database updates in the case of many small files being transferred and still maintain the integrity of the folder?
The POSIX fsync() system call operates on file descriptors, not the whole filesystem. That means you cannot call fsync after a file has been closed (there’s no FD anymore). It also means that fsync only affects the file you’re calling it on, not other files in the same directory or on the same filesystem.
If you close a file without calling fsync first, the operating system decides when to flush the write buffers. Sometimes it may do so immediately (blocking the close operation); sometimes it may decide to do so later, speeding up the close but leaving you at risk in case of power loss.
How long an fsync call takes depends on the cache state: if all data for that file is already safely on disk (nothing pending in the OS write cache or the on-device cache), fsync has nothing to do and returns (almost) immediately. If data still has to be written, fsync blocks until that has happened.
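A minimal Go illustration of that ordering, assuming a Unix-like system (the file name is made up):

```go
package main

import "os"

func main() {
	// fsync works on an open descriptor, so it has to happen before Close.
	f, err := os.Create("example.txt")
	if err != nil {
		panic(err)
	}
	if _, err := f.Write([]byte("some data")); err != nil {
		panic(err)
	}
	if err := f.Sync(); err != nil { // fsync(fd): flushes only this file
		panic(err)
	}
	if err := f.Close(); err != nil { // Close alone gives no durability guarantee
		panic(err)
	}
	// After Close there is no descriptor left to fsync; you would have to
	// reopen the file to get one.
}
```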
It is technically possible to do some sort of “batched fsync”, but that requires some rethinking. You can flush entire filesystems (that’s called a sync), writing out all pending data. You could technically download a whole bunch of files, hold off on the database writes, sync all of them at once using sync(), and commit the DB writes afterwards. Potential problems are filesystem quirks, file batches spanning multiple filesystems, and others. It would likely require some redesign of how syncthing commits files.
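For what it’s worth, a batched version could look roughly like the sketch below (Go, Linux-only since it uses syncfs(2) via golang.org/x/sys/unix; `writeBatchDurably` and `commitToDB` are made-up placeholders, and this is not how syncthing works today):

```go
package main

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// commitToDB stands in for recording the finished batch in the database.
func commitToDB(paths []string) error { return nil }

// writeBatchDurably writes a batch of files without per-file fsync, then
// flushes the whole filesystem once with syncfs(2), and only afterwards
// commits the database entries. Assumes all files live on one filesystem.
func writeBatchDurably(root string, files map[string][]byte) error {
	paths := make([]string, 0, len(files))
	for name, data := range files {
		p := filepath.Join(root, name)
		if err := os.WriteFile(p, data, 0o644); err != nil {
			return err
		}
		paths = append(paths, p)
	}

	// One syncfs for the whole batch instead of one fsync per file.
	dir, err := os.Open(root)
	if err != nil {
		return err
	}
	defer dir.Close()
	if err := unix.Syncfs(int(dir.Fd())); err != nil {
		return err
	}

	return commitToDB(paths)
}

func main() {
	if err := writeBatchDurably(os.TempDir(), map[string][]byte{
		"a.txt": []byte("a"),
		"b.txt": []byte("b"),
	}); err != nil {
		panic(err)
	}
}
```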
From tests that I did a long, long while ago, I recall that each fsync took the same amount of time regardless of how frequently it was called, and the amount of time was something silly like 2 seconds per call.
I believe you, but that’s absolutely ridiculous, especially if it’s called 20 or 30 seconds later, when the operating system should have had plenty of time to flush the cache and there’s nothing else to write to disk.
Feel free to repeat the test. It was a long while ago. I also suspect it’s filesystem dependent; I think I was testing on ext4.
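If anyone wants to repeat it, something along these lines should do (Go; the file count and size are arbitrary, and the results will vary a lot with filesystem and hardware):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

const numFiles = 1000

var payload = make([]byte, 16*1024) // 16 KiB per file, arbitrary

func must[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}

// fsyncEach writes each file and fsyncs it immediately after writing.
func fsyncEach(dir string) time.Duration {
	start := time.Now()
	for i := 0; i < numFiles; i++ {
		f := must(os.Create(filepath.Join(dir, fmt.Sprintf("a%d", i))))
		must(f.Write(payload))
		f.Sync()
		f.Close()
	}
	return time.Since(start)
}

// fsyncAtEnd writes and closes all files first, then reopens and fsyncs
// each one at the end (a descriptor is needed again, since fsync works on FDs).
func fsyncAtEnd(dir string) time.Duration {
	start := time.Now()
	for i := 0; i < numFiles; i++ {
		f := must(os.Create(filepath.Join(dir, fmt.Sprintf("b%d", i))))
		must(f.Write(payload))
		f.Close()
	}
	for i := 0; i < numFiles; i++ {
		f := must(os.Open(filepath.Join(dir, fmt.Sprintf("b%d", i))))
		f.Sync()
		f.Close()
	}
	return time.Since(start)
}

func main() {
	dir := must(os.MkdirTemp("", "fsync-bench"))
	defer os.RemoveAll(dir)
	fmt.Println("fsync per file:   ", fsyncEach(dir))
	fmt.Println("fsync at the end: ", fsyncAtEnd(dir))
}
```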
Whole-drive sync would sort of work, as would fsyncing ranges (I haven’t checked what the performance of that looks like), but some of these are not portable between platforms, so it’s not trivial to just switch over.
We also fsync the directories (so the presence of the file in the directory is synced). There is some batching for that, but each file might trigger two fsyncs if adjacent files are not in the same directory.
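For completeness, fsyncing a directory on a Unix-like system looks something like this (Go sketch; `syncDir` and the file names are made up). It makes the directory entry itself durable, which is separate from fsyncing the file’s contents:

```go
package main

import (
	"os"
	"path/filepath"
)

// syncDir fsyncs a directory so that recently created or renamed entries
// in it (i.e. the *presence* of new files) survive a crash, not just the
// file contents.
func syncDir(path string) error {
	d, err := os.Open(path)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

func main() {
	dir := os.TempDir()
	name := filepath.Join(dir, "new-file.txt")
	if err := os.WriteFile(name, []byte("data"), 0o644); err != nil {
		panic(err)
	}
	// In a real sync path you would fsync the file first; here we only
	// show the directory part.
	if err := syncDir(dir); err != nil {
		panic(err)
	}
}
```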