Can adjusting these advanced folder options potentially reduce fragmentation?

Hello, I have read a few threads where people say Syncthing’s files are slow to access, presumably because they were fragmented. I now have one data point confirming the above, but it is purely anecdotal and not very objective. Wikipedia defines three types of fragmentation (free space fragmentation being outside the domain of Syncthing):

  1. File fragmentation: a single file not stored contiguously on disk.
  2. Free space fragmentation: not Syncthing’s concern; listed only to keep Wikipedia’s numbering.
  3. File scattering: related files not being grouped together on disk.

I have identified the following advanced folder options as possibly conducive to reducing fragmentation based on discussions in several threads. Please let me know if I’ve done my homework correctly.


blockPullOrder

tomasz86 suggests using “inOrder” here: Syncthing and defragging, also missing files - #12 by tomasz86. This has the potential to affect “file [level] fragmentation”, the first of the three types of fragmentation.

blockPullOrder: My understanding

blockPullOrder controls how a folder distributes portions of a file to connected peers (or I am misinterpreting it, and it decides how this folder fetches files from peers). Regardless…

Default

A new file exists locally, and one or more (N) peers wish to download it. The seeder carves the file into N chunks and gives each peer a different piece to start from. Everyone pulls their piece, then randomly pulls the remaining pieces from each other. Fragmentation (for peers): scales with the number of peers; the more peers, the more their files are broken up into chunks of roughly 1/N of the file. With only one peer there should be no fragmentation.

Random

Like the above, but instead of carving the file up into N chunks based on the number of peers, blocks are pulled in random order. Fragmentation (for peers): maximum; every file is assembled from blocks in random order.

inOrder

The file is distributed sequentially to peers. Fragmentation (for peers): minimum; everyone pulls the file sequentially, block by block.
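To make the three modes concrete, here is a minimal Go sketch of my reading above. All names are my own invention and this is not Syncthing’s actual code; the “standard” case only approximates the default behaviour described earlier.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pullOrder returns the order in which block indices 0..total-1 would be
// requested. Sketch only, not Syncthing's real logic: "standard"
// approximates the default (each of n peers starts at a different 1/n
// offset and wraps around), "random" shuffles, "inOrder" is sequential.
func pullOrder(mode string, total, peerIdx, peers int) []int {
	order := make([]int, total)
	for i := range order {
		order[i] = i
	}
	switch mode {
	case "random":
		rand.Shuffle(total, func(i, j int) { order[i], order[j] = order[j], order[i] })
	case "standard":
		// Start at this peer's share of the file and wrap around, so
		// peers initially pull disjoint regions they can then trade.
		start := total * peerIdx / peers
		order = append(order[start:], order[:start]...)
	case "inOrder":
		// Leave sequential: blocks land contiguously on disk.
	}
	return order
}

func main() {
	for _, mode := range []string{"inOrder", "standard", "random"} {
		fmt.Println(mode, pullOrder(mode, 8, 1, 4))
	}
}
```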

blockPullOrder: Conclusion

inOrder will cause the least fragmentation. But people who want a file pushed out as fast as possible, and/or whose connectivity is unreliable, may find that inOrder slows syncing down, since there are fewer opportunities for peers to share unique blocks with each other. There may also be extra disk strain on the seeder for the same reason, since it cannot offload disk seeks to other peers.


copiers and pullers (hashers not relevant)

The documentation says these are advanced options and states not to touch them. copiers also seems relevant w.r.t. file-level fragmentation.

Tyrindor uses copiers=1 in “sent through Syncthing have very slow reads from disks?”. That advice quotes calmh, who said to set copiers=1 in response to “Slow sync with large files” (not related to fragmentation).

copiers and pullers: My understanding

Copiers

Copiers copy data blocks from one file to another (e.g. “reused” files in Syncthing terminology). Presumably the more copiers working simultaneously, the greater the chance of file fragmentation. The default is 0, which calmh explains is equivalent to 2, although earlier (in 2015) calmh stated the default was 1 and that there was no reason to have more than one.

Pullers

Pullers request missing blocks from the network. The default is 16, as explained by calmh. I don’t fully understand pullers, but they seem to be a network-related optimization rather than something that decides how files are laid out on disk, so it’s best to leave this untouched.
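For intuition, copiers and pullers can be pictured as the sizes of two worker pools draining a shared job queue. The following Go sketch is my own mental model, not Syncthing’s internal code; every name in it is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

type blockJob struct {
	file  string
	index int
}

// worker drains jobs from a shared queue. In this mental model, "copiers"
// copy reusable blocks between local files and "pullers" fetch missing
// blocks from the network; the option value is simply the worker count.
func worker(kind string, id int, jobs <-chan blockJob, wg *sync.WaitGroup) {
	defer wg.Done()
	for j := range jobs {
		// With more than one worker, blocks of different files interleave
		// here, which is where concurrent writes (and thus potential
		// fragmentation) would come from.
		fmt.Printf("%s %d handles %s block %d\n", kind, id, j.file, j.index)
	}
}

func main() {
	copiers := 1 // copiers=1: only one file is being copied into at a time
	jobs := make(chan blockJob, 8)
	var wg sync.WaitGroup
	for i := 0; i < copiers; i++ {
		wg.Add(1)
		go worker("copier", i, jobs, &wg)
	}
	for _, j := range []blockJob{{"a.bin", 0}, {"b.bin", 0}, {"a.bin", 1}} {
		jobs <- j
	}
	close(jobs)
	wg.Wait()
}
```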

copiers and pullers: Conclusion

copiers=1 seems reasonable for reducing file fragmentation, but tuning pullers is above my pay grade; the defaults are probably there for a reason.


maxFolderConcurrency

Controls how many folders may be syncing or scanning concurrently, defaulting to the number of CPUs in the system.

maxFolderConcurrency: My understanding

If multiple folders are syncing simultaneously, multiple files could be written at the same time. Presumably setting this to a lower number could reduce fragmentation, with 1 being the extreme, but that would severely hurt the Syncthing experience if only one folder were allowed to sync while all others waited on it. One would also presume that filesystems are smart enough not to weave two files together bit by bit. With what calmh says here in mind, this seems to delve into deeper OS/filesystem-level territory such as write caching. It seems unwise to change this without a deeper understanding of how the OS, the filesystem, and the storage subsystem interact; since I don’t have that knowledge, the defaults are probably OK.
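The mechanism itself is easy to picture as a counting semaphore around each folder’s sync/scan pass. A minimal Go sketch of that general pattern, assuming this reading of the option; it is not Syncthing’s actual implementation:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// maxFolderConcurrency defaults to the number of CPUs; model it as a
	// buffered channel used as a counting semaphore.
	maxFolderConcurrency := runtime.NumCPU()
	sem := make(chan struct{}, maxFolderConcurrency)

	folders := []string{"photos", "docs", "music", "backup"}
	var wg sync.WaitGroup
	for _, f := range folders {
		wg.Add(1)
		go func(folder string) {
			defer wg.Done()
			sem <- struct{}{}        // block until a slot is free
			defer func() { <-sem }() // release the slot when done
			fmt.Println("syncing", folder)
		}(f)
	}
	wg.Wait()
}
```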

maxFolderConcurrency: Conclusion

Keep the defaults


order

The order in which needed files should be pulled from the cluster (random, alphabetic, smallestFirst, largestFirst, oldestFirst, newestFirst). This seems most related to the third type of fragmentation, file scattering.

This seems pretty self-explanatory: when pulling files it needs, Syncthing sorts what it pulls based on the above choices. Caveat: as stated in the docs, this applies only to files Syncthing has already discovered. E.g. selecting smallestFirst may pull a 1 GB file ahead of a 2 GB file, and only later does it turn out that the rest of the folder is entirely 1 MB files that simply hadn’t been picked up yet.
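As a sketch of what each choice amounts to, here is how one might sort a queue of already-discovered files in Go. The structure and names are mine, not Syncthing’s queue code:

```go
package main

import (
	"fmt"
	"sort"
)

type needed struct {
	name    string
	size    int64
	modTime int64 // unix seconds
}

// sortQueue orders already-discovered files per the folder's "order"
// option. Sketch only; "random" would shuffle rather than sort.
func sortQueue(order string, q []needed) {
	switch order {
	case "alphabetic":
		sort.Slice(q, func(i, j int) bool { return q[i].name < q[j].name })
	case "smallestFirst":
		sort.Slice(q, func(i, j int) bool { return q[i].size < q[j].size })
	case "largestFirst":
		sort.Slice(q, func(i, j int) bool { return q[i].size > q[j].size })
	case "oldestFirst":
		sort.Slice(q, func(i, j int) bool { return q[i].modTime < q[j].modTime })
	case "newestFirst":
		sort.Slice(q, func(i, j int) bool { return q[i].modTime > q[j].modTime })
	}
}

func main() {
	q := []needed{{"b.iso", 2 << 30, 100}, {"a.txt", 1 << 20, 300}, {"c.jpg", 4 << 20, 200}}
	sortQueue("smallestFirst", q)
	fmt.Println(q) // a.txt, then c.jpg, then b.iso
}
```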

order: My understanding

The above caveat reduces the utility of adjusting this, but setting that aside: if file fragmentation is the hardest on disks, file scattering should rank second. It seems almost impossible to tune for, though, without taking into account the myriad ways the user or the application accesses files.

Are small files frequently accessed together? (smallestFirst/largestFirst). Are files accessed alphabetically? (alphabetic). Are older files left untouched while newer files are frequently updated? (oldestFirst to reduce free space fragmentation, or newestFirst to short-stroke newer files onto an area of higher linear velocity on the disk).

Regardless of Syncthing, it seems inevitable that, given enough time, files in a folder may not be near each other on disk, and there’s no way to control the correlation between the folder tree and location on disk. There doesn’t seem to be an ideal way to lay out files on the drive at all.

order: Conclusion

Keep the default of random


Final conclusion

  • blockPullOrder = inOrder, reduces file-level fragmentation, may cause disk strain on the sending device, may increase time to reach in-sync status. Most useful option to tune.

  • copiers = 1, may reduce fragmentation, possibly at the cost of slower syncing

  • pullers = don’t touch

  • maxFolderConcurrency = lowering it could reduce fragmentation, but it’s best to let the OS/filesystem handle this

  • order = random (default) – has the potential to affect how files are scattered, but usage patterns are impossible to predict

So pullers controls the number of outstanding network requests (there is also a “number of bytes outstanding” limiter). But in theory you could request 16 different blocks, they arrive out of order, and we write them out, out of order.

I am not sure that would cause issues (or that any of the modes would in general, for that matter), because we pre-truncate the file to its final size on start, so the filesystem knows how much contiguous space we will need. How/whether it passes that down to the driver is a separate question.
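The pre-truncate-then-write-at-offset technique described above can be sketched in Go like this. This is a generic illustration, not Syncthing’s code; the 128 KiB figure is Syncthing’s minimum block size.

```go
package main

import (
	"log"
	"os"
)

func main() {
	const blockSize = 128 << 10 // 128 KiB, Syncthing's minimum block size
	f, err := os.Create("example.tmp")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Pre-truncate to the final size so the filesystem knows how much
	// contiguous space the file will need; how it uses that hint is up
	// to the OS and driver.
	if err := f.Truncate(4 * blockSize); err != nil {
		log.Fatal(err)
	}

	// Blocks may arrive out of order; WriteAt places each one at its
	// final offset regardless of arrival order.
	for _, idx := range []int64{2, 0, 3, 1} {
		block := make([]byte, blockSize)
		if _, err := f.WriteAt(block, idx*blockSize); err != nil {
			log.Fatal(err)
		}
	}
}
```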

Copiers effectively controls how many files we might be working on in parallel, including locally copying blocks between two local files, potentially causing seeking, but we expect that part to be quick.

This is to avoid head-of-line blocking on one slow source: if a slow device is blocking the transfer of the first file, then for the second file we see a lot of outstanding requests against that device and send our requests elsewhere, at least making progress on the second file.
