tl;dr - the proposed change:
Drop the lists, maps and queues for files and deletions. Instead, iterate the needed files three times (so the number of iterations/passes doesn’t change compared to now), as sketched below:
1. Create all directories and take some shortcuts (symlinks, …).
2. Deal with the files (copying/pulling).
3. Delete everything.
The queue remains just to track files in progress and files that the user pushes to the front.
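
To make the tl;dr concrete, here’s a rough sketch of what the control flow could look like with three passes instead of pre-built lists. All names (`needFile`, `withNeed`, `pull`, …) are made up for illustration - this is not the actual puller code, just the shape of the idea:

```go
package main

import "fmt"

// needFile is a stand-in for a needed-file entry coming out of the database.
type needFile struct {
	Name      string
	IsDir     bool
	IsSymlink bool
	IsDeleted bool
}

// withNeed stands in for one iteration/pass over the "needed" set in the DB.
func withNeed(need []needFile, fn func(needFile)) {
	for _, f := range need {
		fn(f)
	}
}

func pull(need []needFile) {
	// Pass 1: directories and cheap shortcuts (symlinks, ...).
	withNeed(need, func(f needFile) {
		if !f.IsDeleted && (f.IsDir || f.IsSymlink) {
			fmt.Println("create:", f.Name)
		}
	})

	// Pass 2: regular files (copying/pulling); only these would go through
	// the remaining in-progress queue.
	withNeed(need, func(f needFile) {
		if !f.IsDeleted && !f.IsDir && !f.IsSymlink {
			fmt.Println("pull:", f.Name)
		}
	})

	// Pass 3: deletions.
	withNeed(need, func(f needFile) {
		if f.IsDeleted {
			fmt.Println("delete:", f.Name)
		}
	})
}

func main() {
	pull([]needFile{
		{Name: "docs", IsDir: true},
		{Name: "docs/readme.txt"},
		{Name: "old.tmp", IsDeleted: true},
	})
}
```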
Some more words on why, open questions, …:
I have an almost finished change for this, but as usual the last 10% are still quite a bit of effort, so if I can get conceptual feedback now, that’d be cool (I already have 10% of the last 20%, and those were already somewhat painful xD).
The way we pull has annoyed me forever. We used to get a lot of complaints about high memory usage on first sync. I don’t think there are many of those anymore; maybe low-memory devices just became less common? Either way, it still feels cumbersome and pointless that we store huge lists of work to do in memory while pulling. I presume it’s meant to speed things up by avoiding additional DB iterations, but I have my doubts that was ever a meaningful optimisation - the puller operations themselves aren’t light, involving filesystem calls, so I doubt a DB iteration is the bottleneck here (even if the DB isn’t on very fast storage).
Now of course I am just saying “I don’t see this mattering performance wise”; I don’t have watertight reasoning or benchmarks to prove it. I do think that if we can iterate the need table once per pull operation without a terrible performance impact, iterating it three times can’t be that bad, but that just exemplifies the “rigor” of my arguments. And the outcome depends on a truckload of variables: the speed of the storage the DB resides on, what is being pulled (how much, what kind of changes, …). Clearly, the less work there is to do, the worse the relative impact of my change (DB iteration then being more or less the only thing happening), but in that case there also isn’t much to iterate, so it should still be very fast/light (unless we iterate the entire table rather than just needed files - gotta check the table/index setup).
I think we used to have some benchmarks in the integration tests that might have shown something here, but they’re defunct afaik, and I’ll hardly have time to revive them. Any other ideas how to benchmark?
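
One thing that’s easy to do, for whatever it’s worth: a plain Go micro-benchmark comparing “collect everything into slices once” against “iterate three times” in isolation. That obviously ignores the real DB and filesystem cost, which is the part I’m actually unsure about, so treat it as a lower bound at best. Everything below (`item`, `iterate`, the sizes) is made up for illustration:

```go
package pull

import "testing"

type item struct {
	name    string
	deleted bool
}

// need simulates a needed-files set: 100k entries, 10% of them deletions.
var need = func() []item {
	n := make([]item, 100_000)
	for i := range n {
		n[i] = item{name: "file", deleted: i%10 == 0}
	}
	return n
}()

// iterate stands in for one pass over the need table.
func iterate(fn func(item)) {
	for _, it := range need {
		fn(it)
	}
}

// BenchmarkCollectOnce: one pass that builds the work lists up front
// (roughly the current approach).
func BenchmarkCollectOnce(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var files, dels []item
		iterate(func(it item) {
			if it.deleted {
				dels = append(dels, it)
			} else {
				files = append(files, it)
			}
		})
		_ = len(files) + len(dels)
	}
}

// BenchmarkThreePasses: three passes with no collected lists
// (the proposed approach).
func BenchmarkThreePasses(b *testing.B) {
	for i := 0; i < b.N; i++ {
		count := 0
		iterate(func(it item) {
			if !it.deleted {
				count++
			}
		})
		iterate(func(it item) {
			if !it.deleted {
				count++
			}
		})
		iterate(func(it item) {
			if it.deleted {
				count++
			}
		})
		_ = count
	}
}
```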
If we deem this risky (and I think we will), I could add it in parallel to the existing logic. Then we could first let volunteers test it, then maybe all RC users. Then again, that’s what RCs are for anyway.
And I didn’t want to push a draft PR yet - not mainly because it’s messy (I could live with people seeing my git mess), but because, as always with a large change like this, I find unrelated stuff that looks unexpected/weird, and so far I’ve either opinionatedly or randomly papered over those things. However, they seem impactful or distracting enough that I want to separate them out.