RFC: Use DB instead of in-memory lists in puller

tl;dr - the proposed change:
Drop the lists, maps and queues for files and deletions. Instead iterate the needed files three times (the number of iterations/passes doesn’t change compared to now):

1. Create all directories and take some shortcuts (symlinks, …).
2. Deal with the files (copying/pulling).
3. Delete everything.

The queue remains just to track files in progress and files that the user pushes to the front.
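
To make that concrete, here is a minimal, self-contained Go sketch of the three passes. Everything in it (the needed struct, the iterateNeeded stand-in, the prints) is invented for illustration - it is not the actual puller or database code, just the shape of the proposal:

```go
// Illustrative sketch only - the real Syncthing types and DB layer look
// different. It shows the proposed three-pass structure over the
// needed-files table, with the queue reduced to in-progress / bring-to-front.
package main

import "fmt"

type needed struct {
	name           string
	dir, link, del bool
}

// iterateNeeded stands in for one DB iteration over the needed files,
// in the order the database returns them.
func iterateNeeded(db []needed, fn func(needed)) {
	for _, n := range db {
		fn(n)
	}
}

func pullOnce(db []needed) {
	// Pass 1: directories and shortcuts (symlinks, ...), so parents exist
	// before their children are handled.
	iterateNeeded(db, func(n needed) {
		if n.dir || n.link {
			fmt.Println("create:", n.name)
		}
	})
	// Pass 2: regular files - copy blocks locally where possible, pull the
	// rest. Only these would go through the (now small) queue.
	iterateNeeded(db, func(n needed) {
		if !n.dir && !n.link && !n.del {
			fmt.Println("copy/pull:", n.name)
		}
	})
	// Pass 3: deletions last.
	iterateNeeded(db, func(n needed) {
		if n.del {
			fmt.Println("delete:", n.name)
		}
	})
}

func main() {
	pullOnce([]needed{
		{name: "docs", dir: true},
		{name: "docs/a.txt"},
		{name: "old.txt", del: true},
	})
}
```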

Some more words on why, open questions, …:

I have an almost finished change for this, but as usual the last 10% are still quite a bit of effort, so it would be cool to get conceptual feedback now (I already have 10% of the last 20%, and those were already somewhat painful xD).

The way we pull has annoyed me since, like, forever. We also used to get a lot of complaints about high memory usage on first sync. I don’t think there are many of those anymore - maybe low-memory devices just became less common? Anyway, it still feels cumbersome and pointless that we store huge lists of work to do in memory while pulling. I presume it’s meant to speed things up by avoiding additional DB iterations, but I doubt that was ever a meaningful optimisation - the puller operations themselves aren’t light, involving filesystem calls, so I kinda doubt a DB iteration is the bottleneck here (even if the DB isn’t on very fast storage).

Now of course I am just saying “I don’t see this mattering performance-wise”; I don’t have watertight reasoning or benchmarks to prove it. I do think that if we can iterate the need table once per pull operation without a terrible performance impact, iterating thrice can’t be that bad, but that just exemplifies the “rigor” of my arguments. And the outcome depends on a truckload of variables: speed of the storage where the DB resides, what is being pulled (how much, what kind of changes, …). Clearly the less work there is to do, the worse the relative impact of my change (as DB iteration is then more or less the only thing happening), but then there also isn’t much to iterate, so it should still be very fast/light (unless we iterate the entire table rather than just the needed files - gotta check the table/index setup).

I think we used to have some benchmarks in the integration tests that might have shown something here, but they are defunct afaik, and I’ll hardly have time to revive them. Any other ideas on how to benchmark?
If we deem this risky (I think it is), I could add it in parallel to the existing logic: first let volunteers test it, then maybe all RC users. Then again, that’s what RCs are for anyway.

And I didn’t want to push a draft PR yet - not mainly because it’s messy (I could live with people seeing my git mess :upside_down_face: ), but because, as always with a large change like this, I find unrelated stuff that looks unexpected/weird, and so far I have either opinionatedly or randomly papered over those things. They seem impactful or distracting enough that I want to separate them out.

3 Likes

I’m all for doing this. In fact I had a plan to skip the in-memory part and do it directly from the database as part of the v2 refactor, but that didn’t happen, for reasons.

There’s also a halfway related new thing: folders now sit in “waiting to sync” for a while after an index update to let things settle. It’s quite a short interval, some fraction of a second, but it keeps us from starting pulling loops all the time during an active/initial index transfer.

Anyhow, some complications I remember you’ll need to deal with, in no particular order:

  • My initial idea was to just do it in batches, since the database does the ordering and we can ask for a batch with ... LIMIT 50 etc. (see the sketch after this list). However, we need to handle renames by finding matches, which means iterating all the needed files - that’s fine, we can do that before we start pulling?
  • we provide API methods to see the queue and reorder it. I’m quite interested in getting rid of the latter, but it does fulfil a useful purpose
  • I’d like to instead have a more fine-grained way of ordering pulling, something like ignore patterns but with priorities, so I could say that a certain file extension always gets queued first
  • ideally this should take effect when updates happen in the middle of a pulling iteration…
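
As a sketch of the batching idea from the first bullet: roughly the following, noting that the need table, its folder/name/sequence columns and the plain database/sql usage are assumptions for illustration, not the real v2 schema or DB layer:

```go
// Sketch of pulling in batches straight from the database. The schema
// (a "need" table with folder, name and sequence columns) is invented
// for illustration and does not match the actual Syncthing database.
package puller

import "database/sql"

const batchSize = 50

// nextBatch returns up to batchSize needed file names for a folder,
// ordered by sequence, without keeping the whole list in memory.
func nextBatch(db *sql.DB, folder string) ([]string, error) {
	rows, err := db.Query(
		`SELECT name FROM need WHERE folder = ? ORDER BY sequence LIMIT ?`,
		folder, batchSize)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var names []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return nil, err
		}
		names = append(names, name)
	}
	return names, rows.Err()
}
```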

The prioritisation stuff is a separate feature to be sure, but it kind of falls naturally out of wanting to not have an in-memory queue of things.

1 Like

I was wondering about this “waiting to sync” state, as it also keeps popping up for folders that are unshared… which didn’t make much sense :zany_face:.

2 Likes

I think we could keep the “bring to front” functionality? Just have a table with priorities that you join against as you run your query to get the next batch - a left join with nulls last?

You could also have patterns for priority and insert priorities based on line number in the patterns file.

Also, I’d run the query on every pull with limit 1, which means the queue would refresh nicely.
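
A rough sketch of what such a query could look like - again with a made-up need/priority schema, and assuming the SQLite at hand understands NULLS LAST (recent versions do):

```go
// Sketch of keeping "bring to front" via a priority table: left join it
// against the needed files, order with nulls last, and fetch one item per
// pull with LIMIT 1. Table and column names are invented for illustration.
package puller

import "database/sql"

// nextItem picks the single next file to pull: explicitly prioritised
// entries first (lower prio value wins), everything else in sequence order.
func nextItem(db *sql.DB, folder string) (string, error) {
	row := db.QueryRow(`
		SELECT n.name
		FROM need n
		LEFT JOIN priority p ON p.folder = n.folder AND p.name = n.name
		WHERE n.folder = ?
		ORDER BY p.prio ASC NULLS LAST, n.sequence ASC
		LIMIT 1`, folder)
	var name string
	err := row.Scan(&name)
	return name, err
}
```

Bringing a file to the front would then just mean inserting a row into the priority table (or, for the pattern idea, pre-filling it from the patterns file with the line number as the priority), and because the query runs per item it would take effect on the very next pull.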

1 Like

I think I am doing something a bit different from what you have in mind, Jakob - at least none of that falls out of what I do, or I am just not seeing it. And it does keep the bring-to-front stuff: I just kept the existing queue for that and for in-progress tracking, not for the “all the files to handle” part.

Also, not showing the code just because it’s nowhere close to being ready wasn’t the smartest idea - it feels a bit silly to just talk around it, so here it is in case you are curious. You might want to look at the diff without the first few commits, which just move code around (and thus make the overall diff a mess - as I said, nowhere near ready): draft: use db instead of storing items to be handled in memory while pulling by imsodin · Pull Request #10183 · syncthing/syncthing · GitHub

1 Like