How is the file tree stored inside the database?

bxff · November 14, 2022, 6:30pm

I am trying to create something somewhat similar to Syncthing, primarily I am struggling to understand how can I store file directories in a database efficiently.

From this article it seems like a flat tree is stored using a relative directory(e.g. ‘photos/etc/something.jpg’): Syncthing — How Scanning Works — Kastelo Inc.

If it is stored in such a manor how is a change in the root directory reflected onto the database? For example if I rename the photos directory with a lot of sub directories, then all of the sub directories path would need to change. I’d reckon this is not how things work, but then how do things work?

Thanks.

gadget · November 14, 2022, 7:16pm

For Syncthing’s internal database schema, the best place (and most definitive) resources are the documentation and source code:

With regards to the general question of storing directory path names efficiently in a database, a lot depends on the type of database being used and how compact it needs to be.

For example, a traditional RDBMS with rectangular tables consisting of rows and columns where the goal is for the database to be fully normalized, you’d store each parent directory only once with a unique immutable ID (with the directory name just a changeable value) and each child directory referencing its parent by its ID rather than by name. Each child directory’s name is also simply a changeable value.

Some modern file systems use a B-tree for their directory structure:

calmh · November 14, 2022, 8:16pm

That is indeed how things work.

bxff · November 15, 2022, 3:02pm

Thanks for the information! The example you described seems very similar to parental reference technique recommended by MongoDB: Model Tree Structures with Parent References — MongoDB Manual

B-tree is seems something very interesting to look into.

bxff · November 15, 2022, 3:07pm

Thanks for the quick reply, does this scales well for nested sub directories? Wouldn’t this would very expensive if the root have millions of nested sub directories(e.g. if I synced my node_modules).

imsodin · November 15, 2022, 4:28pm

That’s not a concern. Easy answer: Because no matter how inefficient, doing such large changes will be much more costly still at the protocol/application layer. And secondly because this isn’t that inefficient of a db operation for a leveldb (and I think most kv DBs work similarly at the lowest level): Delete (well mark deleted) a contiguous section of keys, and add a new one.

calmh · November 15, 2022, 5:49pm

Expanding more on that, we don’t know it’s a directory rename until we’ve scanned and hashed all the contents of the new directory and compared it with the old. At that point the cost of changing records in the database is moot.