Versioning & Deduplication

Any thoughts on setting up versioning with deduplication?

I sometimes deal with huge media files, where I edit metadata or make other small changes. I am thinking about running ZFS on my server with ZFS deduplication enabled. (No experience with ZFS at all; it would be a first.)


I would suggest starting by setting copyRangeMethod on the folder. ZFS dedup is quite memory intensive, and a one-way street. Look carefully before enabling it.
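For reference, a sketch of how one might set it, assuming the REST config endpoint and the copyRangeMethod field name of current Syncthing versions (the same option is also reachable via the GUI under the folder's advanced settings); the folder ID and API key below are placeholders:

# hypothetical folder ID "drafts" and API key; adjust to your setup
curl -X PATCH -H "X-API-Key: abc123" \
    -H "Content-Type: application/json" \
    -d '{"copyRangeMethod": "copy_file_range"}' \
    http://localhost:8384/rest/config/folders/drafts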

If you don't need versioning on every host, you might also look into using ZFS snapshots. This might be lighter on your storage space without the need to activate deduplication, because only new blocks are written.
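For illustration, the basic snapshot workflow looks roughly like this (pool and dataset names are made up):

zfs snapshot tank/syncthing@before-edit          # cheap and instant; only new blocks cost space
zfs list -t snapshot tank/syncthing              # see which snapshots exist
cp /tank/syncthing/.zfs/snapshot/before-edit/some/file ./restored-file   # pull a single file back
zfs rollback tank/syncthing@before-edit          # or roll the whole dataset back (destroys data written after the snapshot!)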


Interesting. As I understand it, this is a way to increase performance. copy_file_range looks like an obvious choice.

We are lucky to have one of the ZFS gurus in our hackerspace, and I asked him yesterday. He echoes your concern that deduplication is extremely heavy on RAM (even though I have 64 GiB in my server). He suggested using compression instead, then highlighted various other ZFS features such as background scrubbing (of course).
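For what it's worth, both of those are one-liners; the dataset and pool names below are placeholders:

zfs set compression=lz4 tank/syncthing   # transparent, cheap per-dataset compression
zfs get compressratio tank/syncthing     # check how much it actually saves
zpool scrub tank                         # verify all checksums in the background
zpool status tank                        # watch scrub progress and results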

Thanks, that’s an interesting suggestion! The disadvantage is that it won’t track every little change, unless there is an option to somehow run snapshots continuously.

I also have an automated Restic backup set up on my server, which runs once per day. So I have some versioning from that, but only at daily granularity.
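Something along these lines, with placeholder repository and source paths, is roughly what such a daily job boils down to:

restic -r /srv/restic-repo backup /home/felix/Sync        # hypothetical repo and source path
restic -r /srv/restic-repo snapshots                      # list the daily snapshots
restic -r /srv/restic-repo restore latest --target /tmp/restore --include /home/felix/Sync/Drafts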

Example use case when versioning: I edit a document and accidentally delete a section, and there is no undo option. I then just pull the version saved a few minutes before.

Other use case: I accidentally delete a sub directory, and I need to get that back.

Currently, I have all my data inside Dropbox, and the versioning feature gives me peace of mind.

It does increase performance, but in this case it primarily does so by making CoW “copies” of the data, i.e. essentially the snapshot mechanism in ZFS. So it is more or less deduplication for the data being versioned by Syncthing.


Oh, I see. CoW only copies once the data is modified. I presume this also works on EXT4, as the documentation says it has been tested with that. (I have EXT4 currently on my Syncthing SSD.)

Not continuously, but very often, as in multiple times per minute. There are tools to automatically create and rotate snapshots, like zfs-auto-snapshot or zfs_autobackup. The caveat is that this only takes effect once the sync to the ZFS host has completed.
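As an illustration, zfs-auto-snapshot is driven by a ZFS property plus cron entries roughly like these (label, interval and retention are arbitrary here, and the exact flags may differ slightly between versions and distributions):

zfs set com.sun:auto-snapshot=true tank/syncthing                   # opt the dataset in
# /etc/cron.d entry: a "frequent" snapshot every 5 minutes, keeping the last 12
*/5 * * * * root zfs-auto-snapshot --quiet --syslog --label=frequent --keep=12 //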

Guess I’ll ask our local ZFS guru about it. I haven’t decided yet on which method to use for providing a path to undo accidental changes. It’s either Syncthing’s versioning or ZFS snapshots.

I just did a quick test by renaming a directory Zephyr to Zephyr2:

[felix@linux Drafts]$ find Zephyr2/
Zephyr2/
Zephyr2/draft.zep
Zephyr2/draft_files
Zephyr2/draft_files/1674702085844Mesh 1.3db
Zephyr2/draft_files/1674702085844Dense point cloud 1.3db
Zephyr2/draft_files/1674702085844Textured mesh 1.3db
[felix@linux Drafts]$ find .stversions/Zephyr/
.stversions/Zephyr/
.stversions/Zephyr/draft~20240801-044023.zep
.stversions/Zephyr/draft_files
.stversions/Zephyr/draft_files/1674702085844Dense point cloud 1~20240801-044023.3db
.stversions/Zephyr/draft_files/1674702085844Textured mesh 1~20240801-044023.3db
.stversions/Zephyr/draft_files/1674702085844Mesh 1~20240801-044023.3db

What I notice:

  • The directory with the old name gets backed up into .stversions. If the directory is big, and if there is no form of deduplication, then this can be a real issue. Copy-on-write is kind of mandatory to avoid filling up storage quickly.

  • Directory names stay the same in .stversions, but file names get a date-time tag (e.g. ~20240801-044023) inserted before the extension. Restoring a directory looks painful; one would need to rename all the files inside (a rough sketch of such a rename pass follows below).
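A rough sketch of such a rename pass, assuming the naming scheme seen above (base name, ~YYYYMMDD-HHMMSS tag, then the extension); the target directory name is arbitrary:

cp -r .stversions/Zephyr Zephyr-restored       # copy the versioned tree somewhere safe
find Zephyr-restored -type f -name '*~[0-9]*' | while read -r f; do
    mv -- "$f" "$(printf '%s\n' "$f" | sed 's/~[0-9]\{8\}-[0-9]\{6\}//')"   # strip the version tag
done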

IMO Syncthing’s versioning is a “poor man’s backup”, and your observations are examples of why I think so. I would provide snapshots instead, or look into what different real backup solutions, such as Restic, offer.


I perceive it as an undo facility. On my Dropbox I regularly use that. If I mess up some file or directory, I can quickly go back to the last version.

If they can be created every few minutes without much overhead, then that may be an option.

See what I wrote above:

A backup is much more heavyweight, not the same as an undo facility.


Since you already back up automatically with Restic, and you’re using Linux, have you considered NILFS? https://nilfs.sourceforge.io/

If it’s not already installed, you’ll need the nilfs-utils package for the management tools.
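For a rough idea of what that looks like: NILFS2 logs continuous checkpoints as you write, and nilfs-utils lets you freeze one into a snapshot and mount it read-only. The device, mount points and checkpoint number below are just placeholders:

mkfs -t nilfs2 /dev/sdb1                          # format a spare partition (example device)
mount -t nilfs2 /dev/sdb1 /srv/sync               # mount it for the Syncthing folder
lscp                                              # list the checkpoints created automatically
chcp ss 42                                        # promote checkpoint no. 42 to a persistent snapshot
mount -t nilfs2 -r -o cp=42 /dev/sdb1 /mnt/undo   # mount that snapshot read-only to pull files from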

I am considering it now, thank you! 🙂

ZFS snapshots can be run very frequently, but if you need to later list the snapshots it can be a chore to sort through them to find the snapshot you want to pull from.

Another piece to consider is zrep, to copy ZFS snapshots from one ZFS server to another. zrep can run every minute, and the rolling zrep snapshots can be expired on a daily rotation. It’s no issue to have manually created snapshots and zrep snapshots intermingled.
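Under the hood that is incremental zfs send / zfs receive; a bare-bones version of what zrep automates, with made-up pool and host names, looks like:

zfs snapshot tank/syncthing@rep-0002
zfs send -i tank/syncthing@rep-0001 tank/syncthing@rep-0002 | \
    ssh backuphost zfs receive -F backup/syncthing                  # only the changed blocks travel
zfs list -t snapshot -o name,creation -s creation tank/syncthing    # sort snapshots by age when hunting for one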

I do agree this is a fairly serious issue. There is a lot of discussion around here, in a thread about handling file and folder renames and moves more elegantly.

As I understand it, the current solution is rather rudimentary: it handles renamed files within the same directory, and only if those changes are transmitted at more or less the same time. (So this breaks if you rename a very large file that takes more than 60 seconds (?) to scan, as Syncthing may delete the file under its old name before it discovers that the source has a new file with the same contents under a new name.)

Anyway, you really have to keep in mind that Syncthing handles renames and moves as if the old names were deleted and the new names were created, and the files just happen to have the same contents.

Anyway that’s another topic.
