Implementing case insensitivity

calmh · October 31, 2018, 7:12am

So the database layer is now at a point where case insensitivity is implementable. That’s good. The mechanics of the rest isn’t rocket science, I think. But there are a couple of usage points I’m not sure of… So here’s some out loud thinking.

Switching a folder between case sensitive and case insensitive mode

We need to somehow support changing the “mode” of a folder. Currently they’re all case sensitive. The way I envision things in the case insensitive case is that the database keys are in fact case-squashed versions of the file name. This means that converting a folder requires rewriting the database with changed keys. This is quite intrusive and difficult to do while the folder is running. Options:

1. Attempt to implement on the fly conversion. This will probably be a major undertaking and I might fuck it up.
2. Allow changes, but require restart for it to take effect. When starting up we check whether the desired state of the folder matches the database, and if not we rewrite the database before starting the folder. This is simpler to do, although still not trivial.
3. Disallow changes to case insensitivity after folder creation. You must remove the folder and re-add it with new options, which necessarily entails rescanning and stuff. Trivial to implement, but not especially user friendly.

Is option number three enough to get this off the ground? Or do we need to aim for at least number two initially?
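To make the "rewriting the database with changed keys" step concrete, here is a minimal sketch in Go. It is not Syncthing's database code: the index is modelled as a plain map, `foldKey` and `rewriteKeys` are hypothetical names, and `strings.ToLower` stands in for proper Unicode case folding.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// foldKey is a stand-in for real case folding. strings.ToLower is a
// simplification and does not cover full Unicode case folding.
func foldKey(name string) string {
	return strings.ToLower(name)
}

// rewriteKeys converts a case sensitive index (file name -> record) into
// one keyed by folded name. Names that collide after folding are not
// migrated; they are returned separately, sorted, for the caller to handle.
func rewriteKeys(old map[string]string) (map[string]string, map[string][]string) {
	groups := make(map[string][]string)
	for name := range old {
		key := foldKey(name)
		groups[key] = append(groups[key], name)
	}

	folded := make(map[string]string)
	collisions := make(map[string][]string)
	for key, names := range groups {
		if len(names) == 1 {
			folded[key] = old[names[0]]
			continue
		}
		sort.Strings(names)
		collisions[key] = names
	}
	return folded, collisions
}

func main() {
	old := map[string]string{"Readme.md": "a", "foo.txt": "b", "FOO.txt": "c"}
	folded, collisions := rewriteKeys(old)
	fmt.Println(folded)
	fmt.Println(collisions)
}
```

The sketch also shows why the conversion is intrusive: every key may change, and a sensitive-to-insensitive rewrite can surface collisions that did not exist before.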

Communication between devices with differing case settings (for the same folder)

This won’t work very well at all. Assuming we allow it, a setup with one side case sensitive and the other insensitive would result in pretty much all the bugs and strangeness that we have today. Options:

1. Run like today, but with a warning. Nobody reads warnings.
2. Stop the folder when the mismatch is detected. Avoids the issue, is more noticeable.
3. Disconnect the other device when the mismatch is detected. As above, but even more annoying.

I think number two, here? There’s some implementation trickery around being able to stop a folder (and keep it stopped) based on connected device state but it’s probably solvable.

Allowing case sensitive mode on case insensitive systems

We can detect whether a given folder is on a case sensitive or insensitive FS by probing, so let’s assume we know how it should be configured. Should it be allowed to run a folder in case sensitive mode on case insensitive filesystems?
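One way such a probe could look, as a sketch only: create a file with a lowercase name and check whether the uppercased name resolves to the same file. The function name `isCaseSensitive` and the `stprobe-` prefix are made up for illustration; this is not how Syncthing necessarily does it.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isCaseSensitive probes dir by creating a temporary file with a lowercase
// name and checking whether the uppercased name refers to the same file.
// An empty dir means the default temporary directory.
func isCaseSensitive(dir string) (bool, error) {
	f, err := os.CreateTemp(dir, "stprobe-*")
	if err != nil {
		return false, err
	}
	name := f.Name()
	f.Close()
	defer os.Remove(name)

	orig, err := os.Stat(name)
	if err != nil {
		return false, err
	}

	alt := filepath.Join(filepath.Dir(name), strings.ToUpper(filepath.Base(name)))
	other, err := os.Stat(alt)
	if os.IsNotExist(err) {
		return true, nil // uppercase variant not found: case sensitive
	}
	if err != nil {
		return false, err
	}
	// Something exists under the uppercase name. The filesystem is case
	// insensitive only if it is the very same file.
	return !os.SameFile(orig, other), nil
}

func main() {
	sensitive, err := isCaseSensitive("")
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("case sensitive:", sensitive)
}
```

The result only describes the directory that was probed; a folder spanning several mount points could in principle give different answers in different places.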

If yes, doing so will run into all the bugs caused by this that we have today. That’s not awesome.

If no, what do we do about all the existing systems today that do exactly this? We can recommend changing, but doing so automatically would break lots of setups due to differing setup between devices above…

Defaulting to case sensitive on case sensitive systems

If we are on Linux/BSD/etc with a case sensitive FS, should we default to case sensitive or insensitive folders? Case sensitive makes more sense for Linux-to-Linux setups, case insensitive is the only thing that would work with Linux-to-Windows etc.

I’m leaning towards always defaulting to case insensitive regardless.

What to do about case conflicts on case sensitive systems in case insensitive mode

Yeah I don’t know. We need to detect when scanning that there are conflicts and somehow handle those. Blacklist all case versions of the file? Stop the folder?

AudriusButkevicius · October 31, 2018, 8:35am

Given I feel we should go for 3 in the first section, 3 could also be ok in the second section, where you can work around it by unsharing the folder with the device in question.

I think we should default to case insensitive everywhere, with an advanced option to make it case sensitive. For existing setups, we cut a 0.15 and force everyone to rescan in a case insensitive mode (or migrate the db for them, if we can?). We make case sensitive folders on insensitive fs’es a failure to start folder going forwards.

For conflicts, we probably just pick one file of the two based on string sort order and be done with it.
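A rough sketch of "pick one based on string sort order", assuming plain byte-wise string comparison and `strings.ToLower` as the folding function. `pickWinners` is a hypothetical helper, not an existing Syncthing function.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// pickWinners groups names by their lowercased form. Within each group the
// name that sorts first (byte-wise) wins; the rest are reported as losers.
// Both result slices are sorted so the outcome is deterministic.
func pickWinners(names []string) (winners, losers []string) {
	groups := make(map[string][]string)
	for _, n := range names {
		key := strings.ToLower(n)
		groups[key] = append(groups[key], n)
	}
	for _, group := range groups {
		sort.Strings(group)
		winners = append(winners, group[0])
		losers = append(losers, group[1:]...)
	}
	sort.Strings(winners)
	sort.Strings(losers)
	return winners, losers
}

func main() {
	winners, losers := pickWinners([]string{"foo.txt", "FOO.txt", "bar", "Foo.txt"})
	fmt.Println("keep:", winners)
	fmt.Println("skip:", losers)
}
```

With byte-wise ordering, uppercase ASCII sorts before lowercase, so `FOO.txt` beats `foo.txt`. Every device has to use the same ordering for the choice to be consistent across the cluster.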

imsodin · October 31, 2018, 10:57am

What would even be the advantage of 1.? And as for whether 2 or 3 is required: That depends on the default: If we default to case insensitive 3 is possible, otherwise I think 2 is a must. Otherwise you need to know about this topic at folder creation time on a case sensitive system, and decide whether you will share with a case-insensitive system. That won’t be the case for the “normal user”.

I always imagined this to be a completely opaque cluster wide folder setting, where if a system shares a folder with any case-insensitive system, it switches to case-insensitive itself, thus propagating to all devices. Meaning no warnings, no stopped folders, definitely not allowing case sensitivity on case insensitive systems and defaulting to what is native to the respective filesystem.
I can see why one wants to make this more transparent, giving the user control. I am still in favor of 1. Yes, nobody reads them, but it is equally true that nobody looks at folder state. Syncthing continuing to sync for as long as possible, even on errors, is a big feature in my opinion. So if there is no automatic switch, I’d just go with a warning telling the user to switch the setting. Or add option 4: Keep syncing except to devices with the opposite sensitivity setting. However that’s likely a nightmare.

I am very much against this for the reasons stated. If the default will be case insensitive, there’s no problem with transition: Just do it if we implement it or cut 0.15 as Audrius said and do it the “hard way” (full rescan) on upgrade.

For the reason stated in the beginning I believe we should anyway implement db transitioning, so there’s not much harm in defaulting to case sensitive on case sensitive systems. I can already see and somewhat sympathize with the backlash from users who don’t even get near windows and are annoyed by having functionality restricted (or to avoid restriction, annoyingly having to repeatedly (remember to) set an advanced config option) because of it.

I believe we should store them, and display them in the web UI. Minimally just as a list with a note at the top to the user that they should manually resolve it (delete one). Ideally displaying both sizes and mtimes and allowing to select which file to keep in the UI.

User presentation (and invalid characters/reserved filenames/…)

If this is exposed to the user, couldn’t we call it something like “Filename Compatibility Mode”? For one, I am thinking that “case sensitivity” won’t be understood by many, but the notion of restricting functionality to the common subset supported by all systems could be. Maybe I am completely off here though. In addition, it would then make sense to group other issues behind that setting, like invalid characters (i.e. instead of failing on Windows when pulling them, fail when scanning - thus the error is on the device where action is required). However I am thinking about the naming here, not intending to feature creep.

calmh · October 31, 2018, 11:42am

Lots of good points above that deserve thoughts and response that I don’t have the bandwidth for atm, so just a couple of notes.

Both you and Audrius envision a forced migration in a 0.15 step. I don’t like that as we would be breaking setups that depend on case sensitivity today, guaranteed. There’s no way for us to know up front if a given device on Linux needs sensitive or insensitive mode, both are equally valid even if one is more common. The only safe action is to not change the current folders - “first, do no harm”.

I like this, as a more user friendly abstraction of the whole issue. We definitely need something clearer than “case sensitivity”. But we can’t completely mix together case sensitivity and disallowed characters. I’m on Mac and must run case insensitive, but I also often use ampersand and colon in my filenames. This is fine in my setup.

canton7 · October 31, 2018, 12:01pm

I’m nowhere near well-versed in the technical side of this, so I’ll describe my ideal scenario as a user. If this is technically hard or completely misses the point, please disregard.

I’m using “case-sensitive folder” to mean “a folder on a device, which is on a case-sensitive filesystem”.

As a user, I don’t want to be given the option to switch folders between case-sensitive / case-insensitive.
If I have a case-sensitive folder, I should be able to create filenames which differ only by case, and have those accurately reflected in other case-sensitive folders.
If I have a file in a case-sensitive folder which is synchronized to a case-insensitive folder, and I create another file whose name differs only by case, that file should fail to synchronize to the case-insensitive folder, with the usual “Out of sync” messages.
If I create two files in a case-sensitive folder whose names differ only by case, I don’t mind whether they both fail to synchronize to a case-insensitive folder, or whether one succeeds and the other fails.
If I have a file in a case-sensitive folder which is synchronized to a case-insensitive folder, and I change the case of its filename, I would not expect that change to be synchronized to the case-insensitive folder.
If I have a file in a case-insensitive folder which is synchronized to a case-sensitive folder, and I change the case of its filename, I would not expect that change to be synchronized to any other folders (case-sensitive or case-insensitive).

calmh · October 31, 2018, 1:11pm

I mostly agree (except I’d have higher expectations on the last two points).

A technical problem is that in Syncthing, the unique thing that identifies a file is the folder ID plus the file name. If we are case insensitive and the peer is case sensitive, we need to maintain their index data case sensitively and then somehow provide a case insensitive “projection” on top of this. That projection would then be responsible for noticing case conflicts…

BenShafer · November 1, 2018, 4:30pm

@canton7 For the last two bullets, I would expect the case change to be reflected on the peer.

canton7 · November 1, 2018, 4:32pm

My reasoning is that you have to jump through hoops to get case-only renames on case-insensitive filesystems to work in basically every other tool on the planet (e.g. change to an intermediate name, sync, change to the final name, sync), so I don’t want to put a requirement on Syncthing that nobody else adheres to.
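For reference, the local half of that workaround (rename to an intermediate name, then to the final name) could be sketched like this in Go. The sync steps in between are left out, and the helper names and the temporary suffix are made up for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// renameViaIntermediate renames oldPath to newPath in two steps, going
// through a temporary name, so that a case-only change also takes effect
// where both names are treated as the same file.
func renameViaIntermediate(oldPath, newPath string) error {
	tmp := oldPath + ".case-rename-tmp" // hypothetical suffix, assumed unused
	if err := os.Rename(oldPath, tmp); err != nil {
		return err
	}
	if err := os.Rename(tmp, newPath); err != nil {
		// Best effort: put the file back under its old name.
		_ = os.Rename(tmp, oldPath)
		return err
	}
	return nil
}

// demo creates foo.txt in a fresh temporary directory, renames it to
// FOO.txt and returns the name the directory listing reports afterwards.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "case-rename-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	oldPath := filepath.Join(dir, "foo.txt")
	if err := os.WriteFile(oldPath, []byte("x"), 0o644); err != nil {
		return "", err
	}
	if err := renameViaIntermediate(oldPath, filepath.Join(dir, "FOO.txt")); err != nil {
		return "", err
	}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return "", err
	}
	if len(entries) != 1 {
		return "", fmt.Errorf("expected 1 entry, got %d", len(entries))
	}
	return entries[0].Name(), nil
}

func main() {
	name, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name)
}
```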

BenShafer · November 1, 2018, 5:01pm

@canton7 I know it’s a pain, but it’s important when syncing files in a working directory of a git repository (don’t ask why we’re doing that).

canton7 · November 1, 2018, 5:02pm

Git has this same restriction on case-insensitive filesystems.

Note that I’m not going to be implementing this, and I’m by no means describing what the behaviour will be. All I’m doing is laying out what would be my ideal scenario, as a user.

If you have your own requirements, by all means state them, but there’s no benefit in trying to change what my expectations as a user are.

alessandro.g89 · November 2, 2018, 2:30pm

TL;DR: my proposal is to have “case-insensitive” as the default, make “case-sensitive” an option for people who really want/need it, block synchronization between nodes that have a different case-sensitivity setting for the same folder, and use an “undefined” state during the transition (which means “keep current behavior”).

Full version:

I like the “first, do no harm” principle. I would suggest to use it this way to encourage a gradual migration:

Allow 3 settings: case-sensitive, case-insensitive, undefined (last one can’t be set by the user, it is automatically inherited by pre-existing folders, and it means to behave like today).
Warn users (via the UI, changelog, everywhere) that they will have to set this option for all their folders. The warning only goes away when there are no more “undefined” folders.
Newly created folders can only be “sensitive” or “insensitive”, with “insensitive” being the default (i.e. assume the worst, that they might talk to a case-insensitive system) and “sensitive” being only available on case-sensitive systems.
Only allow synchronization between nodes with the same setting: sensitive-sensitive, insensitive-insensitive, undefined-undefined. Disallow mixed configurations (show warning, pause folders, whatever).
At some point in the future, remove the “undefined” state and change it to “case-insensitive” (again, for the “assume the worst” rule). Give enough time and warnings before doing that.

Result: in the long run, “case-insensitive” will be the default and “case-sensitive” will be an option for users that really want/need it. I suspect they are the minority (but I can be wrong on this). This avoids all the headaches of dealing with heterogeneous configurations and all the corresponding corner cases.
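The rules in this proposal are small enough to write down as code. The following Go sketch is only an illustration of the list above; the type, constant and function names are hypothetical and do not correspond to existing Syncthing configuration.

```go
package main

import (
	"errors"
	"fmt"
)

// CaseMode is the per-folder setting from the proposal.
type CaseMode int

const (
	Undefined   CaseMode = iota // inherited by pre-existing folders: behave like today
	Sensitive                   // only available on case sensitive systems
	Insensitive                 // default for newly created folders
)

// DefaultMode is what a new folder gets when the user makes no choice.
const DefaultMode = Insensitive

// validateNewFolder checks the mode requested for a newly created folder.
func validateNewFolder(mode CaseMode, fsCaseSensitive bool) error {
	switch mode {
	case Insensitive:
		return nil
	case Sensitive:
		if !fsCaseSensitive {
			return errors.New("case sensitive mode requires a case sensitive filesystem")
		}
		return nil
	default:
		return errors.New("new folders cannot be undefined")
	}
}

// canSync reports whether two devices may sync a folder: only identical
// settings are allowed, mixed configurations are not.
func canSync(local, remote CaseMode) bool {
	return local == remote
}

func main() {
	fmt.Println(canSync(Undefined, Undefined))       // legacy peers keep working
	fmt.Println(canSync(Sensitive, Insensitive))     // mixed setup is rejected
	fmt.Println(validateNewFolder(Sensitive, false)) // not allowed on insensitive FS
}
```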

Note that this proposal goes against my own interests (I only use Linux and I would prefer the case-sensitive behavior), but I think it’s better when looking at the big picture.

calmh · November 2, 2018, 2:49pm

So combining your idea of “undefined” (maybe “legacy-sensitive”?) with @imsodin’s cluster wide infectiousness might yield a reasonable transition mechanism. That is, “undefined” folders could adopt the mode of the peer, if the peer is not undefined. Assuming we can handle database transitions on the fly, which we probably must.
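As a sketch of that adoption rule, with hypothetical names and string-valued modes chosen only for readability:

```go
package main

import "fmt"

// CaseMode is a hypothetical per-folder setting; "undefined" marks a
// legacy folder that has not been assigned a mode yet.
type CaseMode string

const (
	Undefined   CaseMode = "undefined"
	Sensitive   CaseMode = "sensitive"
	Insensitive CaseMode = "insensitive"
)

// adopt returns the mode a folder should run in after learning the peer's
// mode. The second result reports whether the local mode changed, which in
// this scheme would trigger a database conversion.
func adopt(local, peer CaseMode) (CaseMode, bool) {
	if local == Undefined && peer != Undefined {
		return peer, true
	}
	return local, false
}

func main() {
	mode, changed := adopt(Undefined, Insensitive)
	fmt.Println(mode, changed)
}
```

Only undefined folders move; two peers with different explicit modes are left as they are, so that case still needs one of the mismatch policies discussed earlier.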

But the cleanest option would be to not even have to make the choice, what @canton7 is suggesting. Perhaps every folder could initially be case insensitive. When it runs into a local case-only naming conflict during scanning it switches to case sensitive mode and sets a bit on the conflicting files to indicate that they must be treated case sensitively.

Devices that receive such index entries can then treat them case sensitively (if on a system where that’s doable) or flag them as sync errors…

alessandro.g89 · November 2, 2018, 3:58pm

That’s actually a good idea, I didn’t even think to handle the issue per-file. This method would nag users only when it’s really necessary. I would be totally ok with that.

imsodin · November 2, 2018, 6:44pm

In the per-file/no-choice solution I like that there is no option at all. I do not so much like that errors do not appear on the device of origin. Meaning you need to get control of a remote case-sensitive device to fix them. Also implementation wise having to check for case conflicts on all pulls and scans seems hairy.

calmh · November 2, 2018, 7:26pm

I’m imagining the database always having case insensitive keys and the value being either a FileInfo (all good) or a slice of FileInfos (sharing the same case folded name; case conflict bit implied). The scanner would need to be not Walk() based but listdir based so it can determine case conflicts ahead of time. On the pulling side the case conflict bit would mean the file is either handled as usual (because we can) or failed (because we can’t).
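A minimal sketch of that data model, assuming a simplified `FileInfo` and an in-memory map in place of the real database. Here a single file is just a slice of length one, and the conflict state is implied by the slice having more than one entry. The names `Index`, `Put`, `Conflicted`, `buildIndex` and `canPull` are invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// FileInfo is a simplified stand-in for the real index entry.
type FileInfo struct {
	Name    string
	Deleted bool
}

// Index maps the case folded name to all known case variants.
type Index map[string][]FileInfo

// fold uses strings.ToLower as a simplification of real case folding.
func fold(name string) string { return strings.ToLower(name) }

// Put stores fi, replacing an existing variant with the exact same name
// or adding a new variant next to the others.
func (ix Index) Put(fi FileInfo) {
	key := fold(fi.Name)
	for i, existing := range ix[key] {
		if existing.Name == fi.Name {
			ix[key][i] = fi
			return
		}
	}
	ix[key] = append(ix[key], fi)
}

// Conflicted reports whether more than one case variant shares the key.
func (ix Index) Conflicted(name string) bool {
	return len(ix[fold(name)]) > 1
}

// buildIndex indexes one directory listing (for example the names from
// os.ReadDir), so conflicts are known before any file is processed.
func buildIndex(names []string) Index {
	ix := make(Index)
	for _, n := range names {
		ix.Put(FileInfo{Name: n})
	}
	return ix
}

// canPull mirrors the pulling rule: conflicted files are handled as usual
// on case sensitive systems and failed elsewhere.
func canPull(ix Index, name string, fsCaseSensitive bool) bool {
	return fsCaseSensitive || !ix.Conflicted(name)
}

func main() {
	ix := buildIndex([]string{"foo.txt", "FOO.txt", "bar.txt"})
	fmt.Println(ix.Conflicted("foo.txt"), ix.Conflicted("bar.txt"))
	fmt.Println(canPull(ix, "foo.txt", false), canPull(ix, "foo.txt", true))
}
```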

One obvious complication is deciding when to move back to the non-conflict situation. It’s basically the same as when do we remove entries for deleted files.

I’m sure there are millions of other corner cases too but it feels doable from where I’m now after a Friday evening bottle of wine.

And yeah, no error on the source side. Although, we could rig a warning for having case conflicts, and we could have devices announce whether they are in fact case sensitive or not (to be able to emphasize the warning), and there is the out of sync indicator for remote devices.

canton7 · November 2, 2018, 8:28pm

Agreed no warning on the case sensitive side is something less than desirable… However I think anything should be a warning only, and shouldn’t stop successful synchronisation with other case sensitive devices.

imsodin · November 2, 2018, 9:21pm

How does “case sensitive mode” work with always insensitive keys to lists of FileInfos?

calmh · November 3, 2018, 6:32am

The same, I figure. That is, there is no “case sensitive mode”, there is only case insensitive and special handling of conflicts. These conflicts won’t happen on case insensitive fs:s, and on case sensitive systems they are bound to be unusual enough that a slight overhead in handling them doesn’t matter.

In fact, that could be implementation step one - change the db keying and add the handling for case conflicts and setting the relevant new fileinfo bits. Nothing should change in the behavior. Next step would be for case insensitive systems to realize that reception of entries with the conflict bit means that file (in all case variants) is now invalid…

I’m still not sure how to ever safely “untaint” a file that has had case conflicts, though. Maybe once all variants are deleted they could be coalesced into one normal case insensitive delete entry, and case insensitive systems would need to understand that a delete entry with the conflict bit set does not in fact conflict with another delete entry without the conflict bit, or something.

imsodin · November 3, 2018, 6:27pm

Ok, got that now. Next question:
Why do we even need the new “conflict bit” on file infos? As far as I understand, when conflicts need to be handled anyway, this means that whenever comparing/replacing/updating/… fileinfos, filenames need to be checked whether they are equal (with case). So we always know whether there is a case conflict or not, no need to set a specific bit on it.

Do we have to “untaint” at all? It looks like not just similar, but the exact same problem as with deleted files, meaning we need to keep it anyway (maybe related to me not getting the significance of the conflict bit).

calmh · November 3, 2018, 7:14pm

Yeah, so about that… Let’s pretend we are case insensitive, and our file identifier / primary key in the database is then the case folded (lowercase) filename. Case only renames are not a problem - the filename is a piece of metadata like any other (permissions, modtime, …) so we just need to take care of the case where an update comes in with the same key (lowercase filename) but a different filename and do a rename on it. Easy. Case insensitive devices won’t send the “usual” delete + new file pair, because the deleted file and the recreated one would have the same name (in a case insensitive world).

But what if the other side is case sensitive, and there is a case conflict? They’ll send us an update for a filename that differs in case only. How do we know it’s a new file and not just a case-only rename of the file we already had? This is where I want that other device to send an update with the case conflict bit set, to indicate that a new file has materialized that shares the case insensitive name with another file.
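The decision a case insensitive receiver would make under this scheme can be sketched as follows. `Update`, `Action` and `classify` are hypothetical names, the local state is reduced to a map from folded name to the currently known name, and `strings.ToLower` again stands in for real folding.

```go
package main

import (
	"fmt"
	"strings"
)

// Update is a simplified incoming index entry.
type Update struct {
	Name         string
	CaseConflict bool // set by the sender when another case variant exists
}

// Action is what a case insensitive receiver does with an update.
type Action string

const (
	AddFile      Action = "add"
	UpdateFile   Action = "update"
	RenameFile   Action = "case-only rename"
	MarkConflict Action = "case conflict"
)

// classify decides how to treat u given the names already known locally,
// keyed by lowercased name. Without the conflict bit, a differing name
// under the same key can only be a case-only rename.
func classify(known map[string]string, u Update) Action {
	if u.CaseConflict {
		return MarkConflict
	}
	current, ok := known[strings.ToLower(u.Name)]
	switch {
	case !ok:
		return AddFile
	case current == u.Name:
		return UpdateFile
	default:
		return RenameFile
	}
}

func main() {
	known := map[string]string{"foo.txt": "foo.txt"}
	fmt.Println(classify(known, Update{Name: "FOO.txt"}))
	fmt.Println(classify(known, Update{Name: "FOO.txt", CaseConflict: true}))
}
```

Without the bit, the two calls in `main` would be indistinguishable to the receiver, which is the ambiguity described above.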

Maybe. The teensy difference between the two is that there is no difference, in my scheme, between saying foo.txt and FOO.txt are two deleted files and foo.txt (case insensitive) is deleted. It amounts to the same thing - nothing called foo.txt (in any variant) - so we can simplify to the latter and forget there was a case conflict at one point. And we need to do the untainting, because if - after these deletes - the file is recreated, it can be synced by case insensitive devices. As long as there is just one variant.
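A sketch of that untainting step, assuming the same simplified `FileInfo` shape as a name plus a deleted flag. `coalesce` is a hypothetical helper, and using the lowercased name for the merged entry is an assumption of the example.

```go
package main

import (
	"fmt"
	"strings"
)

// FileInfo is a simplified stand-in for the real index entry.
type FileInfo struct {
	Name    string
	Deleted bool
}

// coalesce collapses the case variants stored under one key into a single
// delete entry once every variant is deleted. If any variant is still
// alive, or there is at most one variant, the input is returned unchanged.
func coalesce(variants []FileInfo) []FileInfo {
	if len(variants) < 2 {
		return variants
	}
	for _, v := range variants {
		if !v.Deleted {
			return variants
		}
	}
	// Nothing with this name exists in any variant, so one case
	// insensitive delete entry says the same thing.
	return []FileInfo{{Name: strings.ToLower(variants[0].Name), Deleted: true}}
}

func main() {
	gone := []FileInfo{{Name: "foo.txt", Deleted: true}, {Name: "FOO.txt", Deleted: true}}
	fmt.Println(coalesce(gone))

	mixed := []FileInfo{{Name: "foo.txt", Deleted: true}, {Name: "FOO.txt"}}
	fmt.Println(coalesce(mixed))
}
```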