Implementing case insensitivity

Whether case insensitivity should be on by default, or under which circumstances, is pointless bikeshedding at this point. That’s not the hard and important part; the hard part is getting a system in place that can handle case insensitivity at all. Pointing out that you don’t want it doesn’t do any good: it’s clear that it’s useful and needed in many cases. If we ever get there and you feel like your use case is threatened, that’s the moment to speak up. Do so calmly and politely, though; the chances that you’ll be taken seriously are much better then. And most definitely don’t quote someone partially and out of context, that’s just super annoying.

I said that making it the default is not a good decision, not that having case insensitivity support is bad. Talk about partial quotes: you took half of my sentence. (Whether it should be the default is also what the linked GitHub issue I reacted to is about.)

I don’t see anything non-calm about what I have said. Maybe not the most delicate, but definitely not rude. Seeing how people in the GitHub thread were very polite but got nowhere, and the issue got locked without a solution, I had to bring up the topic here. But I digress: discussing the tone in which I voice my concern about an anti-feature being the default is indeed bikeshedding.

I’m sorry, but it actually is. And not only a good decision, but the only viable decision at this point.

Syncthing’s first priority is (and has been for as long as I remember) not to lose any user data unexpectedly. Losing user data is absolutely the worst outcome, and avoiding it trumps any other concern such as performance, usability, etc.; it says so in the project’s goals.

If (or, hopefully, when) we finally implement case-insensitive behaviour, we only have two options: have it enabled by default or have it disabled by default. Combined with the three kinds of setups a user can have, that gives six possible user stories:

1a: Insensitive by default, user only has sensitive filesystems. This is inconvenient; Syncthing will not lose data, but may not sync some of it, and needs to be told to be sensitive. Bad, but not critical.

1b: Insensitive by default, user only has insensitive systems. Everything just works.

1c: Insensitive by default, user has a mix of systems. Some of the data may not start syncing right away and may require user interaction, but nothing is lost unexpectedly. Not really inconvenient, because what else would you do.

2a: Sensitive by default, sensitive filesystems. Just works (the way it does now).

2b: Sensitive by default, insensitive filesystems. Works (the way it does now).

2c: Sensitive by default, mix of systems. Data loss is guaranteed, the way it is now. Very bad, should be avoided at any cost.

See? Having case-sensitive behaviour by default is basically incompatible with the project’s topmost-priority goal.
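
For what it’s worth, the six stories boil down to a small lookup table. The sketch below only restates the argument above in Python; the `Outcome` labels, the `STORIES` table and the `worst_case` helper are names made up for illustration, not anything that exists in Syncthing.

```python
from enum import IntEnum


class Outcome(IntEnum):
    """Outcomes ordered from best to worst (labels are illustrative)."""
    WORKS = 0              # everything syncs without intervention
    NEEDS_USER_ACTION = 1  # some files wait for the user, nothing is lost
    DATA_LOSS = 2          # files are silently clobbered


# (default behaviour, kind of swarm) -> outcome, transcribed from stories 1a-2c
STORIES = {
    ("insensitive", "all-sensitive"): Outcome.NEEDS_USER_ACTION,  # 1a
    ("insensitive", "all-insensitive"): Outcome.WORKS,            # 1b
    ("insensitive", "mixed"): Outcome.NEEDS_USER_ACTION,          # 1c
    ("sensitive", "all-sensitive"): Outcome.WORKS,                # 2a
    ("sensitive", "all-insensitive"): Outcome.WORKS,              # 2b
    ("sensitive", "mixed"): Outcome.DATA_LOSS,                    # 2c
}


def worst_case(default: str) -> Outcome:
    """Return the worst outcome any user can hit under the given default."""
    return max(o for (d, _), o in STORIES.items() if d == default)
```

Comparing `worst_case("insensitive")` with `worst_case("sensitive")` is the whole point: the first tops out at an inconvenience, the second at data loss.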

This requires case-folding to happen exactly the same on all platforms and within Syncthing. This is something that following Unicode standards should allow, but considering the track record of the main proprietary OS vendors on that front, I wouldn’t dare take it as a given.

And why can’t these two be auto-detected based on the connected devices instead of slapping the default on everyone?

That data loss happens at very specific occurrences, why can’t this require manual action like 1c?

What you’ve said is correct only if the current system can’t be improved, but that’s really not the case.

There are three issues that make this kind of autodetection a huge endeavor in and of itself:

  1. A given machine may not be connected to every other device in the swarm. Worse, it may not even be aware of all the devices and their properties. Consider the config where machines A and B share a folder, while A shares the same folder with C, D, and E (none of whom is aware of B), and B shares the same folder with F, G, and H (none of whom is aware of A). How is C supposed to even know about H?

  2. A shared folder is on a Linux box with case-sensitive EXT4, but there is a subdirectory mounted inside it, and that mountpoint holds a case-insensitive FAT32 filesystem. Autodetecting these situations reliably is not something I’d aspire to implement.

  3. A further complication of cases 1. and 2.: either a new case-sensitive machine is added to an otherwise case-insensitive swarm (or vice versa), or a new case-sensitive subdirectory is added to a case-insensitive folder (or vice versa). Good luck trying to implement reliable autodetection of these cases!
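
The topology from point 1 can be written down in a few lines. This is a rough sketch with a hypothetical `LINKS` list standing in for each device’s configuration; it is not how Syncthing models its cluster, it just shows the gap between what a device is connected to and what it ends up sharing data with.

```python
from collections import deque

# Direct connections from the example: A-B, A with C/D/E, B with F/G/H.
LINKS = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
         ("B", "F"), ("B", "G"), ("B", "H")]


def direct_peers(device):
    """Devices that `device` is configured to talk to directly."""
    peers = set()
    for left, right in LINKS:
        if left == device:
            peers.add(right)
        elif right == device:
            peers.add(left)
    return peers


def swarm(device):
    """Every device that shares data with `device`, directly or via others."""
    seen, queue = {device}, deque([device])
    while queue:
        for peer in direct_peers(queue.popleft()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen - {device}
```

Here `direct_peers("C")` is just `{"A"}`, while `swarm("C")` contains all seven other devices, H included. C’s data reaches H’s filesystem, yet C has no direct way to ask H whether that filesystem is case-sensitive.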

Because 1c without user interaction does not lead to data loss, while 2c does. Data loss is worse than inconvenience, period. We can have a user jump through extra hoops to make things work as they expect (or at all), but we can’t, shouldn’t and mustn’t have a situation where simply installing and running Syncthing without checking a very specific option somewhere in it results in losing user data.

It’s actually not as bad as it sounds. During my attempt to storm this issue last year we (I’m still awed by the notorious patience and cooperation of all the maintainers during those weeks) managed to implement a case-insensitive version of FakeFS (the filesystem mock that parts of the Syncthing code are tested against) that reliably mimicked the behaviour of filesystems on all the OSes we compile Syncthing for. It’s part of the test suite now, so should something break or change, we’ll be aware of it.

That’s good to hear, because I’ve occasionally had issues with case folding that just seem bizarre. Although, come to think of it, the more recent weirdness may be partly because of OS X’s decision to use NFD rather than NFC.
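
For anyone who hasn’t run into the NFD/NFC difference: the same visible file name can be two different strings. A minimal Python illustration using the standard `unicodedata` module follows; the `same_name` helper is a made-up name, and this says nothing about how Syncthing itself deals with normalization.

```python
import unicodedata

name = "caf\u00e9.txt"                    # "é" as one precomposed code point
nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)  # "e" followed by a combining accent


def same_name(a: str, b: str) -> bool:
    """Compare two file names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

The two forms render identically but compare unequal as raw strings (`nfc != nfd`), and the NFD form is one code point longer. Only a comparison that normalizes both sides first, like `same_name`, treats them as the same file.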

That specific weirdness we already do handle, at least.

And, yes. Why, Apple. :frowning:

Except when someone relies on syncing being case-sensitive.

No. There is a real difference between saying “there is a conflict, we didn’t sync this file, you need to twiddle a config” and “you had two files that only differed in case, so we clobbered the data in one of them to match the other, sorry”.

It seems like you’re trying to argue that your convenience in not having to flip a default setting trumps other people’s risk of data loss. That’s not how this project works, and it’s a frankly inane discussion to even have. Even more so because none of this exists as code and the potential problems and solutions haven’t been fully explored yet.

It’s not inane to discuss whether it’s even necessary to create hassle for anyone. If none of this exists as code, this is the perfect time to discuss it. Especially since you’ve just said that “the potential problems and solutions haven’t been fully explored yet”.

You brought up two scenarios, let’s take those.

Why can’t this case:

“you had two files that only differed in case, so we clobbered the data in one of them to match the other, sorry”

be turned into this instead:

“there is a conflict where two files only differ in case, we could’ve synced these but this might result in data loss, do you want to change the config Y/n”

That’s what I’m trying to understand and get an answer to (like nekr0z nicely did once already for one question), instead of getting a dismissive, non-technical “it’s a project goal” or “it’s a data loss risk”, or a closed thread.

One more thing, though: in case it hasn’t been clear, I have never said that I want people to lose their data or even risk losing it. It’s an understandable concern, and as a very long-time Syncthing user I’m grateful for the priority.

I try, but fail, to grasp how that is not exactly the solution @calmh is suggesting. Answering that question requires user interaction, so whatever is said in the message and whatever buttons it offers, is by definition not part of what happens without user interaction.

Well, this is exactly what we aim at when we say “Syncthing should be case-insensitive by default”.

When we say that Syncthing will default to case-insensitivity, we don’t imply that somehow all your "fOO"s, "Bar"s, and "BaZ"s will sync as "foo"s, "bar"s, and "baz"s, respectively, or will undergo some other case-folding. No, we shall be preserving the case, at least that’s what we aim for. But in case Syncthing sees “FOO” and “foo” in the same directory, it will stop working on this folder and display an error message, basically along the lines of what you’ve just suggested.

This will add an extra step for users that have only case-sensitive systems and expect “foo”, “Foo”, and “FOO” to sync as three different files in one directory; this is a perfectly valid use case that we will continue to support, but it will require the user to click on the (intentionally) very scary red button “Yes, I do want this folder case-sensitive, and I understand that as soon as I share it with at least one machine with a case-insensitive filesystem, my data will be screwed up beyond any restoration!”

We believe this one extra step for this one category of users is a perfectly acceptable price to pay for avoiding silently destroying data on machines of many other users. And when I say “many”, I’m basing my guesstimation on the fact that our stats show about half the users use Syncthing on Windows and OS X, and it is likely very common that these users at some point purchase a Linux-based NAS for their homes, at which point their data is in peril.
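
The “FOO and foo in the same directory” check described above is cheap to express. A minimal sketch in Python, assuming `str.casefold()` is an acceptable stand-in for whatever folding the real filesystems apply (which, as noted earlier in the thread, is not a given); `case_collisions` is a hypothetical helper, not Syncthing code.

```python
from collections import defaultdict


def case_collisions(names):
    """Group the names in one directory that differ only by case.

    Returns a list of groups (each sorted); names without a case-only
    twin are left out. Original spelling is preserved, nothing is renamed.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

For a listing like `["FOO", "foo", "Bar"]` this reports one group holding `FOO` and `foo`, which is the situation where the folder would stop and ask. An empty result means the directory is safe under either setting.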

Huh. There’s a point I’m not sure has been raised before. Case folding and case sensitivity are not the same thing. SQL, for instance, is case folding but case sensitive. (Try quoting lowercase identifiers in SQL.)

Also, a point I just realised I should have made in the earlier reply:

It would create a conflict, yes, but how is that different from the usual case of inconsistency between file contents? Calling this ‘data clobbering’ only makes sense from a perspective where case sensitive behaviour is the only conceivably correct thing, which is flatly not the case – even if you prefer it. (As, frankly, I do, but that’s neither here nor there.)

The issue being discussed here is not that systems A and B have files “FOO” and “foo”, and they get synced (or conflicted) — this one is simple and is basically an ordinary conflict, nothing too scary.

The real issue is this: A is case-sensitive and has “Foo” and “foo”, B is case-insensitive and tries to sync both, syncs them into one file and simply loses the contents of one of them, propagating it further.

The even bigger problem is that trying to rename “BaR” to “bar” on B results in the file being wiped out entirely: B tells A it created “bar” and removed “BaR”; A does that and tells B that yes, it removed “BaR”. B looks for “BaR” on its filesystem, sees it (because “bar” is there, which is the same name as far as B is concerned), and removes it. The removal is propagated, and there you go: you tried to rename the file, and the file got removed completely.
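
Both failure modes can be replayed with a toy model. The sketch below is a deliberately naive simulation in Python: `CaseInsensitiveFS` is a made-up, simplified case-preserving filesystem, a plain `dict` stands in for the case-sensitive side, and the exchange of updates is hand-written to follow the steps described above rather than Syncthing’s actual protocol.

```python
class CaseInsensitiveFS:
    """Toy case-insensitive, case-preserving filesystem (simplified model)."""

    def __init__(self):
        self._files = {}  # folded name -> (name as stored, data)

    def write(self, name, data):
        key = name.casefold()
        # An existing entry keeps its spelling; only the data is replaced.
        stored = self._files[key][0] if key in self._files else name
        self._files[key] = (stored, data)

    def rename(self, old, new):
        _, data = self._files.pop(old.casefold())
        self._files[new.casefold()] = (new, data)

    def exists(self, name):
        return name.casefold() in self._files

    def remove(self, name):
        self._files.pop(name.casefold(), None)

    def read(self, name):
        return self._files[name.casefold()][1]

    def names(self):
        return sorted(stored for stored, _ in self._files.values())


def merge_scenario():
    """A has Foo and foo; B receives both and ends up with one file."""
    a = {"Foo": "first", "foo": "second"}  # case-sensitive side
    b = CaseInsensitiveFS()
    for name, data in a.items():
        b.write(name, data)
    return b.names(), b.read("Foo")


def rename_scenario():
    """B renames BaR to bar; a naive exchange of updates deletes the file."""
    a = {"BaR": "data"}
    b = CaseInsensitiveFS()
    b.write("BaR", "data")

    b.rename("BaR", "bar")    # the user renames the file on B
    a["bar"] = b.read("bar")  # B -> A: created "bar"
    del a["BaR"]              # B -> A: removed "BaR"
    if b.exists("BaR"):       # A -> B: removed "BaR"; B still "sees" it
        b.remove("BaR")       # ...so B deletes what is really "bar"
    if not b.exists("bar"):   # B -> A: "bar" is gone
        a.pop("bar", None)
    return a, b.names()
```

`merge_scenario()` leaves B with a single `Foo` holding the second file’s data, so the first file’s contents are gone. `rename_scenario()` ends with no file on either side, even though the user only asked for a rename.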

Hmm. It seems I missed where the quote came from, specifically. My misreading; sorry.

Let’s take this case again. What I’ve been trying to figure out is why device B, which does the data destruction, can’t throw up a big scary prompt, instead of device A, which just has the files.

It’d be safer than both the current way and the way you’ve proposed, because it’d eliminate the chance of someone adding a destructive case-insensitive device to a case-sensitive swarm, which is the actual problem, as you’ve highlighted.

Is the distinction understandable or did I still phrase it confusingly?

First of all, no single device does data destruction; data destruction happens because of the combination of the devices in the swarm. Every device does its best to do the right job, but the combination of efforts happens to be destructive. In some cases one might argue that it is the case-sensitive device that is a data destroyer, but that’s not the point.

Filesystems don’t report whether they are case-sensitive or not, and from the software’s (i.e. Syncthing’s) perspective the only way of knowing that the filesystem is case-insensitive is to create a random file foo, fill it with some data, and then read the file FOO from the same directory and find the same data there. Then we may assume that either foo is a hardlink to FOO (we just created it, so it hardly is), or the filesystem is case-insensitive. Doing this check doesn’t look like much trouble, but as I said earlier, every directory can be a mountpoint (or become one) and end up case-insensitive, and re-checking every subdirectory on every write is just not worth it.
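
The probe described above looks roughly like this in Python. It is a sketch of the idea only: `looks_case_insensitive` is a hypothetical helper, the `casecheck-` prefix is arbitrary, and the result holds for the one directory probed at the moment of probing, which is exactly the limitation mentioned above.

```python
import os
import tempfile
import uuid


def looks_case_insensitive(directory: str) -> bool:
    """Write a random token to a new file, then try to read it back
    through the same name in upper case."""
    token = uuid.uuid4().hex
    # mkstemp picks a fresh name; the prefix guarantees it contains letters,
    # so the upper-cased name always differs from the original.
    fd, path = tempfile.mkstemp(prefix="casecheck-", dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(token)
        head, name = os.path.split(path)
        try:
            with open(os.path.join(head, name.upper())) as handle:
                # Same data under a differently-cased name: insensitive.
                return handle.read() == token
        except FileNotFoundError:
            return False
    finally:
        os.remove(path)  # always clean up the probe file
```

Comparing the token rather than merely checking that the upper-case name exists guards against an unrelated file that happens to carry that name on a case-sensitive filesystem.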

Hence, there’s no simple way for a case-insensitive system to know that it is one. However, we can detect danger when we see foo and FOO in one directory (and you should obviously be on a case-sensitive filesystem for this to happen) and put up the alert; this, at least, looks doable from the programming point of view. Of course on an all-case-sensitive swarm this can be normal, but since we have no way of knowing that the swarm is all-case-sensitive (for the reasons I have described earlier in this thread) we’ll need the user to explicitly enable this kind of behaviour.

Now, there may be better approaches to the whole thing, granted. As @calmh has already mentioned, implementing case-insensitivity is really hard, and we’re still not very close to getting there (not for lack of trying, mind you), and for all of this to even be of importance we must first get case-insensitivity sorted out. But at least this approach looks sane to the maintainers, and I have yet to see someone suggest a better one.

It’s also tricky for a case insensitive device that already has Foo and gets an update for fOO. Is it a problem that should be flagged? Is it an update to the same file just spelled differently because it comes from another also case insensitive system? Was the file just case-only renamed? I’m not 100% sure how to tell.

On a case sensitive system it’s easy to see when we have two such files and set a hypothetical “requires case sensitivity” flag on both, causing the warning/error to happen on the case insensitive side.
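
To make the hypothetical flag concrete, here is a rough Python sketch. Everything in it is invented for illustration: the `FileInfo` record, its `requires_case_sensitivity` field and both helper functions are assumptions, not part of Syncthing’s data model or protocol.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FileInfo:
    name: str
    requires_case_sensitivity: bool = False  # hypothetical flag


def announce(names):
    """On a case-sensitive device: flag every name that has a case-only twin."""
    folded = Counter(name.casefold() for name in names)
    return [FileInfo(name, folded[name.casefold()] > 1) for name in names]


def receive_on_insensitive(infos):
    """On a case-insensitive device: sync unflagged files, report the rest."""
    accepted = [info.name for info in infos if not info.requires_case_sensitivity]
    blocked = [info.name for info in infos if info.requires_case_sensitivity]
    return accepted, blocked
```

With a sender holding `foo`, `FOO` and `bar`, the receiver would accept `bar` and hold back the other two for the warning. The sketch only covers the direct sender-to-receiver case.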

But what about two case sensitive systems with one insensitive in between? How would it understand what’s going on?

And my gut feeling is still that if someone has two computers, one with “Club meeting 20191230.txt” and the other with “Club Meeting 20191230.txt” the odds are greatly in favor of them actually intending these to be the same file and not two separate meeting notes beside each other. Regardless of what the file system would think. Case insensitive by default acknowledges this. I’m sure there are people who love to have files beside each other that just differ in case, but it’s not going to be the majority.
