Cross platform: character sets and filename limitations

usernamegoeshere · January 27, 2016, 5:33pm

Hi there, rather new Syncthing user here. I have some Windows Syncthing instances cooperating with Apple Mac OS X ones. Mac users have funny ways and habits of naming their files, e.g. with colons, slashes or backslashes, you name it. You cannot teach non-IT people about all the limits that exist in the software and technical worlds. Anyway, I get tons of warnings about illegal characters on the Windows machine, shown in yellow in the notification area at the top of the main localhost Syncthing web page.

I was thinking about a better solution to this problem, given the potential data loss, or the Windows box being left behind in these situations. Am I right that the Windows machine doesn't receive these badly named files, and thus never keeps a copy of these incompatible files from the Mac boxes?

If yes, could you please make Syncthing use temporary workaround, placeholder or reserved filenames, and keep the relevant information, such as the original filename, in its metadata or some side-channel structure? Then, if the Mac users (or the other way around) eventually decide to rename their files to make them cross-platform compatible, the new filename could be applied everywhere.

Maybe filenames could be the actual individual hash (sha256?) of the file/payload or similar.

Any comments? Am I the only one running into cross-platform trouble with end-users?

Thanks and cheers.

calmh · January 27, 2016, 5:59pm

Yes.

This is tricky and I haven’t seen a proposal yet that I liked. Specifically, I don’t like…

- replacing characters with underscores or similar: how do we tell the difference between a file just called foo_bar.txt and one renamed to that from foo:bar.txt? What if both exist on the Mac?
- renaming files to something like above and keeping the actual file name in the index (side channel): what if we lose the index and forget the mapping?
- renaming files to the SHA256 hash of the actual file name or similar: too user unfriendly, and again, what if we just happen to have a file called that to begin with? Also this is one way so again depends on the side channel.

That’s what I come up with from the top of my head, but throw in more ideas and I’ll try to shoot them down. If I can’t, we can implement it.

One thing we could do (but that I don’t particularly like either) is to rename or suggest renaming the files on the source (Mac or Linux).
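The first objection above, that a name produced by underscore replacement cannot be told apart from a genuinely underscored name, can be illustrated with a minimal Python sketch. The helpers `sanitize` and `find_collisions` are hypothetical and do not reflect anything Syncthing does; the character class is the usual set of characters that Windows rejects in file names.

```python
import re

# Characters that Windows does not allow in file names.
WINDOWS_ILLEGAL = re.compile(r'[<>:"/\\|?*]')


def sanitize(name: str) -> str:
    """Hypothetical helper: replace each illegal character with an underscore."""
    return WINDOWS_ILLEGAL.sub("_", name)


def find_collisions(names):
    """Group original names by their sanitized form and keep the ambiguous groups."""
    groups = {}
    for name in names:
        groups.setdefault(sanitize(name), []).append(name)
    return {target: originals for target, originals in groups.items()
            if len(originals) > 1}


mac_files = ["foo:bar.txt", "foo_bar.txt", "notes.txt"]
print(find_collisions(mac_files))
# {'foo_bar.txt': ['foo:bar.txt', 'foo_bar.txt']}
```

Both Mac files would want the same name on the Windows side, so one of them would have to lose or be renamed again.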

usernamegoeshere · January 27, 2016, 6:19pm

Thanks for the speedy reply. But losing a filename is not as bad as losing the whole file itself. Doesn't Syncthing have .stfolder and similar metafiles already? Why not always name an incoming, incomplete, in-transfer file in such a way, and only rename it once it has safely arrived and been tested (checksum tested? block tested?)? Then only this final rename stage would fail on a platform that is incompatible with the character set or has other naming limitations. The users would then have the file everywhere and could take care of the renaming later.

Or we need to implement a notification across all platforms that warns users about the least common denominator of all the participating Syncthing platforms. As long as only Mac users sync, don't warn them about e.g. colons, but when the first Windows user joins their sync, you need to give out warnings, as this Windows user would never receive all the files that exist.

There gotta be some clever way to handle this stuff. Thanks for your work.
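A rough sketch of what such a least-common-denominator check might look like, in Python. The function `windows_problems` is hypothetical and not part of Syncthing. It only covers the well-known Windows naming rules (forbidden characters, reserved device names, trailing space or dot) and ignores path length limits.

```python
# Characters Windows forbids in file names, plus its reserved device names.
ILLEGAL_CHARS = set('<>:"/\\|?*')
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def windows_problems(name: str) -> list:
    """Return human-readable reasons why `name` cannot exist on Windows."""
    problems = []
    bad = sorted(c for c in set(name) if c in ILLEGAL_CHARS or ord(c) < 32)
    if bad:
        problems.append("illegal characters: " + " ".join(repr(c) for c in bad))
    # Device names are reserved with or without an extension (e.g. "aux.txt").
    stem = name.split(".")[0].upper()
    if stem in RESERVED_NAMES:
        problems.append("reserved device name: " + stem)
    if name.endswith((" ", ".")):
        problems.append("ends with a space or dot")
    return problems


for candidate in ["report.txt", "meeting 10:30.txt", "aux.txt"]:
    print(candidate, "->", windows_problems(candidate) or "ok")
```

A Mac-only cluster could skip the check entirely; the warning would only become relevant once a device with these restrictions joins the folder.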

canton7 · January 27, 2016, 6:21pm

That’s basically the same as the sha256 option, surely? Give it a really cryptic name that’s likely to be unique, and let the user figure out what it is and rename it later.

calmh · January 27, 2016, 6:23pm

We sort of do what you suggest, in that a file being downloaded is first saved to a temporary file, then renamed to the target. Except the temp file name is based on the actual file name, so that fails as well in your scenario. We could change that - we already do hash-based temp files when the file name is too long. But it wouldn’t help much, as you’d just gain a hidden, strangely named temp file. Then what?

As for data loss… The scenarios I outlined in my first response are all scenarios that could cause data loss if implemented. Currently there is no loss, there is just lack of synchronization.

I’m sure there is, but so far we’re not clever enough to find it.
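The temp-file scheme described in this post could be sketched roughly as follows, in Python. The prefix, suffix, length limit and the `always_hash` switch are assumptions made for illustration, not Syncthing's actual constants or code.

```python
import hashlib

TEMP_PREFIX = ".syncthing."  # assumed prefix, for illustration only
TEMP_SUFFIX = ".tmp"         # assumed suffix, for illustration only
MAX_NAME_BYTES = 255         # assumed file name limit of the file system


def temp_name(final_name: str, always_hash: bool = False) -> str:
    """Build a temporary name for a file that is still being downloaded."""
    candidate = TEMP_PREFIX + final_name + TEMP_SUFFIX
    if always_hash or len(candidate.encode("utf-8")) > MAX_NAME_BYTES:
        # Fall back to a hash of the name: fixed length, portable characters only.
        digest = hashlib.sha256(final_name.encode("utf-8")).hexdigest()
        candidate = TEMP_PREFIX + digest + TEMP_SUFFIX
    return candidate


print(temp_name("foo:bar.txt"))                    # still contains the colon
print(temp_name("foo:bar.txt", always_hash=True))  # portable, but cryptic
```

With `always_hash=True` the download itself would succeed on Windows, but the final rename to `foo:bar.txt` would still fail, which is the "then what?" raised above.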

usernamegoeshere · January 27, 2016, 6:25pm

Well, maybe I am misusing or abusing Syncthing then. People around here use Syncthing to create a mirror of their stuff in a different place. So the first place dies, and the second place never received the oddly named file. This is called data loss where I dwell :), isn't it? Thanks.

Okay, if all else fails, then Syncthing users at least need to be made aware when they are participating in a cross-platform setup with these kinds of limitations: that the limitations exist, and that they are potentially missing files or not delivering files to everybody they cooperate with. I learned it the hard way. And the longer cross-platform folks cooperate with Syncthing, the larger their trouble will get, as more files with incompatible names amass and they will have a much harder time sorting it out later. The earlier they learn to ‘behave’ well with each other, or to use only characters common to all platforms, the better. I would still very much prefer a suffix added to filenames to show the meta state, with the cross-platform filenames handled in some side-channel way or something…

calmh · January 27, 2016, 6:26pm

You’ll get a chorus of people saying “Syncthing isn’t a backup solution” in a moment. There is some truth in that. And it seems you want a backup solution, probably one that does snapshotting and saves things not under their original name. Not least because Syncthing syncs the deletes as well (although versioning does help a bit).

That’s not to say this is not a problem that I’d like solved. But I’m not buying into the data loss part of it, currently.

On the same topic, there’s another scenario you want to be aware of that does cause genuine data loss - having files file.txt and File.txt on a Unix box (case sensitive file system) and syncing those to a Mac or Windows box (case insensitive file system). This has a higher priority in my mind but is unfortunately also a bit tricky for various internal technical reasons.
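The file.txt / File.txt scenario can be detected before syncing with a check along these lines. This is a minimal Python sketch; `case_collisions` is a hypothetical helper, and `str.casefold()` only approximates the case-folding rules that real case-insensitive file systems apply.

```python
from collections import defaultdict


def case_collisions(names):
    """Return groups of names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


print(case_collisions(["file.txt", "File.txt", "readme.md"]))
# [['File.txt', 'file.txt']]
```

On a case-insensitive destination each such group maps to a single on-disk file, so whichever member is written last silently replaces the others.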

usernamegoeshere · January 27, 2016, 6:31pm

Actually, thinking about it on a larger scale, or more universally: how do SMB folks or networking people cope with these problems? What if a Mac user wants to put a file with a colon in its name on a Samba share and a Windows user wants to fetch it? Has nobody else outside Syncthing run into this or similar interop problems? How has anyone else solved this? I bet we are not the first to discover this mess across platforms. I wonder.

calmh · January 27, 2016, 6:32pm

In an SMB setup that gets refused with an error message at file creation time, so they can’t. We lack that central authority. I don’t know of any good precedents for how this is handled by other cross platform client-to-client sync things. I welcome any examples.

usernamegoeshere · January 27, 2016, 6:33pm

Yes, you are right. There are too many platforms out there. So much work and so many workarounds needed everywhere. I would like to teach users to behave in certain ways, if Syncthing would also help in this matter: tell the Syncthing user more prominently that he is doing it wrong. Maybe they will all learn to avoid the common problems and shortcomings of the industry.

usernamegoeshere · January 27, 2016, 6:35pm

Yeah, that was only an example. Maybe the upper/lower case problem from your example is more likely on SMB, but I think they map the filenames to some unified case or something. Anyway, maybe I can look at BitTorrent Sync and see how, or if, they solve this.

Btw, they just recently came up with encrypted folders. Is Syncthing going to integrate that functionality soon too (I have seen some pull request)? Thanks.

calmh · January 27, 2016, 6:36pm

It needs someone to do it, is all. And it’s not particularly easy.

calmh · January 27, 2016, 6:44pm

For what it’s worth, scp for example doesn’t handle this particularly gracefully either. Source:

jb@zlogin4:~ $ uname
SunOS
jb@zlogin4:~ $ ls -l test/
total 1
-rw-r--r-- 1 jb other 6 Jan 27 19:41 file.txt
-rw-r--r-- 1 jb other 6 Jan 27 19:41 File.txt
jb@zlogin4:~ $ cat test/file.txt 
file1
jb@zlogin4:~ $ cat test/File.txt 
File2
jb@zlogin4:~ $ scp -r test/ 172.16.32.182:
file.txt                                                                             100%    6     0.0KB/s   00:00    
File.txt                                                                             100%    6     0.0KB/s   00:00    
jb@zlogin4:~ $

Destination:

jb@syno:~ $ uname
Darwin
jb@syno:~ $ ls -l test
total 8
-rw-r--r--  1 jb  staff  6 Jan 27 19:43 file.txt
jb@syno:~ $ cat test/file.txt 
File2
jb@syno:~ $

That’s the wrong data in that file, and the other disappeared.

rsync does the files in the other order (File.txt, then file.txt) so the data is correct. However changing File.txt on the source and rsyncing again then overwrites file.txt (of course).

usernamegoeshere · February 9, 2016, 9:36am

I have one more argument actually why it would be great if we could spread raw files around even when a filename is not locally assignable.

My steady Windows node, for example, is acting as a sort of hub that is always online. The Macs rely on it and are not always available. So the Mac users create their incompatible filenames and the hub unfortunately doesn't receive these files. The first Mac then leaves again, and another Mac joins at a later time but never sees or receives the file, as the first Mac could never push it to the online Windows node.

I am using Syncthing for collaboration. Any chance of raw transfers and metadata separation? Thanks.

AudriusButkevicius · February 9, 2016, 9:59am

I think this is a fair amount of effort for a limited set of use cases, with limited yields. It’s probably easier for you to spin up a Linux VM in Hyper-V on your hub.

usernamegoeshere · February 9, 2016, 10:53am

Isn't ‘my’ idea with the raw files necessary in general, in just the same way, for the encrypted folders idea and for many other things? Why not care for files and their unique SHA sum and work from there, instead of not working at all just because of charset limitations and such? I don't understand why my idea seems to be dismissed so massively and my point is not being seen. It would benefit the whole idea in general if we regarded files as raw content, taking into account their content and fingerprint rather than their timestamp (probably not even that). Everything else can or should be handled as metadata on top of that. Am I so wrong? I don't feel like I am asking for an exotic feature. I am hitting a real-world shortcoming of Syncthing, although there is no sane reason for it not to handle the actual payload and data neutrally, independent of any naming or character set limitations. Data is what matters here, is what I think. Thanks.

canton7 · February 9, 2016, 11:02am

Syncthing doesn’t just synchronize databases of files: it also has to interact with files as they exist on disk. Saying that the filename is “metadata” is all well and good, but it actually has to write a file with that name on all devices that that file is synchronized to.

When you start saying that “File X is called X.txt on-disk, but the metadata database actually says it’s called X́.txt”, then things start getting much more complicated.

Encrypted folders are different because the files don’t appear on-disk in the same way, so “your” idea is more applicable. However encrypted folders are an awful lot of work (as you can see in the other forum post) and the work is still ongoing: this stuff isn’t easy.

Sure you could make a case for a third type of folder which is just used as a central point of synchronization, where the files don’t exist on-disk with the same name they’re given on other devices (so you have “normal” folders, encrypted folders, and your third type of folder). This would however be a pretty significant chunk of work for a very niche use-case (“normal” folders work for this 99.999% of the time).

EDIT: And I imagine encrypted folders - if and when they arrive - will cover the other 0.001% of use-cases.

AudriusButkevicius · February 9, 2016, 11:14am

Essentially @canton7 has produced the same answer as I would have.

So it’s true that we could have a special type of folder which doesn’t store files, but stores loose blocks, completely ignoring metadata keeping it in the db and things as such, but as I said, it’s a fair amount of effort, with things such as reconciling, state checking, refcounting blocks etc. If you feel like you want to work on this, I am happy to guide you and help you.

usernamegoeshere · February 9, 2016, 1:20pm

I actually thought about a solution for file names. Does Syncthing have in-flight filenames while a file is actually being transferred (e.g. on the empty side, when a new file gets created and saved and then appears in the folder on the remote side)? If yes, then we could keep this in-flight filename if the actual final filename is incompatible with the local system. Also: if (as far as I understand it does) Syncthing has internal and reserved filenames inside the actual user data folders, then we could use this reserved-name mechanism to create something like the intermediary, temporary in-flight filename described above. If the real name is incompatible with the local character set or file system, the name could instead be generated by some transcription or re-encoding system, for example something like uuencode, or a common subset of characters that is supported everywhere. All Syncthing instances on all participating nodes could then check for themselves whether they can decode this encoded filename and name the file as it was originally meant to be, or whether they need to keep the filename encoded in the lowest-common-denominator way for a while. Does this sound like a valid idea and approach? Thanks for helping.
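The re-encoding idea in this post could look something like the Python sketch below. It uses percent-encoding rather than uuencode, which is a substitution made here for illustration, and the helper names `encode_name` and `decode_name` are hypothetical. Nothing in this sketch exists in Syncthing.

```python
from urllib.parse import quote, unquote


def encode_name(name: str) -> str:
    """Percent-encode everything except letters, digits, '_.-~' and spaces."""
    return quote(name, safe=" ")


def decode_name(encoded: str) -> str:
    """Reverse encode_name without needing any lookup table."""
    return unquote(encoded)


original = "foo:bar.txt"
encoded = encode_name(original)
print(encoded)                           # foo%3Abar.txt
print(decode_name(encoded) == original)  # True

# A file literally named "foo%3Abar.txt" has to be escaped as well,
# otherwise it would decode to "foo:bar.txt" on the receiving side.
print(encode_name("foo%3Abar.txt"))      # foo%253Abar.txt
```

The round trip needs no stored mapping, but only if every node treats every on-disk name as encoded. A node that encodes names only when they are incompatible can no longer tell an encoded `foo:bar.txt` from a file the user really did name `foo%3Abar.txt`. Non-ASCII names also become unreadable under this particular encoding.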

AudriusButkevicius · February 9, 2016, 1:36pm

The in flight names are same as final names just with a prefix and a suffix.

If you start reencoding stuff you need to maintain a list of what you reencoded to be able to look it up. Also, if you drop the database and lose the mappings, the file will be considered new, etc.

You could argue that you name the files in some reserved way that people are not allowed to name, but then I am not sure if everyone would want this behaviour, and some might still want to drop the files.

So yes, this is solvable, but the right question to ask is who is interested in spending time on this and only then how do we solve it.