Symbols & Non-English Alphabets in Filenames

[ Backstory: Until there is some system in place for substituting or compensating for non-English alphabets and other symbols in filenames, I’ve been solving naming problems successfully on a file-by-file basis, so I thought I’d share my experiences. I couldn’t find a reliable (much less cross-platform) tool or library to convert character sets and combine accents. Many tools, like the excellent convmv, which I use on my Linux and Mac OS X clients, can enforce NFC UTF-8, but I still haven’t found one with combining-character to singleton conversion. Most of my success in renaming stemmed from reading the relevant Syncthing open issue (an enhancement request) on GitHub. Subscribe (or contribute!) if you’re so inclined. If anyone has any corrections or contributions here, please do tell! ]


Filenames must consist entirely of NFC UTF-8 characters.

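In a nutshell, this means every character in your files’ names must be a single Unicode symbol, letter, or number: an accented letter has to be stored as one precomposed character, not as a base letter followed by a separate combining mark. Find the single Unicode character for the needed letter, including any accent, and substitute it into the file’s name.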

There is no Web GUI warning when files can’t be handled by Syncthing. Short of looking through the logs, the only symptom users will notice is that files with non-compliant filenames are not synchronized.

Unicode includes almost every character you can think of (plus oodles you can’t), so the name you want really isn’t the problem; it can instead be reduced to an encoding problem. For example, none of my Russian files were being picked up. Their names were not compliant because, rather than single Unicode characters, they included letter plus diacritical mark combinations. The correct and incorrect encodings are visually indistinguishable. For example, this: й (U+0439) is one character, while this: й (U+0438 U+0306) is two characters: the similar letter и plus the combining diacritic breve, which looks like ˘
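To make the difference visible, here is a minimal Python sketch that checks whether a name is already in NFC form. It uses only the standard unicodedata module; the helper name is_nfc and the two sample strings are my own, written with escape sequences so that nothing can normalize them silently along the way.

```python
import unicodedata

# The same visible letter, stored two different ways.
composed = "\u0439"          # one precomposed code point
decomposed = "\u0438\u0306"  # base letter followed by COMBINING BREVE

def is_nfc(name: str) -> bool:
    """Return True if the name is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", name) == name

if __name__ == "__main__":
    for label, text in (("composed", composed), ("decomposed", decomposed)):
        print(label, len(text), is_nfc(text))
```

Both strings render identically, but only the first passes the check; the second is the kind of name that, in my experience, does not sync.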

Wikipedia gives another nice example of how encodings can differ, usually either for compatibility between systems or to preserve the actual meaning of distinct symbols that may nonetheless appear identical:

…distinct Unicode strings: “U+212B” (the angstrom sign “Å”) and "U+00C5" (the Swedish letter “Å”) are both expanded by [decomposition] into the sequence "U+0041 U+030A" (Latin letter “A” and combining ring above “°”) which is then reduced by [composition] to "U+00C5" (the Swedish letter “Å”). —Unicode Equivalence, Normal Forms
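The decomposition and composition steps in that quote can be reproduced with Python's standard unicodedata module. This is only an illustration of the quoted example; the variable names are mine.

```python
import unicodedata

angstrom_sign = "\u212b"   # the angstrom sign
swedish_a_ring = "\u00c5"  # the Swedish letter
expanded = "\u0041\u030a"  # Latin letter A + combining ring above

def nfd(text: str) -> str:
    """Canonical decomposition."""
    return unicodedata.normalize("NFD", text)

def nfc(text: str) -> str:
    """Canonical decomposition followed by canonical composition."""
    return unicodedata.normalize("NFC", text)

if __name__ == "__main__":
    # Both starting points expand to the same two-code-point sequence...
    print(nfd(angstrom_sign) == expanded, nfd(swedish_a_ring) == expanded)
    # ...and both compose back to the Swedish letter, never the angstrom sign.
    print(nfc(angstrom_sign) == swedish_a_ring, nfc(expanded) == swedish_a_ring)
```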

Fixing it

So, if you know the name of the letter, diacritical mark, or symbol breaking your sync, look it up in a Unicode character list, then copy and paste the single precomposed character into the filename that isn’t syncing. This will work on any OS.
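If there are too many files to fix by hand, the copy-and-paste step could in principle be scripted, since Python's standard unicodedata module can do the composing. The following is a rough sketch and not a tested tool: the function names (nfc_name, plan_renames, rename_tree) are hypothetical, it defaults to a dry run, and it assumes a filesystem that stores names exactly as given. Some filesystems, notably on Mac OS X, may apply their own normalization, so check the result before trusting it.

```python
import os
import unicodedata

def nfc_name(name: str) -> str:
    """Compose base letter + combining mark pairs where Unicode allows it."""
    return unicodedata.normalize("NFC", name)

def plan_renames(names):
    """Return (old, new) pairs for the names that NFC would change."""
    return [(n, nfc_name(n)) for n in names if nfc_name(n) != n]

def rename_tree(root: str, dry_run: bool = True):
    """Walk root bottom-up and rename entries whose names are not NFC.

    With dry_run=True (the default) nothing is renamed; the planned
    (source, destination) paths are only returned for inspection.
    """
    changed = []
    # Bottom-up, so a directory's contents are handled before the directory.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for old, new in plan_renames(dirnames + filenames):
            src = os.path.join(dirpath, old)
            dst = os.path.join(dirpath, new)
            if os.path.exists(dst):
                continue  # never overwrite an existing entry
            if not dry_run:
                os.rename(src, dst)
            changed.append((src, dst))
    return changed
```

Calling rename_tree on a folder and reading the returned list first, then repeating with dry_run=False, keeps the destructive step explicit.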

If you have a choice of platform, type the filename out or insert the symbol on Windows. Keyboard-generated diacriticals and combining characters from Mac OS X (e.g. option-c yields ç) seem to be composed to single NFC characters, but many inserted from Character Viewer or simply typed on non-U.S. keyboard layouts are not.

You may want to use some Unicode-savvy tool to show you what character you actually have. In rishida.net’s user-friendly Unicode conversion tool, for example, any text pasted into the first field is shown as its constituent entities below once you hit the Convert button. In particular, look to see if your symbol or accented letter “costs” two codepoints in the “Unicode U+Hex Notation” conversion box. As above, my й converts to one: U+0439; but my й costs two: U+0438 U+0306. Only the single-codepoint version syncs.
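If you would rather check locally, a few lines of Python give roughly the same U+Hex view. This is just a sketch using the standard unicodedata module; the helper names codepoints and describe are my own and have nothing to do with the web tool.

```python
import unicodedata

def codepoints(text: str) -> str:
    """Render each code point in U+Hex notation, separated by spaces."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

def describe(text: str):
    """Pair each code point with its official Unicode character name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

if __name__ == "__main__":
    # Escapes are used so the two forms cannot be confused in an editor.
    for sample in ("\u0439", "\u0438\u0306"):
        print(codepoints(sample))
        for code, name in describe(sample):
            print("   ", code, name)
```

A name that “costs” more code points than it has visible letters is a good candidate for the problem described above.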