Overview of files

sciurius · April 1, 2016, 8:04am

If a folder is SyncThinged and has a .stignore file, is it possible to obtain a list of files that are eligible to sync, for the sake of checking the ignore patterns?

calmh · April 1, 2016, 8:10am

No.

(That would require actually scanning ignored directories, which we don’t, because they’re ignored. ;))

sciurius · April 1, 2016, 8:26am

My intention was to obtain a list after the ignore patterns are applied, so no scanning of ignored directories would be involved.

calmh · April 1, 2016, 8:58am

Ah, a list of the not ignored files. Sorry, I understood you to ask for the opposite.

Hmm. I don’t think we have this available, no. It would be an exceedingly expensive question to answer as it essentially entails dumping the entire database.

AudriusButkevicius · April 1, 2016, 9:03am

You are better off walking the filesystem yourself and matching paths against the regexps the /ignore api call returns.
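
A minimal sketch of that approach in Go, assuming the patterns have already been fetched from the API as a list of regular-expression strings, and assuming each one is meant to match a slash-separated path relative to the folder root. The function names (compilePatterns, isIgnored, listEligible) and the two sample patterns are hypothetical; negated patterns and Syncthing's internal ignores are not handled here.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"regexp"
)

// compilePatterns compiles the pattern strings (assumed to be Go-compatible
// regular expressions) and stops at the first one that fails to compile.
func compilePatterns(exprs []string) ([]*regexp.Regexp, error) {
	res := make([]*regexp.Regexp, 0, len(exprs))
	for _, e := range exprs {
		re, err := regexp.Compile(e)
		if err != nil {
			return nil, err
		}
		res = append(res, re)
	}
	return res, nil
}

// isIgnored reports whether the relative path matches any pattern.
func isIgnored(rel string, patterns []*regexp.Regexp) bool {
	for _, re := range patterns {
		if re.MatchString(rel) {
			return true
		}
	}
	return false
}

// listEligible walks root and returns the relative paths of all files that
// no pattern matches. Directories that match are skipped entirely.
func listEligible(root string, patterns []*regexp.Regexp) ([]string, error) {
	var out []string
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, p)
		if err != nil {
			return err
		}
		if rel == "." {
			return nil // the root itself is never listed
		}
		rel = filepath.ToSlash(rel)
		if isIgnored(rel, patterns) {
			if d.IsDir() {
				return filepath.SkipDir
			}
			return nil
		}
		if !d.IsDir() {
			out = append(out, rel)
		}
		return nil
	})
	return out, err
}

func main() {
	// Sample patterns for illustration only; real ones would come from the API.
	patterns, err := compilePatterns([]string{`^\.git$`, `^.*\.tmp$`})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	files, err := listEligible(".", patterns)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```

Skipping a matched directory with filepath.SkipDir keeps the walk from descending into ignored trees, which is the same cost saving discussed above.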

sciurius · April 1, 2016, 11:46am

Does this reflect all ignores, including silently ignored files? Or is it just a translation of .stignore rules into patterns?

Also, when the .stignore starts with a UTF8 BOM, this is not recognized and the BOM is prepended to the first entry from the file. For example, when the file starts with “!.git”, the first pattern is “^\x{feff}!\.git$” (and doesn’t have the desired effect).

calmh · April 1, 2016, 11:57am

No. For example, Syncthing’s own temp files are always ignored, as are .stignore, .stfolder and .stversions.

AudriusButkevicius · April 1, 2016, 11:59am

I think that includes only files ignored by .stignore. The files that are ignored internally to date are .stversions, .stfolder, .stignore.

Regarding the BOM, it’s a genuine character that is there but your editor chooses not to display it. You could put a lot of unprintable characters in there, and I don’t think we should be responsible for the sanitisation of them.

Though you can make a pull request if you really care about this.

canton7 · April 1, 2016, 12:24pm

The Unicode spec states:

Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space

I don’t know what counts as a “System” in their language: is Syncthing a “System”, in which case it can choose to not use a byte order mark?

OTOH, even Notepad adds a BOM when saving as UTF-8, so it might make sense to not render it in the name of being vaguely compatible with the tools included with major systems.

calmh · April 1, 2016, 12:24pm

It’s an annoying character. Ostensibly it’s a space so strings.TrimSpace that we run on the string should eat it, but it doesn’t. unicode.IsSpace doesn’t think the BOM is a space character. I’m not sure if a zero width nonbreaking space is supposed to be a space or not. It’s non-breaking so we can’t do a word break there and it’s zero width so it’s not visible. Not many space-ish properties left.
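
A small Go check of the behaviour described above, using only the standard library. The helper name bomSurvivesTrim and the sample line are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// bomSurvivesTrim reports whether the line still starts with U+FEFF after
// strings.TrimSpace has been applied to it.
func bomSurvivesTrim(line string) bool {
	return strings.HasPrefix(strings.TrimSpace(line), "\ufeff")
}

func main() {
	line := "\ufeff!.git  " // BOM, then the pattern, then trailing spaces

	// U+FEFF does not have the White_Space property, so IsSpace is false.
	fmt.Println(unicode.IsSpace('\ufeff'))

	// TrimSpace removes the trailing spaces but leaves the BOM in place.
	fmt.Println(bomSurvivesTrim(line))
	fmt.Println(len(strings.TrimSpace(line))) // 3 BOM bytes + 5 bytes of "!.git"
}
```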

On the other hand it’s apparently expected for utf-8 files in Windows so we should probably support it.

calmh · April 1, 2016, 12:26pm

I’d rather define us as a “protocol” that always uses utf-8 and defer to IETF:

The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it “SHOULD forbid use of U+FEFF as a signature.”

But no, we should strip the sucker as it’s expected on Windows. You’ll be hard pressed to make me generate one on save though. And this’ll all blow up in our faces if we ever start syncing that file, which we probably won’t.

canton7 · April 1, 2016, 12:29pm

From package text/scanner:

If the first character in the source is a UTF-8 encoded byte order mark (BOM), it is discarded.

So it should be gone already, no?

calmh · April 1, 2016, 12:30pm

Nope, we use a bufio.Scanner not a text/scanner.Scanner, but that fooled me also when first googling it.

canton7 · April 1, 2016, 12:30pm

Aah damnit.

calmh · April 1, 2016, 12:32pm

But it should be a matter of

if strings.HasPrefix(line, "\ufeff") {
  line = line[3:] // it must be utf-8, so three bytes for that thing
}

I’m sure there’s a BOMStrippingReader type somewhere out there as that’s fairly trivial, too. I guess we should praise the encoding gods that Notepad doesn’t default to UTF-16 or something.

Edit: Indeed: https://github.com/spkg/bom/blob/master/bom.go#L28