[Linux] Are periodic full rescans really needed when FS watcher is enabled?


#1

This has been imported from: https://github.com/syncthing/syncthing/issues/5357

Currently, when the FS “watcher” is enabled (which uses the Linux inotify API), the periodic full rescan interval is automatically raised from 1 minute to 60 minutes.

This sits somewhat between two worlds: we watch for changes in near real time, but still schedule full rescans because the watcher is considered unreliable.

This issue is to discuss the possibility of disabling the periodic full rescan in that case, which would improve performance.

The initial scan would still be needed:

  • to detect changes done when Syncthing was not running
  • to avoid race conditions at startup and define a point in time (after the inotify watches have been started), when all changes are guaranteed to have been processed

According to @calmh, the periodic full rescans have been kept enabled because there still exists some cases where events can be missed. See here for some context: #5353 (comment)

I would love to work on fixing this, but I would probably need some guidance because I am not familiar with the codebase or the Go language.


(Jakob Borg) #2

As you yourself noted in the original issue, the path to accomplish no periodic scans is to ensure each and every change we care about generates an event, on every platform we run on[1]. And then make sure we can never drop events without knowing about it. I think this is difficult as you end up having to prove a negative.

1) Yes, I see the "[Linux]" prefix, but that's not generally how we operate. FS notifications are supported on the BSDs, Windows, macOS and Solaris as well (I think).

(Audrius Butkevicius) #3

I don’t see anything to “fix” here. It’s working as intended from my point of view, as inotify is by definition unreliable.


#4

Why do you say it is “by definition unreliable”?


(Antony Male) #5

I hope you’re willing to field all the support requests from people who have fallen through one of the cracks and have things that are inexplicably not syncing… Guaranteeing that every possible way of modifying a file is caught by inotify in all cases (including lots of exciting races) falls into the category of “hard” programming problems, and every imperfection is going to annoy users and generate hard-to-debug support requests.

If you personally want to increase your rescan interval, go ahead, but pushing this sort of thing on other users is risky.


#6

There seems to be a consensus among you developers that inotify is not reliable.

I am not denying this, I am just trying to understand why it is not reliable and if it can be improved by fixing corner cases. Pointers to specific examples where inotify can fail to get events would be much appreciated.

If the OS compatibility or the risk of regressions are (understandably) worrying you, let this issue be about fixing missed inotify events, and not changing anything else.

I know that some filesystems (FUSE…) don’t propagate events properly. I know that yes, you have to be careful with races, and full rescans would still be needed to guarantee atomicity.

What else is so bad about inotify?


(Antony Male) #7

For me, experience mainly. I’ve mostly worked with filesystem notifications on Windows (the counterpart of inotify there), but there have always been cases where things weren’t reported, and it’s very hard to track down exactly what went wrong. All developers working with filesystem notifications learn sooner or later that the only workable strategy is to build something based on scanning, then use the notifications to trigger limited scans.
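The “scan first, use notifications to trigger limited scans” strategy described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not Syncthing’s actual (Go) implementation: notifications never apply changes directly, they only mark subtrees dirty, and the periodic full scan remains a safety net.

```python
import time

class HybridScanner:
    """Sketch of the hybrid watcher + scanner pattern (hypothetical API)."""

    def __init__(self, scan_subtree, full_scan_interval=3600):
        self.scan_subtree = scan_subtree        # callback: scan one directory tree
        self.full_scan_interval = full_scan_interval
        self.last_full_scan = 0.0
        self.pending = set()                    # paths flagged by the watcher

    def on_fs_event(self, path):
        # A notification is only a hint: mark the subtree dirty so a cheap,
        # limited scan remains the single source of truth.
        self.pending.add(path)

    def tick(self, now=None):
        now = time.time() if now is None else now
        scanned = []
        # Limited scans for everything the watcher flagged since last tick.
        for path in sorted(self.pending):
            self.scan_subtree(path)
            scanned.append(path)
        self.pending.clear()
        # Periodic full scan as a safety net for missed events.
        if now - self.last_full_scan >= self.full_scan_interval:
            self.scan_subtree("/")
            scanned.append("/")
            self.last_full_scan = now
        return scanned
```

The key property is that a lost notification degrades latency (the change is picked up at the next full scan) rather than correctness.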

Also bear in mind that inotify is handled through a package which abstracts over the differences between the notification APIs on different platforms.


(Audrius Butkevicius) #8

The reasons for inotify being unreliable were explained in a previous post: queue size when renaming large directory trees, and not all events on all systems are reported. Be my guest, solve all of these on all platforms so we can disable scanning.


#9

Queue size when renaming large directory trees,

If I understand correctly, this is an issue in Syncthing code only, and has nothing to do with inotify reliability. Also, the overflow can be, and currently is, detected, so the condition that can lead to losing events is already worked around.

not all events on all systems are reported.

Agreed, there are cases when events are just not generated. But this is a documented limitation, which does not explain your “by definition unreliable” statement.

Be my guest, solve all of these on all platforms so we can disable scanning.

Ignoring your tone, and quoting myself:

If the OS compatibility or the risk of regressions are (understandably) worrying you, let this issue be about fixing missed inotify events, and not changing anything else.


(Audrius Butkevicius) #10

By definition unreliable for the purpose Syncthing needs. Permissions, mtime changes, etc., are all things that we need.

If we were ever to address this, the first step would be to measure how much is not picked up by inotify, to have a metric to reliably tell that we can disable periodic scans.
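One hedged sketch of such a metric: after each periodic scan, diff the set of paths the scan found changed against the set the watcher reported in the meantime. The helper below is hypothetical, not Syncthing code; it just illustrates what “how much is not picked up” could mean concretely.

```python
def measure_watcher_misses(watcher_paths, scan_changed_paths):
    """Compare paths the watcher reported against a full scan's findings.

    watcher_paths: set of paths the notification backend flagged since the
    last scan. scan_changed_paths: set of paths the rescan found actually
    changed. Returns (missed, spurious, miss_rate).
    """
    missed = scan_changed_paths - watcher_paths    # changed, never reported
    spurious = watcher_paths - scan_changed_paths  # reported, but unchanged
    total = len(scan_changed_paths)
    miss_rate = len(missed) / total if total else 0.0
    return missed, spurious, miss_rate
```

A miss rate that stays at zero across many scan cycles would be the kind of evidence needed before periodic scans could be disabled with confidence.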


(Jakob Borg) #11

There are also practical considerations. In most cases a periodic scan now and then doesn’t harm, and it means we can be fine with a 99% solution for the notifications. If the alternative is hundreds of engineering hours on building, testing and verifying a 100% perfect notification backend and corresponding test suite … I doubt it’s worth it. But of course, if that’s what you want to do that’s great. We’ve all spent lots of time solving puzzles others thought were silly. :slight_smile:


#12

The current hybrid “inotify + full scan polling” approach is a pragmatic solution, and probably the best one if the inotify reliability issues are real. But it is still a suboptimal workaround for what should, in theory, be fully handled by inotify.

Yes, most users don’t care about a full scan from time to time. Yet some do, like the user from the original issue who can’t use his Raspberry Pi for 10 minutes during a scan. There are also other valid use cases where the choice between high change-detection latency and overly frequent, useless scans is a problem: for example, I have a NAS that is idle most of the time, and its hard drives and fans spin down after a period of inactivity, which saves power and makes less noise. The periodic scan prevents them from spinning down. Sure, I can set the interval to one day or more, but that means no more fast sync.

Maybe I’m arrogant or naive, but I thought we could improve the current status quo without spending hundreds of hours of effort.

Sorry, but I like to understand things, and I am not satisfied with the “it does not always work; work around it and be done with it” approach.

Maybe the issues you had with inotify were due to kernel bugs that have since been fixed? Maybe it’s an obscure issue that nobody else has hit, and that has a one-line fix in the kernel (I have already had one case like this in the past)?

That part of the kernel recently gained automated tests: https://github.com/linux-test-project/ltp/tree/master/testcases/kernel/syscalls/inotify They are pretty basic, and nowhere near the kind of stress that Syncthing would generate, but there is a chmod test.

I bet many people at the time thought that writing a glorified distributed rsync clone with a web UI, in an alpha-state programming language, qualified as silly. Yet it now has thousands of happy users (including me). :slight_smile:

Perfect is the enemy of good… but good enough is the enemy of great.


#13

What I would like to do first is to reproduce a case where an inotify event is missed.

There are many moving parts and possible faulty components (kernel, libc, Syncthing code…) and test parameters (filesystem, file size, pattern of file changes…), so any pointer or tip on reproducing this is welcome.


(Jakob Borg) #14

Yeah, that was pretty much my point. Maybe I’m overstating the difficulty, maybe I’m not. No one will ever know unless you try. But beware of the rabbit hole.


(Simon) #15

Most timely, an example appeared: https://github.com/syncthing/syncthing/issues/5360

That’s an error from the library that abstracts the different backends. The report is not super clear, but I interpret it as the watcher failing when it encounters a symlink pointing at nothing. That might be an issue in the library or in kqueue/FreeBSD.


(lf) #16

The inotify(7) man page says this:

“Note that the event queue can overflow. In this case, events are lost.”

This is the official documentation of Linux’s inotify interface, telling us that inotify does not report all filesystem events, by design.

Yes, we all agree that it would be great if Linux and other OSs had a reliable filesystem events notification system, but they do not.


#17

Sure, it can fail to propagate events if there are too many of them, but:

  • in that case an event reporting the overflow is generated (and that one cannot be lost, by design)
  • the queue size limit can be increased
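For reference, on Linux the queue and watch limits are ordinary sysctls that can be raised persistently. The values below are illustrative, not recommendations; defaults vary by distribution:

```
# /etc/sysctl.d/90-inotify.conf — raise inotify limits (illustrative values)
fs.inotify.max_queued_events = 65536
fs.inotify.max_user_watches = 524288
```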

Just like Syncthing currently detects when it hits the max_user_watches limit, to warn the user and trigger a scan, it could do the same for the queue size limit. Actually, I haven’t checked, but that may already be what is done.
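The overflow case is at least detectable: when the queue overflows, the kernel delivers a synthetic event whose mask carries the IN_Q_OVERFLOW bit (0x4000 in <sys/inotify.h>). A minimal sketch of the fallback logic, with hypothetical callback names (this is not Syncthing’s actual code path):

```python
# IN_Q_OVERFLOW as defined in the Linux inotify headers.
IN_Q_OVERFLOW = 0x00004000

def handle_event_mask(mask, on_change, on_overflow):
    """Dispatch one raw inotify event mask (hypothetical handler).

    Returns "rescan" when events were dropped and a full rescan of the
    watched tree is the only safe recovery, "incremental" otherwise.
    """
    if mask & IN_Q_OVERFLOW:
        on_overflow()       # e.g. warn the user and schedule a full scan
        return "rescan"
    on_change()             # e.g. schedule a limited scan of the path
    return "incremental"
```

The point is that overflow turns into a known, handled degradation (one extra scan) rather than silently lost changes.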

“Reliable” does not mean it is a perfect silver bullet, just that it works consistently, as documented.


(Simon) #18

I think it’s a bit beside the point to discuss whether or not there are issues by design and whether anything else is really problematic. Hourly full scans are just a default setting that is fine for many use cases. It’s not something forced on the user; you can entirely disable periodic scans if you wish to do so.


#19

As I said before, I don’t aim (anymore) at changing the safe but suboptimal default behavior.

I only want to find and fix the cases where the “watcher” misses changes. Every user can then decide whether or not enabling the periodic scans is worth it.

By the way, it seems that on Android, the periodic scans are disabled when the watcher is enabled. Why is the behavior different from the non-Android version? To save battery?


(Evgeny Kuznetsov) #20

Well, I can give you one example. Linux, an ext filesystem (reproducible on ext3 and ext4; I didn’t test earlier versions, but it shouldn’t be different), and a file that is hard linked elsewhere. Some changes to the file happen elsewhere on the system (i.e., not in the directory we’re watching). Inotify has no way to spot the fact that the file we’re watching has changed and report it. A directory rescan, however, immediately sees that the file has changed.

I’m not really that well versed in how inotify works, but logical thinking suggests that since no access to the file in our watched directory was ever performed to begin with, there’s absolutely no way for inotify to become aware of the changes.
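The hard-link scenario above is easy to reproduce without any watcher at all, because the two names share one inode: writing through the name outside the watched directory changes what a scan of the watched directory sees, even though no operation ever touched the watched path. A small self-contained Python demonstration (paths are made up for the example):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    watched = os.path.join(root, "watched")      # directory a watcher observes
    elsewhere = os.path.join(root, "elsewhere")  # directory it does not
    os.makedirs(watched)
    os.makedirs(elsewhere)

    inside = os.path.join(watched, "file.txt")
    with open(inside, "w") as f:
        f.write("v1")

    outside = os.path.join(elsewhere, "link.txt")
    os.link(inside, outside)                     # second name, same inode

    with open(outside, "w") as f:                # modify via the *other* name
        f.write("version 2")

    # Both names point at the same inode, so a rescan of `watched`
    # sees the new content, although nothing under `watched` was opened.
    same_inode = os.stat(inside).st_ino == os.stat(outside).st_ino
    with open(inside) as f:
        content = f.read()
```

This is exactly the gap a periodic or triggered scan closes: the scan compares on-disk metadata and content, regardless of which path was used to make the change.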