Shutdown on error

Continuing discussion from https://github.com/syncthing/syncthing/pull/6344, @imsodin.

Problem statement

So we have a bunch of “services” running in a tree-type setup. These get restarted when they exit with an error, and we use this both for restarting services on config change (an intentional return) and for handling errors that might be resolved on retry (like failing to open a listen port – the service can’t continue, so it exits, gets restarted, and thus tries again).
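
For illustration, the pattern looks roughly like this (a sketch with made-up names, using the current style of no-error Serve and assuming the standard library "net" package):

// Sketch only: a made-up listener service using the current no-error Serve.
type listenService struct {
  addr string
  stop chan struct{}
}

func (s *listenService) Serve() {
  l, err := net.Listen("tcp", s.addr)
  if err != nil {
    // Can't continue, so just return. The supervisor restarts the service
    // after its backoff, which effectively retries the listen.
    return
  }
  defer l.Close()
  // ... accept connections until Stop() is called ...
  <-s.stop
}

func (s *listenService) Stop() { close(s.stop) }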

In some cases where it’s absolutely clear that we cannot continue we panic, taking the whole program down.

It would be nice if we could stop the program in a more graceful way, when we run into a situation that requires it. The typical example is a database error – if we cannot get or update items in the database any more, then we cannot continue syncing.

Potential solutions

1. Control object

Imsodin already implemented this one, where we pass some handle around everywhere, and a service can tell this handle that everything needs to stop. When that happens, stop is initiated from the top.
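
Roughly, in made-up code (these names are hypothetical, not what the PR actually does; assumes "sync" and github.com/thejerf/suture):

// Hypothetical control object that gets passed around to the services.
type FatalHandler struct {
  once sync.Once
  errC chan error
}

func NewFatalHandler() *FatalHandler {
  return &FatalHandler{errC: make(chan error, 1)}
}

// Fatal can be called from any service that decides the program must stop.
func (h *FatalHandler) Fatal(err error) {
  h.once.Do(func() { h.errC <- err })
}

// At the top, whoever owns the service tree waits for a report and then
// initiates the orderly stop from the top (ignoring normal shutdown for brevity).
func run(sup *suture.Supervisor, h *FatalHandler) error {
  sup.ServeBackground()
  err := <-h.errC
  sup.Stop()
  return err
}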

2. Smarter error returns

One could imagine that if suture were smarter, it could recognize a certain error class as fatal, and when getting one of those from a service it would cease restarts and tear down the tree.

3. Service classification

Again, if suture were smarter it might be possible to specify at a more fine-grained level what should happen when a service fails. Like, “restart forever”, “restart a couple of times and then give up by tearing down the tree”, “service is essential, don’t restart and just tear it down directly”. It’s not clear how we would handle intentional restarts of essential services. Also, it’s often not the service that’s essential, it’s the error that’s fatal, so not sure this is a useful thing…
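
If it existed, it might look something like this (entirely made up, just to make option 3 concrete):

// Entirely made-up classification API.
type FailurePolicy int

const (
  RestartForever FailurePolicy = iota // keep restarting, as today
  RestartLimited                      // a few restarts, then tear down the tree
  Essential                           // no restarts, tear down the tree directly
)

// A hypothetical Add variant:
//   sup.Add(relayListener, RestartForever)
//   sup.Add(databaseService, Essential)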

4. Fuck it just panic

I mean, we’re going down. Things are already dire. Why does it need to be graceful? The panic decision would need to be made at the point where the severity can be determined though. Like, the database doesn’t know the ramifications of returning an I/O error, but the thing on the other end that tried to save something and got the error does. If it’s some statistic, maybe just let it slide. If it’s a fileinfo for a newly synced file, need to abort.
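
Concretely, the decision would sit at call sites like these (made-up examples – the caller knows the severity, the database doesn’t):

if err := db.UpdateStatistics(folder, stats); err != nil {
  l.Infoln("Failed to save statistics:", err) // let it slide
}

if err := db.UpdateFileInfo(folder, file); err != nil {
  panic("cannot record newly synced file: " + err.Error()) // can't continue
}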

5. ???


I’m not super fond of #1; it’s intrusive and a bit weird – contexts and parameters signal downwards, returns signal upwards, imho, and this turns that around.

#2 and #3 don’t exist today, although #2 could possibly be implemented as a service wrapper that takes some sort of control object like #1 (maybe the top level supervisor, for it to stop).

We also have to take into consideration why we want this in the first place.

We want it to be possible to embed syncthing in other processes, whether it’s an Android app or a Qt UI.

For that reason, it needs to be graceful on failure, and in these cases “fuck it and panic” is not a great option.

Right, so #4 is not a real option.

Service returns are actually not handled consistently. Sometimes we only return on unexpected/irrecoverable errors and context cancellation and handle expected errors/restart conditions ourselves, sometimes we use it in “normal control flow”. I actually intended to clean this up independently of the teardown issue, and use suture exclusively for global start/stop and failures (e.g. the service termination logging from the relay listener when the internet is not accessible is abysmal, in my opinion). However, points 2 and 3 would be able to solve that as well.

I really like 2. For 3 my gut says this will get complicated and thus somewhat failure-prone, while service handling is super integral/critical. We already do extend suture somewhat, so why not wrap it entirely and interface with it only through our own wrapper, which has teardown functionality.

Actually, writing this I think suture already has all we need: unset PassThroughPanics on the top-level supervisor and use the failure logger to react to panics. Aka, do 4 gracefully.

My gut also says 2, but I know suture doesn’t really support that, and we’ll have to wrap it even more, to the point where there is not much suture left.

I suspect we might end up having to fork it, as I am not even sure we’d be able to wrap it enough here.

I suspect that the changes we need are extensive enough that it might no longer be suture or fit with the initial design goals, yeah. At least it would be incompatible:

  • Serve methods need to become func Serve() error instead of the current no-error return.
  • A supervisor needs to look at the returned error, and if errors.As(..., &suture.FatalErr) or something like that, then the supervisor itself should stop its other children and return the same error.

This would automatically bubble up, and the top-level Serve() would in the end return the same error…
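
On the supervisor side that could look roughly like this (hypothetical internals of a changed suture, nothing that exists today; stopAllChildren and scheduleRestart stand in for whatever the real mechanics would be):

// Hypothetical handling when a child's Serve returns.
func (s *Supervisor) handleServiceReturn(err error) error {
  var fatal FatalErr
  if errors.As(err, &fatal) {
    s.stopAllChildren()
    return err // the parent supervisor sees the same error and does the same
  }
  s.scheduleRestart() // normal restart/backoff handling as before
  return nil
}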

Our serve methods could then do something like

if err != nil {
  // uh oh, this is a bad one
  return suture.FatalErr(err)
}

and the whole thing comes crashing down.

Maybe this could all be implemented as a wrapper type around suture somehow. Something that would have its own Add() method, wrapping the passed-in Serve() method to inspect the returned error and then call Stop() on the embedded supervisor… And return from a Serve() that would have wrapped the supervisor Serve()…
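
Hacked together without testing, it could look vaguely like this (every name is made up, it assumes "errors" and github.com/thejerf/suture, and it glosses over the failed service being restarted once before the stop takes effect):

// Fatal marks an error as one that should take the whole tree down.
func Fatal(err error) error { return &fatalError{err} }

type fatalError struct{ error }

// errorService is our own service interface, with an error-returning Serve.
type errorService interface {
  Serve() error
  Stop()
}

type fatalSupervisor struct {
  sup   *suture.Supervisor
  fatal chan error
}

func newFatalSupervisor(name string) *fatalSupervisor {
  return &fatalSupervisor{sup: suture.NewSimple(name), fatal: make(chan error, 1)}
}

// Add wraps the service so that a fatal error from its Serve stops everything.
func (f *fatalSupervisor) Add(svc errorService) {
  f.sup.Add(&wrapped{svc, f.fatal})
}

// Serve runs the embedded supervisor until the first fatal error (or Stop),
// then tears it down and returns that error.
func (f *fatalSupervisor) Serve() error {
  f.sup.ServeBackground()
  err := <-f.fatal
  f.sup.Stop()
  return err
}

func (f *fatalSupervisor) Stop() {
  select {
  case f.fatal <- nil:
  default:
  }
}

type wrapped struct {
  svc   errorService
  fatal chan error
}

func (w *wrapped) Serve() {
  err := w.svc.Serve()
  var fe *fatalError
  if errors.As(err, &fe) {
    select {
    case w.fatal <- err:
    default: // a fatal error was already reported
    }
  }
  // Any other return, nil or otherwise, gets the usual suture restart handling.
}

func (w *wrapped) Stop() { w.svc.Stop() }

Services added through this would return Fatal(err) when they can’t continue, and everything above would come down with that error.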

Or is this too hacky? It would be pretty minimally invasive.

I don’t think it’s that trivial. This encourages just panicking when shit is weird, and that panic may well happen in a routine which is not under the supervisor that takes the whole thing down.

I think lib/suture would probably be fine. We already have our own util.ServiceWithName and serve(ctx) error. So I guess this might just be an opportunity to clean this up (have our own service interface with our own service manager methods).
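
Something in this direction, purely as a shape sketch (hypothetical, building on the context-based serve convention we already have; assumes "context"):

// Hypothetical shape of our own service interface.
type Service interface {
  Serve(ctx context.Context) error
  String() string // a name for logging, in the spirit of util.ServiceWithName
}

// A manager owning the suture tree could then expose something like:
//   Add(svc Service)   // wrap svc for suture
//   Fatal(err error)   // tear the whole tree down
//   Serve(ctx) error   // block, return the fatal error if any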