Continuing discussion from https://github.com/syncthing/syncthing/pull/6344, @imsodin.
Problem statement
So we have a bunch of “services” running in a tree-type setup. These get restarted when they exit with an error, and we use this both for restarting services on config change (intentional return) and for handling errors that might be resolved on retry (like failing to open a listen port – the service can’t continue, so it exits, gets restarted, and thus tries again).
In some cases where it’s absolutely clear that we cannot continue we panic, taking the whole program down.
It would be nice if we could stop the program in a more graceful way, when we run into a situation that requires it. The typical example is a database error – if we cannot get or update items in the database any more, then we cannot continue syncing.
Potential solutions
1. Control object
Imsodin already implemented this one, where we pass some handle around everywhere, and a service can tell this handle that everything needs to stop. When that happens, stop is initiated from the top.
2. Smarter error returns
One could imagine that if suture were smarter, it could recognize a certain error class as fatal, and when getting one of those from a service it would cease restarts and tear down the tree.
3. Service classification
Again, if suture were smarter it might be possible to specify at a more fine-grained level what should happen when a service fails. Like, “restart forever”, “restart a couple of times and then give up by tearing down the tree”, or “service is essential, don’t restart, just tear the tree down directly”. It’s not clear how we would handle intentional restarts of essential services. Also, it’s often not the service that’s essential but the error that’s fatal, so I’m not sure this is a useful thing…
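For concreteness, the classification could be something like this – all names invented, as suture has nothing of the sort today:

```go
package main

import "fmt"

// RestartPolicy is a hypothetical per-service classification along the
// lines described above.
type RestartPolicy int

const (
	RestartForever   RestartPolicy = iota // keep restarting on failure
	RestartThenAbort                      // a few retries, then tear down the tree
	Essential                             // don't restart, tear down directly
)

// onFailure decides what the supervisor does after failure number n.
func onFailure(p RestartPolicy, n, maxRetries int) string {
	switch p {
	case RestartForever:
		return "restart"
	case RestartThenAbort:
		if n < maxRetries {
			return "restart"
		}
		return "teardown"
	default: // Essential
		return "teardown"
	}
}

func main() {
	fmt.Println(onFailure(RestartThenAbort, 1, 3)) // restart
	fmt.Println(onFailure(RestartThenAbort, 3, 3)) // teardown
	fmt.Println(onFailure(Essential, 1, 3))        // teardown
}
```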
Fuck it, just panic
I mean, we’re going down. Things are already dire. Why does it need to be graceful? The panic decision would need to be made at the point where the severity can be determined, though. The database doesn’t know the ramifications of returning an I/O error, but the thing on the other end – the one that tried to save something and got the error – does. If it’s some statistic, maybe just let it slide. If it’s a fileinfo for a newly synced file, we need to abort.
5. ???
I’m not super fond of #1, it’s intrusive and a bit weird – contexts and parameters signal downwards, returns signal upwards, imho, and this turns that around.
#2 and #3 don’t exist today, although #2 could possibly be implemented as a service wrapper that takes some sort of control object like in #1 (maybe the top-level supervisor, for it to stop).