Server outage report

We had an outage affecting a number of services. They all live on a server of mine at one hosting location, and there was a fiber cut to that location. A failure of the server hardware, or of the firewall that protects it, would have had the same effect. Most of the things living on that server are quite non-essential:

  • The forum. I’m sad when it’s down because I enjoy it, but in the end no one gets hurt and files are still synchronized all over the world.
  • The website. We make a less than optimal impression with it down, but again it’s a minor inconvenience.
  • One of the three discovery servers. The discovery service is redundant for this reason, so no worry.
  • The build server. Hampers development, prevents doing an actual release in the proper way while it’s unreachable. We get by.
  • The usage reporting server. We get a blip in the usage reports while it’s down but that’s all.

However, a few things hurt a bit more:

  • The relay registry. Without it, Syncthing clients can’t get the current list of relays, so they run without relaying.
  • The APT repository. Debian users can’t upgrade or do new installs using apt while this is down.
  • The docs site. Users should be able to access the documentation.

This is all hosted on the same box at the same place out of pure convenience, laziness and economy: I don’t pay to spin up more VMs on my own hardware. That obviously won’t cut it going forward, so here’s what I think we should do:

  • The relay registry should be integrated with the discovery service and enjoy the same resiliency. If discovery moves to a DHT or something similar in the future, the relays get that for free as well.
  • The website and APT repository should be moved to a real cloud provider.
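
To illustrate what "the same resiliency" could mean for the relay registry, here is a minimal Python sketch of a client that tries several redundant registries in turn and uses the first one that answers. This is not Syncthing's actual code: the function name `fetch_relay_list`, the `fetch` callback, and the example URLs and relay address are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Iterable, List


def fetch_relay_list(
    registries: Iterable[str],
    fetch: Callable[[str], List[str]],
) -> List[str]:
    """Return the relay list from the first registry that answers.

    `registries` and `fetch` are hypothetical stand-ins: `fetch` is
    whatever function actually talks to one registry and raises an
    OSError (connection refused, timeout, ...) when it is unreachable.
    """
    errors = []
    for url in registries:
        try:
            return fetch(url)
        except OSError as exc:
            # Remember why this one failed and move on to the next.
            errors.append(f"{url}: {exc}")
    raise ConnectionError("no relay registry reachable: " + "; ".join(errors))


# Simulated outage: the first (made-up) registry is down, the second works.
def fake_fetch(url: str) -> List[str]:
    if url == "https://registry-a.example":
        raise ConnectionError("fiber cut")
    return ["relay://198.51.100.7:22067"]


relays = fetch_relay_list(
    ["https://registry-a.example", "https://registry-b.example"],
    fake_fetch,
)
print(relays)  # ['relay://198.51.100.7:22067']
```

With only one registry in the list, the same fiber cut would leave clients without relays, which is the situation described above.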

I’ll probably continue running the forum and, especially, the build server for a while, as these are happiest as fairly beefy VMs that are expensive (for us) to host externally. In the long run this should change too, both to remove a dependency on my hardware and hosting, and to remove a dependency on me personally.

I was worried it could be a DDoS attack.

What would an additional server need? I own a (very) small ISP, so I might be able to offer you a VPS or two in Canada.

I think we should aim for two things: resiliency, so that a single failure doesn’t take out something essential, and a reduced dependency on individual goodwill. We already have resiliency on the discovery service, we certainly won’t reduce it, and we should move the relay registry to the same service. That’s really it as far as essential services are concerned; the other things are conveniences and don’t affect people’s usage of Syncthing as such.

For those conveniences I think the best choice is paid hosting from a well-respected cloud provider. Outages there should be measured in minutes, and there are people looking after things 24/7. I currently provide hosting for these because it’s easy and free and works lovely when it works, but it doesn’t make sense to depend on me for this forever. Likewise, I wouldn’t want to “just” add more volunteers to the mix.
