Infrastructure report (discovery stuff)

I’ve been doing some work on our infrastructure.

Our web services were mostly designed about a decade ago, when containers were a newfangled thing and we had a handful of clients to serve, running on some lovingly handcrafted VMs at DigitalOcean. Some of those things have scaled more gracefully than others… I’ve slightly shifted how we deploy and run stuff in the meantime, and most recently moved our hosting from DigitalOcean to Scaleway, where we currently run almost all our services in a Kubernetes cluster.

DigitalOcean served us well for quite some time, but unfortunately less so after we moved much of our infra into containers in Kubernetes. Their network load balancer had a never-ending stream of issues with our number of active connections (even without moving discovery onto it), and it didn’t support IPv6, with no plans to add it in the foreseeable future. Scaleway has reasonable pricing, and their load balancers support IPv6, so we could actually move discovery into the cluster.

(Aside about pricing: our biggest constraint in choosing a cloud operator, price-wise, is our outgoing bandwidth. Virtual machines cost more or less the same everywhere (within an order of magnitude), but outgoing bandwidth varies quite a lot. We serve about 15 TB of data each month, most of it discovery and relay pool responses. On any of the big clouds this would give us an egress bill of well over 1000 euros per month. Several of the smaller operators have a much more generous allowance where it’s essentially free with the VMs, DigitalOcean and Scaleway being among those.)

Anyway, discovery. The discovery service is our main resource driver.

We get about 3k-4k requests per second in a steady state. You might think a few thousand HTTP requests per second isn’t that much and could in principle be handled on a fancy Raspberry Pi, and you’d probably be right, but our clients are special… “Normal” web requests come from browsers, which open one HTTPS connection and send a hundred requests over it. We only ever get one request per connection, so every request is a TLS handshake, with a somewhat unusual certificate type at that, and (for announces) we must verify the client certificate. This uses a surprising amount of CPU: we currently average about six CPU cores (of some beefy Xeon variant) all the time just for TLS termination. (Compressing responses would make this significantly worse too, so I’ve turned that off for now.)
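To illustrate the shape of that, here’s a minimal Go sketch of per-connection TLS termination with the client certificate requested – not the actual discosrv code; paths, port and handler are placeholders:

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	// Placeholder certificate paths; in reality the server has its own key pair.
	cert, err := tls.LoadX509KeyPair("cert.pem", "key.pem")
	if err != nil {
		log.Fatal(err)
	}

	srv := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			Certificates: []tls.Certificate{cert},
			// Ask for the client's certificate but don't verify it against a CA;
			// device certificates are self-signed, so any verification happens
			// at the application layer instead.
			ClientAuth: tls.RequestClientCert,
		},
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// In this workload every request arrives on its own connection,
			// so the handshake cost above is paid once per request.
			if r.Method == http.MethodPost && len(r.TLS.PeerCertificates) == 0 {
				http.Error(w, "client certificate required for announce", http.StatusForbidden)
				return
			}
			w.WriteHeader(http.StatusNoContent)
		}),
	}
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```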

It’s also an odd workload because it’s very inelastic, which makes it very latency-sensitive. Again, most “normal” web services see some form of backpressure when things are slower than usual – requests depend on each other, so you don’t load the images on a page before you’ve actually received the HTML that refers to them, and so on. Discovery isn’t like that at all, since each request is a singleton from a separate client. We get a minimum of 3000 new connections every second whether we had time to process the ones that came in the previous second or not. If we only returned an answer to 1000 of them, we now have 5000 open connections that need handling the next second. Handle 1000 of those and we have 7000 connections awaiting a response the second after, and so on.
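To see how quickly that snowballs, here’s the same arithmetic as a toy loop (the 3000/1000 figures are just the example numbers from above):

```go
package main

import "fmt"

// A toy version of the backlog arithmetic above: arrivals keep coming at a
// fixed rate regardless of how many responses we managed to send.
func main() {
	const arrivals = 3000 // new connections per second
	const served = 1000   // responses per second during a hypothetical stall
	open := 0
	for sec := 1; sec <= 4; sec++ {
		open += arrivals
		fmt.Printf("second %d: %d connections awaiting a response\n", sec, open)
		open -= served
	}
	// Prints 3000, 5000, 7000, 9000 – the backlog only ever grows until we
	// can serve at least as fast as connections arrive.
}
```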

This makes us very fragile to any kind of pause in processing: leveldb compaction and even GC pauses would lead to short stalls that quickly overwhelmed our Traefik proxies or load balancers as connections queued up. Recovering from this wasn’t always trivial, since we essentially end up in a DoS situation with clients retrying timed-out requests.

(System effect aside: another effect of client retries, like re-discoveries when a device is simply not found, is that any minor outage rings like a bell in the request graphs. We smooth this out over time by returning a randomly fuzzed Retry-After header for Not Found responses, but any blip in responses is still visible for hours.)
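The idea in sketch form – the handler and the exact retry window here are made up for illustration, not the real values:

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"strconv"
)

// notFound answers lookups for unknown devices with a randomized Retry-After,
// so retries spread out instead of arriving as one synchronized wave.
// The 30–60 minute window is an assumption for the example, not the real value.
func notFound(w http.ResponseWriter, r *http.Request) {
	retryAfter := 1800 + rand.Intn(1800)
	w.Header().Set("Retry-After", strconv.Itoa(retryAfter))
	http.Error(w, "device not found", http.StatusNotFound)
}

func main() {
	http.HandleFunc("/", notFound)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```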

I’ve put a fair amount of engineering work into improving the discovery server for this. I experimented with database settings, tried switching the database from leveldb to Pebble (which made things worse), and finally settled on using an in-memory map – this was the only thing with consistent enough latency… Even then, a fair amount of tuning and profiling was required to get allocations low enough – and CPU-bound goroutines short enough – that GC pauses didn’t kill us. The database is periodically flushed to permanent storage and read back from there on startup.
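A heavily simplified sketch of that shape of solution, assuming gob-on-disk as the flush format; the types, names and flush interval are illustrative, not the real implementation:

```go
package main

import (
	"encoding/gob"
	"log"
	"os"
	"sync"
	"time"
)

// record and store are stand-ins for the real data model.
type record struct {
	Addresses []string
	Seen      time.Time
}

type store struct {
	mut  sync.RWMutex
	recs map[string]record // keyed by device ID
}

// flush writes the whole map to disk; load reads it back on startup.
func (s *store) flush(path string) error {
	s.mut.RLock()
	defer s.mut.RUnlock()
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewEncoder(f).Encode(s.recs)
}

func (s *store) load(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	s.mut.Lock()
	defer s.mut.Unlock()
	return gob.NewDecoder(f).Decode(&s.recs)
}

func main() {
	s := &store{recs: make(map[string]record)}
	if err := s.load("discovery.db"); err != nil {
		log.Println("starting empty:", err) // first start, or placeholder file missing
	}
	for range time.Tick(time.Minute) { // periodic flush; interval is illustrative
		if err := s.flush("discovery.db"); err != nil {
			log.Println("flush:", err)
		}
	}
}
```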

There are three instances of the discovery server binary, all of which have the full set of data to be able to answer requests. Announcements are replicated between them using a message queue.
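Schematically it looks something like this, with an in-process fan-out standing in for the message queue (types and values are made up):

```go
package main

import (
	"fmt"
	"sync"
)

// announcement is a stand-in for the real message format.
type announcement struct {
	DeviceID  string
	Addresses []string
}

// instance holds the full data set, like each discovery server replica does.
type instance struct {
	mut  sync.Mutex
	data map[string][]string
}

func (i *instance) apply(a announcement) {
	i.mut.Lock()
	defer i.mut.Unlock()
	i.data[a.DeviceID] = a.Addresses
}

func main() {
	// Three replicas; in the real setup a message queue carries announcements
	// between separate processes, here a simple fan-out stands in for it.
	replicas := []*instance{
		{data: map[string][]string{}},
		{data: map[string][]string{}},
		{data: map[string][]string{}},
	}
	broadcast := func(a announcement) {
		for _, r := range replicas {
			r.apply(a)
		}
	}

	broadcast(announcement{DeviceID: "ABC123", Addresses: []string{"tcp://198.51.100.1:22000"}})
	fmt.Println(replicas[2].data["ABC123"]) // any replica can answer the lookup
}
```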

The Traefik proxies are assigned the bulk of the resources on each cluster node and are configured to return 429s when requests queue up towards the discovery server (which they don’t, currently ;).

This has been stable for a little while now, enough that I’m confident it’ll work for us. Total resource usage is about 8 CPU cores and 6 GiB of RAM, but RAM usage spikes when connections spike for whatever reason. This is down from about 26 CPU cores at its worst prior to the latest rewrites; orange line here is Traefik CPU, yellow is discosrv. (The metrics gap around the beginning of the month is me migrating things back and forth…)

My code changes for these things generally happen on the infrastructure branch, and I merge them after stabilisation and cleanup. I haven’t asked for review on them, for all the practical reasons you might assume, but they’re up there for anyone who wants to take a look. There’s another fairly large change coming which I’ll write up shortly, and which I will put through review. :slight_smile:

9 Likes

No real feedback from me, but I just wanted to say thank you for all the work and improvements to keep things running stably and efficiently.

I also want to point out that the number of active devices was about 500,000 last year. At the moment, it’s almost 1,000,000 :face_with_hand_over_mouth:.

7 Likes

Yeah, hence this stuff :slight_smile: scaling… Now if someone could actually implement a working DHT-based discovery, that’d be cool.

4 Likes

Just a thought, but some of the big corps try to keep TLS handshake costs down by deploying widespread TLS session resumption support. A resumption handshake cuts down on resources quite a bit, namely by skipping the asymmetric crypto*.

TLSv1.2 supported two mechanisms for this (session IDs + session tickets). Session IDs were okay security-wise, but difficult to distribute across instances in a load-balanced/distributed server environment. TLSv1.2 session tickets could be synchronized across servers if they all shared a single secret, but had pretty bad security properties. TLSv1.3 built upon the session ticket approach, but replaced it with a PSK-as-a-ticket system that also supports PFS, so it’s much better overall. The secrets used to generate the PSKs server-side can also be shared across multiple instances in a distributed system (very similar to how TLSv1.2 did it).

According to basic TLS debugging tools, your Traefik instances already do TLSv1.2 session tickets + TLSv1.3 PSK resumption handshakes, with an advertised max lifetime of one week (these all look like Go defaults; Go rotates the ticket secrets after one week by default).

Currently, syncthing clients don’t seem to do any resumption, as the Go TLS config used for announcements doesn’t set a session cache, resulting in no resumption. Enabling that client-side (assuming it also works server-side) with reasonable lifetimes (>= 1 hour) could cut down on TLS resources quite a bit without sacrificing security – the TLSv1.3 PSKs have reasonable security properties, including PFS. Performance-wise it reduces the handshake by two P-384 signatures. It also saves some bandwidth, as server-side certificates don’t get re-transferred (they should still be available in code, as the session cache remembers them). [Client-side certificates are implicitly re-transferred via the PSK/session ticket.]

*Except ECDHE, which is roughly 1/10 of the running cost of ECDSA according to my benchmarks. x25519 is a reasonably secure and fast choice here.
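For what it’s worth, a minimal sketch of what enabling that could look like on the client side in Go – not actual syncthing code; the cache size and certificate paths are placeholders:

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

// newAnnounceClient returns an HTTP client whose TLS config carries both the
// device certificate and a session cache, so the Go TLS stack can resume
// sessions (TLS 1.2 tickets / TLS 1.3 PSKs) instead of doing a full handshake
// for every announce. Cache size and certificate paths are placeholders.
func newAnnounceClient(cert tls.Certificate) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates:       []tls.Certificate{cert},
				ClientSessionCache: tls.NewLRUClientSessionCache(16),
			},
		},
	}
}

func main() {
	cert, err := tls.LoadX509KeyPair("cert.pem", "key.pem")
	if err != nil {
		log.Fatal(err)
	}
	client := newAnnounceClient(cert)
	_ = client // use this client for announce/lookup requests
}
```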

5 Likes

That’s a very interesting observation! It may be that Traefik doesn’t actually support this in a way that’s useful to us though :frowning:

(Or it could be marginally useful: each connection currently has a 1/3 chance of reaching the same instance as before, so it’d cut maybe 30% of the TLS overhead…)

3 Likes

I wonder how leveldb would compare to an external KV store like redis?

I guess this does scale as you add more replicas serving the responses?

1 Like

If TLS termination is the limiting factor, a shared KV store might help :man_shrugging:

The two are not related – TLS is terminated by the frontend reverse proxies before the request hits discosrv, and anyway, as I noted, the discovery servers don’t use leveldb, they use an in-memory map. There could be a Redis behind the discovery servers, but then that becomes the limiting factor that needs to scale and to avoid being a single point of failure.