Relay network capacity?

Looking at data.syncthing.net / relays.syncthing.net, I find myself wondering: does Syncthing need more relays? More than a simple “yes/no”, I’m actually interested in how one would figure that out for themselves. I wonder if it’d be worth adding a graph of relay network capacity vs current usage?

(It seems that there’s no current measurement for capacity on the stats page, and most relays seem to be configured as ‘unlimited’, which doesn’t help - maybe ‘peak observed traffic’ would be a good approximation, so we could graph sum(relay.peak_kbps for relay in active_relays) vs sum(relay.current_kbps for relay in active_relays)?)
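(A sketch of how one might collect that graph data, assuming the pool server’s public endpoint and the stats fields that show up later in this thread; since the endpoint has no peak field, the script just tracks the highest value it has itself seen per relay:)

import json
import time
import urllib.request

ENDPOINT = "https://relays.syncthing.net/endpoint"  # public relay pool listing
peaks = {}  # per-relay peak observed throughput in kbps, keyed by relay URL

while True:
    with urllib.request.urlopen(ENDPOINT) as resp:
        relays = json.load(resp)["relays"]
    current_kbps = 0.0
    for r in relays:
        # kbps10s1m5m15m30m60m[0] is the most recent (10 s) throughput
        # sample; stats may be null for relays that haven't reported yet
        samples = (r.get("stats") or {}).get("kbps10s1m5m15m30m60m") or [0]
        kbps = samples[0] or 0
        url = r.get("url", "?")  # relay URI; field name is an assumption
        peaks[url] = max(peaks.get(url, 0), kbps)
        current_kbps += kbps
    # CSV: timestamp, sum of current throughput, sum of peaks observed (Mbps)
    print("%s,%.1f,%.1f" % (time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
                            current_kbps / 1024, sum(peaks.values()) / 1024))
    time.sleep(60)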


This has actually become doable in the last few days, since Audrius added more metrics to the relay pool server.

Off the top of my head I’d guess that the traffic will grow to swallow all available bandwidth no matter how much is added (within reason, of course), but the question is: does that matter? I think it’s understood that connections via relays are slow and only used as a last resort. Whether it syncs at 1 Mbps or 10 Mbps doesn’t really matter much from that point of view.

Does that matter?

tl;dr: It is mostly for my own curiosity and whimsy, being interesting rather than important :slight_smile:

The backstory of how I ended up thinking about this:

  • Relays are sometimes slow (I had a sync going at ~0.2 Mbps)
  • Which led me to attempt to set up a private relay, on the baseless assumption that the public ones were overloaded (it would be interesting to know whether the whole relay network is overloaded, or whether my client just randomly selected a slow one)
  • While I was trying and failing to get a private relay working, relaying via public servers became fast (250 Mbps over wifi - even faster than a 1:1 ethernet connection between the two laptops :open_mouth: I assume the ethernet was bottlenecked by the 100 Mbps USB adapter…)
  • Since I already had strelay installed on my server, I decided to switch it to public mode and leave it running (the server is mostly doing I/O and CPU work, the unmetered 100 Mbps network is idle, might as well donate that to a good cause)
  • Over the past week it’s been relaying 10-20 Mbps, nowhere near maxed out (this makes me wonder whether the relay network has a ton of spare capacity, or whether there’s some other bottleneck stopping my server from reaching 100 Mbps)
  • I’m now wondering if there would be any benefit to installing strelay on my other servers, but I can’t come to a conclusion, since I don’t know whether my current server is genuinely idle or bottlenecked somewhere

This is just luck, landing on a faster relay. These are last-resort connections; we don’t really care how slow or fast they are.

It seems sad not to care at all; even just for the sake of making evidence-based engineering decisions, if I were running the network I would want to know whether it actually IS maxed out, because our gut feelings and the tiny bit of data presented so far seem to directly oppose each other…

I guess the answer to “would it be useful if I added more relays?” is that I should go along with the apathy approach and just leave the existing instance running, not bothering to add any more or to kill the existing one :stuck_out_tongue:

It’s hard to say if it’s maxed out, as it’s all relative to where the destination is, and there isn’t a benchmark between every possible pair of clients for how fat the pipe is. It might be maxed out between some points but not others, etc.

Can you tell if the internet is maxed out? It’s usually relative to the two points connecting; in some places it is, in some it isn’t.

Yeah, so the thing is that we’re not running this as our own network, or we would know whether it was maxed out. :slight_smile: As Audrius says, there are likely lots of bottlenecks involved, so if you have a fast relay the limits are probably on the client or elsewhere.

I would say that if you run a relay, and it is getting used, then you are definitely adding value.

I’m scratching my head a bit about looking at relay servers as a “last resort” - that assumes ‘direct connection’ is supposed to be the nominal mode. But as I see it, the majority of users are nowhere near a position to enable direct connections unless their nodes are inside their own local network. They are either unable, or not in a position, to open or set up ports on whatever is facing the internet. As for myself, I’m perfectly capable of setting up routers etc., but all the devices that I’m currently syncing are most of the time in places where I have no access to that kind of configuration at all. Including, for example, my work location.

tl;dr - I, and I believe many if not most users, depend on relay servers. That’s also what makes Syncthing plug & play.


I don’t think this is true. Most home routers support UPnP, and so port forwarding will get set up automatically. Remember that only one side of the connection needs port forwarding, so someone with a device at work and one at home will probably get a direct connection as well.

Syncthing was popular before relaying was introduced. Relaying was added to support a small number of users who were genuinely unable to set up port forwarding, and to make the experience for new users better (“It works but slowly, let me improve this” is better than “It doesn’t work at all”).

Where I live (two vastly different locations, half a globe in between), double NATing is common: the ISP’s connection comes with its own NATing router, and I add my own home router behind that one (never mind that e.g. ASUS recommends disabling UPnP on their routers, as they consider it a security risk). Be that as it may, as soon as I bring my computer - which is a notebook these days - out of home, I’m on someone else’s network.

In fact I’m almost always on someone else’s network, and that’s true for the devices I sync with too. And that’s why I use syncthing in the first place - because it relays.

As I said in some other thread: for us Syncthing is a game-changer. And that’s only because of its relay capability. Well, ok, also due to a lot of other reasons, like open source, support and nice people. But relaying was the reason I decided to dig deeper into it.

But if you have to rely on the relay connection, you’re lucky if you reach a fast relay by default. So I thought “Hey, let’s contribute!” and added a public 10 GbE unlimited relay… with the result that I had tons of connections from China to China, but my own clients went over a 150 kBps .cz relay. Thanks to Syncthing’s configurability I was able to point all my clients at my ‘own’ public relay, but other people might still end up on an ISDN relay in Timbuktu. Which is definitely not “better than nothing”.

Don’t get me wrong, I’m not talking about transfer bandwidth! I’m considering response times when adding and syncing folders. I don’t care if a 150k-file/500 GB song collection only gets synced after a week or two. Let it be three, I don’t care. That’s the internet. But adding a remote device or sharing a new folder takes “ages” via a slow relay to show up on the other side. That’s what’s degrading the user experience.

(I feel I should prefix this by saying I love you all and I’m not trying to be argumentative, I’ve just accidentally nerd-sniped myself because the more I look at this the more interesting the puzzle becomes :slight_smile: )

I wonder if we can get some stats on how many clients use relays, rather than speculating based on anecdotes? data.syncthing.net lists how many clients have relaying /enabled/ (94% of them, no surprise since that’s the default), but no stats on how many are /using/ it…

“maxed out” is relative to two points

I’m thinking of the relay service as a whole rather than any individual connection - like my 100 Mbps server is currently pushing 20 Mbps of relay traffic, which I think contradicts our assumption that “traffic will grow to swallow all available bandwidth no matter how much is added”. Even if one client is limited to 1 Mbps by their home ADSL connection, if there are thousands of clients queueing up to use relays, my server should be pushing 100 Mbps.

My current feeling is:

  • relaying is very rarely used as a percentage of all connections [1]
  • currently the relays have a combined capacity of 10 Gbps [2]
  • currently the relays have a combined usage of 900 Mbps [3]

[1] Because UPnP is the standard on all home internet connections that I’ve personally experienced.

[2] Assuming that all 112 relays are on 100 Mbps connections, minus ~10% on the assumption that most people who run relays do so on servers which have minimal but non-zero traffic of their own to deal with: 112 × 100 Mbps × 0.9 ≈ 10 Gbps.

[3] curl https://relays.syncthing.net/endpoint | jq '.relays[].stats.kbps10s1m5m15m30m60m[0]' | grep -v null | awk '{ sum+=$1 } END { print sum/1024 }'
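(For what it’s worth, the guess in [2] can also be recomputed against the live relay count, in the same style as [3]; the 100 Mbps × 90% per-relay figure is still just my assumption. This prints the assumed capacity in Mbps:)

curl -s https://relays.syncthing.net/endpoint | jq '.relays | length * 100 * 0.9'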

Although, since I listed that first point, I wondered whether I could do any better than “based on personal experience, my feeling is a small percentage”, so let’s see what I can figure out on the back of an envelope…

curl https://relays.syncthing.net/endpoint | jq '.relays[].stats.numActiveSessions' | grep -v null | awk '{ sum+=$1} END {print sum}'

23,000 active relay sessions right now, compared to 30,000 active users per day. I assume I must be wrong about something here, because >2/3 of daily users being connected to a relay right now seems off…

netstat -tpn | grep strelay | grep EST | sed 's/.*46.105.126.214:22067 \(.*\) ESTABLISHED.*/\1/' | cut -d ":" -f 1 | sort | uniq | wc -l

^ Lists 2000 unique IPs currently connected to my one relay.

Is it the case that every client keeps an open connection to at least one (or more?) relays, even if it isn’t using it? That would explain why “number of clients” and “number of clients connected to a relay” appear roughly equal…

Looking at my own syncthing instance running on a server with a public IP, which should never need to connect to a relay:

netstat -tpn | grep syncthing

^ Based on that output, it is connected to two clients and one relay, which seems to support “every client connects to one relay even if it doesn’t need to”.

So let’s try looking at “how many connections are actively transferring data” rather than “how many connections exist”:

cat tmux-scrollback.log | grep -v accept | grep -oE 'to [^:]+' | sort | uniq | wc -l

^ For my one relay, I see 500 unique IPs actively receiving data in the previous 2 minutes (which is as far back as the logs go, since I’m running it in tmux with 50k lines of scrollback :stuck_out_tongue: ). That seems reasonable for 2000 connected IPs, and makes it seem like maybe a large percentage of clients (>50% ?!) are actively transferring data via the relays?

Remember that relays are chosen based on latency, not which ones are closest geographically.

And any relay is “better than nothing”, surely?

Yes, you are right. But as a user it would be nice to be warned if one has to wait several minutes. I could peek at the logs, but that doesn’t help if they’re not interpreted correctly. So stopping/restarting an app/program because one thinks it ‘hangs’, since ‘normally’ it takes ‘seconds’, is almost the same as ‘nothing’.

Again: yes, it should work, even if slow. Better than nothing.

“Connection Group Features” -> “Transport (v3)” has it: ~16% are connected via relay.


So I made a graph of “peak observed capacity”[1] vs “currently used capacity”, and it seems to confirm that the relay network has way more capacity than needed (either that, or the relay selection algorithm is sub-optimal, and clients keep connecting to a few overloaded relays while leaving the majority idle? I wonder how to test that hypothesis…)

[1] Not maximum capacity - which is why it grows over time rather than being a horizontal line.


In many cases there are two rate limits; one per client, and one total for the relay. So a client might not see a difference in performance between being alone on a relay and being one of a hundred active clients on it, but of course the peak total will be different. (This is a good thing, I think, as it makes the performance more even and predictable.)
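(For relay operators, these are separate knobs in strelaysrv; if I remember the flags right, both -per-session-rate and -global-rate take bytes/s, so a sketch like the following would cap each session at ~4 Mbps while capping the relay as a whole at ~80 Mbps:)

strelaysrv -per-session-rate=500000 -global-rate=10000000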

Again, the relay to connect to is chosen randomly among all relays with latency < 50 ms, and then that connection stays until it breaks. That’s certainly not optimal if you want to maximize throughput.

Deciding whom to connect to based on stats also has two problems. First, relays need to be honest: otherwise a “rogue relay” can “attack” the relay network by pretending to have great stats, binding lots of connections to itself without giving any throughput. Second, even if you can ensure that relays are honest, I don’t think the current stats are very good for optimizing throughput and load balancing: many relays don’t set any explicit limits but probably still have limitations on their servers, and current or past data transferred isn’t a meaningful measure of available throughput - a very performant server may just have a lot of inactive clients (that argument may be refutable if the number of connections to a server is high enough, but it’s still problematic for e.g. a new server).

TLDR: I don’t think there is anything wrong with improving relay selection, even if it is intended as a last-resort connection. Someone just needs to come up with a simple and stable (or close enough) method to do it.


FWIW, a method I’ve found useful in situations like this is “pick two at random, try using both, and discard whichever one is slower” - then if 5% of relays are slow for whatever reason, the odds of getting stuck on a slow relay drop from 5% to ~0.25% (0.05² = 0.0025).
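(A minimal sketch of that idea, assuming a hypothetical measure_throughput() probe; per the posts above, Syncthing’s real selection is random among relays under the latency cutoff, so this is just an illustration:)

import random

def pick_relay(relays, measure_throughput):
    # "Power of two choices": sample two candidate relays at random...
    a, b = random.sample(relays, 2)
    # ...briefly probe both and keep the faster one. If slow relays occur
    # with probability p, both picks are slow with probability p**2
    # (0.05**2 = 0.0025, i.e. ~0.25%).
    return a if measure_throughput(a) >= measure_throughput(b) else b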

It’s a marginal improvement for a corner case of a last-resort option, so it’s not exactly high priority, but the suggestion is there if someone is bored and wants to work on it anyway :slight_smile:

There’s a TOCTOU issue here as well. You might select a relay today, idle on it for three days, and only then get a connection through it. The performance situation of the relay can change on a minute-by-minute basis.