Unreachable global discovery server slows down connection loop

I’ve recently been investigating why my devices seem to take ages to find each other over the Internet, or even locally in the case of Android devices with broken local discovery. I think I’ve found the culprit, and I suspect the connection loop handling should be adjusted to avoid such delays.

I once configured a discovery server on a company LAN to rendezvous with other devices on that network without internet access (blocked by a firewall for that network segment). I added it to each of those devices, as well as to my laptop (my daily-use developer machine). While on that company LAN, it worked perfectly fine. These days I’m seldom connected to that company network; I work from a different location and usually don’t have the VPN active, so the custom global discovery server address is usually unreachable.

There are 40 remote devices configured on that laptop, and judging from the logs it takes about half a minute per device before it notices the discovery server isn’t available. Lookups on the other (default) global discovery servers seem to be interspersed, but I usually only see them mentioned in the discover debug facility when that lookup fails as well (e.g. with a 404 Not Found error). I haven’t clearly identified a pattern of what happens in which order, but it appears to wait for each remote device to fail the discovery server lookup once, and only then move on to the next round of connection attempts, this time asking a different server.

I’ve already tried reducing the reconnectionIntervalS option from 60 to 10 seconds, which helped a bit, but the whole dance still easily takes 15 to 20 minutes before some connections are made. Even the one remote device which has an explicit tcp://... address configured in addition to dynamic doesn’t connect any faster.
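For reference, this is the option I changed. If I remember the layout correctly, it lives under `<options>` in Syncthing’s config.xml (shown here with the reduced value):

```xml
<configuration>
    <options>
        <!-- Seconds between reconnection attempts; the default is 60 -->
        <reconnectionIntervalS>10</reconnectionIntervalS>
    </options>
</configuration>
```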

Removing the “broken” / unreachable discovery server from the config vastly speeds up the initial connections.

So apparently some things are serialized and delayed by waiting for timeout errors, where the discovery lookups should be handled in parallel. I haven’t grokked the code well enough yet to pinpoint it myself, so I’m hoping someone with better knowledge of the “connection loop” concept can explain what’s going on. I’d even classify this as a bug, because the intention is to attempt many connections in parallel, possibly replacing “worse” ones regularly. If one of several discovery servers is unreachable, that shouldn’t significantly slow down the process; instead it should fail over to (or query redundantly) the other discovery servers.

From memory, the way it works is that it collects addresses in step one, then dials them concurrently (-ish; there’s some prioritization and whatnot going on) in step two. I don’t think it was anticipated that the address collection step would be time-consuming. Probably the global discovery instance should be smart enough to tag the server as down after a few failed attempts and then return “unavailable” quickly for a while.
