I’ve recently been looking into why my devices seem to take ages to find each other over the Internet, or even locally in the case of Android devices with broken local discovery. I think I have found the culprit, and I suppose the connection loop handling should be adjusted to avoid such delays.
I once configured a discovery server on a company LAN so that devices on that network could rendezvous without internet access (which the firewall blocks for that network segment). I added it to each of those devices, as well as to my laptop (a daily-use developer machine). While on that company LAN, it worked perfectly fine. These days I’m rarely connected to that company network: I work from a different location and usually don’t have the VPN active. So the custom global discovery server address is usually unreachable.
There are 40 remote devices configured on that laptop and, judging from the logs, it takes about half a minute per device before it notices the discovery server isn’t available. Lookups on the other (default) global discovery servers seem to be interspersed, but I usually only see them mentioned in the discover debug facility when the lookup fails as well (e.g. with a 404 Not Found error). I haven’t clearly identified the order in which things happen, but it appears to wait for each remote device to fail the discovery server lookup once, and only then start the next round of connection attempts, this time asking a different server.
I’ve already tried reducing the reconnectionIntervalS option from 60 to 10 seconds, which helped a bit, but the whole dance still easily takes 15 to 20 minutes before some connections are made. Even the one remote device that has an explicit tcp://... address set in addition to dynamic doesn’t connect any faster.
Removing the “broken” / unreachable discovery server from the config vastly speeds up the initial connections.
So apparently some things are serialized and delayed by waiting for timeout errors, whereas the discovery lookups should be handled in parallel. I haven’t grokked the code well enough yet to pinpoint it myself, so I’m hoping someone with better knowledge of the “connection loop” concept can explain what’s going on. I’d even classify this as a bug, because the intention is to attempt many connections in parallel, possibly replacing “worse” ones regularly. If any one of several discovery servers is unreachable, that shouldn’t significantly slow down the process; it should fail over to, or connect redundantly to, the other discovery servers.