Issues with QUIC transport

QUIC is not actively used between my nodes, although it should technically be possible. Here is some more information:

  • Node 1 is behind a classic home-grade router, sometimes detected as:

    quic_listen.go:54: INFO: quic://0.0.0.0:22000 detected NAT type: Full cone NAT

    and most times detected as:

    quic_listen.go:54: INFO: quic://0.0.0.0:22000 detected NAT type: Port restricted NAT

    In any case, I get:

    quic_listen.go:73: INFO: quic://0.0.0.0:22000 resolved external address quic://my_real_external_ip:61930 (via stun.syncthing.net:3478)

  • Node 2 is non-NATed and behind a connection-tracking firewall which, as expected, drops untracked UDP, i.e. hole punching is required. The code detects this as a symmetric UDP firewall, which is correct (i.e. non-NATed). The constant in the code is misnamed, though (NATSymmetricUDPFirewall), and it’s missing from isCurrentNATTypePunchable. Adding it there triggers STUN keepalives as expected.
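To illustrate the kind of change I mean, here is a minimal, self-contained sketch. It is not the actual Syncthing code: the NATType type and its constants are local stand-ins named after the ones mentioned in this thread, isPunchable is a hypothetical free function standing in for isCurrentNATTypePunchable, and which other NAT types belong in the punchable set is my assumption.

```go
package main

import "fmt"

// NATType is a local stand-in for the NAT classification reported via STUN.
// The names follow the ones used in this thread; the values are made up.
type NATType int

const (
	NATUnknown NATType = iota
	NATFull
	NATRestricted
	NATPortRestricted
	NATSymmetric
	NATSymmetricUDPFirewall
)

// isPunchable is a hypothetical stand-in for isCurrentNATTypePunchable.
// The only point of the sketch is the last entry in the case list: a
// symmetric UDP firewall (no NAT, but connection tracking) is treated as
// punchable, so that STUN keepalives are sent for it too.
func isPunchable(t NATType) bool {
	switch t {
	case NATFull, NATRestricted, NATPortRestricted, NATSymmetricUDPFirewall:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println("port restricted punchable:", isPunchable(NATPortRestricted))
	fmt.Println("symmetric UDP firewall punchable:", isPunchable(NATSymmetricUDPFirewall))
	fmt.Println("symmetric NAT punchable:", isPunchable(NATSymmetric))
}
```

In the real code the check is described as an OR over the detected type; a switch is used here only for readability.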

Still, in no case do I get connections other than via relays.

Logs from node 1 trying to reach node 2 contain:

structs.go:220: DEBUG: dialing my_node_UUID quic://global_ip_address_of_node2:22000 error: context deadline exceeded

Node 2 trying to reach node 1 fails for a stupid reason. Node 1 is NATed, as mentioned. Another node (node 3) in the same private network is directly reachable via tcp://my_real_external_ip:22000 (port forwarding). When node 2 tries to contact node 1, it also tries tcp://my_real_external_ip:22000, believes the connection is successful, and only then realizes that it has actually contacted node 3 again. It drops the extra connection and never actually tries QUIC to contact node 1.

Forcing node 2 to use quic://my_real_external_ip:61930 to contact node 1 actually works fine(!), but of course is not maintainable, since the ports are random. In this case, node 1 can now also contact node 2 (QUIC client/server relationship), but this should technically also work vice versa.

So I observe a series of issues:

  • NAT detection seems to flap between full cone and port restricted for me.
  • If two nodes are in the same private network, both advertising tcp://my_real_external_ip:22000, and one of them is directly reachable (e.g. through a port forward), the other node can never be reached directly, e.g. via QUIC, since Syncthing gives up after the first “successful” connection even if it reaches a different node than the one that was targeted.
  • NATSymmetricUDPFirewall is not a NAT (misnamed).
  • NATSymmetricUDPFirewall is not considered punchable (but it is!).
  • I encounter error: context deadline exceeded when trying to contact a node behind a symmetric UDP firewall via QUIC.

Now that things are in the tracking tables of all firewalls from my tests, it seems to work in both directions, even when going back to dynamic addresses, and even when dropping the patch again that makes NATSymmetricUDPFirewall punchable. It will probably break down overnight when the tracking entries expire.

So probably, fixing the issues that:

  • NATSymmetricUDPFirewall is missing from the list of punchable things
  • a successful connection arriving at a wrong target node prevents fallback to other transports

would already improve things considerably (and at least fix my setup, even though I don’t get why it did not work the other way round initially).

If you have two devices behind the same NAT, you should change the listening port of one of them to something other than 22000 so they don’t both try to expose the same port.

I agree this would solve one part of the issue. However, I’d consider this a workaround at best - there are two issues with this approach:

  • It would mean that any mobile Syncthing-enabled device a user owns must be configured to a unique listening port (since at some point, they may be behind the same NAT temporarily).
  • The approach fails if there are multiple users behind the same NAT, and they can’t talk to each other.

This can be solved by QUIC, which uses the NAT router to assign different outgoing ports.

This is why I would favour a fix / improvement within Syncthing: Currently, dialing appears to be “successful” if any Syncthing node is reached, even if the dialing targeted a different node, and even if subsequently the connection is dropped again e.g. because there already is a connection to the very same node. This prevents the fallback to different transports. Could dialing be improved to fail the given transport if the node which is reached is not the node which it wanted to contact, but a different UUID?
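As a rough sketch of the behaviour I am asking for (not how Syncthing's dialer is actually structured): try each address in turn, and treat "reached a device with a different ID" as a failure of that address, so the next one is still tried. DeviceID, dialFunc and dialFirstMatching are hypothetical names, and 203.0.113.7 is a made-up documentation address standing in for my external IP.

```go
package main

import (
	"errors"
	"fmt"
)

// DeviceID is a hypothetical stand-in for a Syncthing device ID.
type DeviceID string

// dialFunc is a hypothetical dialer: it connects to addr and returns the
// ID presented by whichever device answered there.
type dialFunc func(addr string) (DeviceID, error)

var errWrongDevice = errors.New("reached a different device")

// dialFirstMatching tries the addresses in order. An address only counts
// as a success if the device answering there is the one we wanted; a
// connection that lands on another device is treated like any other dial
// error, so the remaining addresses (e.g. QUIC) still get their turn.
func dialFirstMatching(want DeviceID, addrs []string, dial dialFunc) (string, error) {
	lastErr := errors.New("no addresses to dial")
	for _, addr := range addrs {
		got, err := dial(addr)
		if err != nil {
			lastErr = fmt.Errorf("%s: %w", addr, err)
			continue
		}
		if got != want {
			lastErr = fmt.Errorf("%s: %w (got %s)", addr, errWrongDevice, got)
			continue
		}
		return addr, nil
	}
	return "", lastErr
}

func main() {
	// Fake network: the forwarded TCP port leads to node 3, while the
	// STUN-resolved QUIC port leads to node 1.
	reachable := map[string]DeviceID{
		"tcp://203.0.113.7:22000":  "node3",
		"quic://203.0.113.7:61930": "node1",
	}
	dial := func(addr string) (DeviceID, error) {
		id, ok := reachable[addr]
		if !ok {
			return "", errors.New("connection refused")
		}
		return id, nil
	}

	addrs := []string{"tcp://203.0.113.7:22000", "quic://203.0.113.7:61930"}
	addr, err := dialFirstMatching("node1", addrs, dial)
	fmt.Println(addr, err)
}
```

With the fake network above, the TCP address answers as node 3 and is skipped, and the QUIC address is the one that gets used for node 1.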

Note: This might well be one of the major reasons why relays are still used heavily - a common setup for small home use of Syncthing appears to be a Raspberry Pi reachable via port forwarding, plus several clients, all on the default port.

Sure, this can be improved, but I don’t think this has anything to do with the low numbers of QUIC usage.

Having two devices behind a NAT, one forwarded via the NAT, the other one not, is an exception, not the norm.

I would have guessed it is more common, but this is just a guess :wink:.

I can confirm that after choosing a different port for the directly reachable node, part 1 of the issues is resolved - the machines can now talk to each other directly via QUIC, as long as I add NATSymmetricUDPFirewall to the OR in isCurrentNATTypePunchable.

How should we proceed?

  • Should we open an issue about the suggested improvement for dialing (fail if the node to which dialing successfully connected is not the node it wanted to reach, allowing fallbacks to other transports)?
  • Should I open a PR adding NATSymmetricUDPFirewall to the OR in isCurrentNATTypePunchable?
  • In general, should we rename NATSymmetricUDPFirewall to SymmetricUDPFirewall since there is no NAT involved? However, that would require changes both to Syncthing and the go-STUN module.

You should open an issue about dialing checking the device id before assuming success on the connection.

You should open a PR adding it in.

We just copy the names from the underlying library so I don’t think it’s for us to decide the names.

Will do, thanks!

I’ll then also open an issue against the underlying library to ask about the confusing name (and whether they would rename it).

I have multiple devices behind NAT: one is port forwarded, and the other machines previously used a relay. It is too much work to set custom ports and open these ports on the router when the relay works fine.

Since QUIC came out, I noticed that one machine no longer requires a relay connection. Just adding that it seems to work really well.

I also wondered why only one machine could connect via QUIC at a time, but I am not worrying about it. QUIC seems to result in one more direct connection without any configuration.

Thanks for confirming my setup is not so rare :wink:. I opened the issue here:

and once this is solved, it should also help your case without any necessary manual configuration.

For reference, the other things discussed here are the already merged PR:

(thanks, @AudriusButkevicius !) and the issue filed against go-stun:

@schnappi FWIW, you should get all devices to use direct connections already now if you configure the single machine that has a port forward to use a non-default port (and then forward that non-default port), leaving all other devices at default ports.

This should make all hosts outside of the NAT fail when connecting to your external IP and standard port (trying to reach devices behind the NAT), so they should use QUIC. After the issue is fixed, it should work without any changes IIUC :wink:.

I reported the exact same thing some time ago ([Bug?] Not all addresses are dialed). For me it was just a test (all my devices can use TCP), so I did not care too much; I mainly wanted to see how well QUIC works :wink:

Really seems it’s not so rare, then :wink:. The PR by @AudriusButkevicius here:

fixes things perfectly for me, now all addresses are dialed and QUIC is successfully used :smile:.