Node 2 is not NATed but sits behind a connection-tracking firewall which, as expected, drops untracked UDP, i.e. hole punching is required. The code detects that as a symmetric UDP firewall, which is correct (i.e. non-NATed). The constant in the code is misnamed, though (NATSymmetricUDPFirewall), and it's missing from isCurrentNATTypePunchable. Adding it triggers STUN keepalives as expected.
Still, in no case do I get connections other than via relays.
Node 2 trying to reach node 1 fails for a stupid reason. Node 1 is NATed, as mentioned. Another node (node 3) is directly reachable via tcp://my_real_external_ip:22000 (port forwarding).
When node 2 tries to contact node 1, it also tries tcp://my_real_external_ip:22000, believes the connection is successful, and only then realizes it has actually reached node 3 again. It drops the extra connection and never actually tries QUIC to contact node 1.
Forcing node 2 to use quic://my_real_external_ip:61930 to contact node 1 actually works fine(!), but of course is not maintainable, since ports are random.
In this case, node 1 can now also contact node 2 (QUIC client/server relationship), but technically this should work the other way round as well.
So I observe a series of issues:
NAT detection seems to flap between full cone and port restricted for me.
If two nodes are in the same private network advertising tcp://my_real_external_ip:22000 and one of them is directly reachable (e.g. through port forward), the other node will never be able to be directly reached e.g. via QUIC, since Syncthing gives up after the first “successful” connection even if it reaches a different node than what was targeted.
NATSymmetricUDPFirewall is not a NAT (misnamed).
NATSymmetricUDPFirewall is not considered punchable (but it is!).
I encounter error: context deadline exceeded when trying to contact a node behind a symmetric UDP firewall via QUIC.
Now that my tests have created entries in the tracking tables of all firewalls, it seems to work in both directions, even after going back to dynamic, and even after dropping the patch that makes NATSymmetricUDPFirewall punchable. It will probably break down overnight when the tracking entries expire.
So probably, fixing the issues that:
NATSymmetricUDPFirewall is missing from the list of punchable things
a successful connection arriving at a wrong target node prevents fallback to other transports
would already improve things heavily (and at least fix my setup even though I don’t get why it did not work the other way initially).
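To make the first fix concrete, here is a minimal, self-contained sketch of what the punchability check could look like with the extra case added. The constant names follow the ones mentioned above, but the `NATType` enum and the `isPunchable` helper are local stand-ins written for illustration, and the set of types already treated as punchable (full cone, restricted, port restricted) is my assumption, not copied from Syncthing's actual code.

```go
package main

import "fmt"

// NATType is a local stand-in for the STUN library's NAT type enum
// (hypothetical definition, for illustration only).
type NATType int

const (
	NATUnknown NATType = iota
	NATNone
	NATFull
	NATRestricted
	NATPortRestricted
	NATSymmetric
	NATSymmetricUDPFirewall
)

// isPunchable reports whether hole punching is expected to work for the
// detected type. The first three cases are assumed to be the existing ones;
// NATSymmetricUDPFirewall is the proposed addition.
func isPunchable(t NATType) bool {
	switch t {
	case NATFull, NATRestricted, NATPortRestricted, NATSymmetricUDPFirewall:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println("symmetric UDP firewall punchable:", isPunchable(NATSymmetricUDPFirewall))
	fmt.Println("symmetric NAT punchable:", isPunchable(NATSymmetric))
}
```

A symmetric NAT stays non-punchable in this sketch because its external port changes per destination, while a symmetric UDP firewall keeps the port and only needs an outgoing packet to open the tracking entry.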
I agree this would solve one part of the issue. However, I’d consider this a workaround at best - there are two issues with this approach:
It would mean that any mobile Syncthing-enabled device a user owns must be configured to a unique listening port (since at some point, they may be behind the same NAT temporarily).
The approach fails if there are multiple users behind the same NAT, and they can’t talk to each other.
This can be solved by QUIC, which uses the NAT router to assign different outgoing ports.
This is why I would favour a fix / improvement within Syncthing: Currently, dialing appears to be “successful” if any Syncthing node is reached, even if the dialing targeted a different node, and even if subsequently the connection is dropped again e.g. because there already is a connection to the very same node.
This prevents the fallback to different transports. Could dialing be improved to fail the given transport if the node which is reached is not the node which it wanted to contact, but a different UUID?
Note: This might as well be one of the major reasons why relays are still used heavily - a common setup for small home use of Syncthing appears to be a Raspberry Pi reachable with port forwarding, and then several clients, all on default port.
I would have guessed it is more common, but this is just a guess.
I can confirm that after choosing a different port for the directly reachable node, part 1 of the issues resolved - now machines can talk to each other directly via QUIC, as long as I add NATSymmetricUDPFirewall to the OR in isCurrentNATTypePunchable.
How should we proceed?
Should we open an issue about the suggested improvement for dialing (fail if the node to which dialing successfully connected is not the node it wanted to reach, allowing fallbacks to other transports)?
Should I open a PR adding NATSymmetricUDPFirewall to the OR in isCurrentNATTypePunchable?
In general, should we rename NATSymmetricUDPFirewall to SymmetricUDPFirewall since there is no NAT involved? However, that would require changes both to Syncthing and the go-STUN module.
@schnappi FWIW, you should get all devices to use direct connections already now if you configure the single machine that has a port forward to use a non-default port (and then forward that non-default port), leaving all other devices at default ports.
This should make all hosts outside of the NAT fail when connecting to your external IP and standard port (trying to reach devices behind the NAT), so they should use QUIC.
After the issue is fixed, it should work without any changes IIUC.
I reported the exact same thing some time ago ([Bug?] Not all addresses are dialed). For me it was just a test (all my devices can use TCP), so I did not care too much; I mainly wanted to see how well QUIC works.