A few questions about QUIC, hole punching and STUN

bt90 · March 11, 2021, 7:26pm

I quickly skimmed the code and docs, so correct me if I’m wrong:

we currently use STUN on the QUIC port to get our external IPv4+port and detect the NAT type
if the NAT type allows hole punching we keep contacting the STUN server with a delay in order to keep the punched hole open
if we detect that the port changes we gradually reduce the delay from stunKeepaliveStartS (180s) down to stunKeepaliveMinS (20s) to handle NAT implementations with a low timeout.
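
To make the back-off described above concrete, here is a minimal sketch in Go. Only the two option names and their values (180s and 20s) come from this thread. The function name `nextKeepalive` and the halving step are assumptions for illustration, and the actual step size in Syncthing may differ.

```go
package main

import (
	"fmt"
	"time"
)

// Values of the two options mentioned in the post above.
const (
	keepaliveStart = 180 * time.Second // stunKeepaliveStartS
	keepaliveMin   = 20 * time.Second  // stunKeepaliveMinS
)

// nextKeepalive returns the delay to wait before the next STUN check.
// Assumption: the delay is halved each time the external port changed.
// The thread only says the delay is reduced "gradually".
func nextKeepalive(current time.Duration, portChanged bool) time.Duration {
	if !portChanged {
		// The mapping survived the wait, so keep the current delay.
		return current
	}
	next := current / 2
	if next < keepaliveMin {
		next = keepaliveMin
	}
	return next
}

func main() {
	delay := keepaliveStart
	// Simulate a NAT that remaps the port on every single check.
	for i := 0; i < 5; i++ {
		fmt.Println(delay)
		delay = nextKeepalive(delay, true)
	}
}
```

With a NAT that remaps on every check, this sketch steps through 180s, 90s, 45s and 22.5s and then stays at the 20s floor.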

My questions:

do we have anything similar for IPv6? I know that IPv6 doesn’t use NAT but we would still need to punch holes as most firewalls are going to drop incoming traffic for non-established connections
the detection mechanism for IPv4 only detects if the router firewall closed its port. A local firewall might have closed the punched hole a lot earlier, e.g. the Linux netfilter timeout is 30s for non-established UDP “connections”. With the default check delay of 180s the port might be closed at the device itself most of the time. It’s not a real problem as the firewall should be properly configured anyway.

Off-topic: I also noticed that the global discovery server returned duplicates for my device.

calmh · March 11, 2021, 8:52pm

There’s no STUN for IPv6 but, ideally, also no NAPT. Both our QUIC and TCP connections should attempt connections to/from the same port numbers, hopefully sometimes resulting in some hole punching through regular stateful firewalls.

In the end we’re not designed to punch through your local firewalls. Those are under your control, you can adjust them to fit. We’re hoping to punch through the average person’s ISP-delivered lowest-bidder NAT router, using whatever techniques at hand, in most cases.

bt90 · March 11, 2021, 9:13pm

Which should work for the IPv4 case, but I’m not sure that the current dialing frequency is enough to ensure the same for IPv6. Couldn’t we mimic the behavior of our STUN keepalive by switching to more aggressive dialing for quic6?

CGNAT (Dual-Stack Lite) is becoming more and more common here in Germany, which means that Syncthing will only work if the other device has open ports or the user configured their router to allow IPv6 traffic.

calmh · March 11, 2021, 9:18pm

Maybe? I welcome experimentation, especially if you’re in an environment where this matters and you can actually test it out.

bt90 · March 11, 2021, 9:37pm

It should be enough to test if a connection can be established by two devices which are only configured to sync directly via quic6.

Happy to test it

imsodin · March 11, 2021, 9:56pm

I once looked into my ISP and apparently we have DS-Lite too. However it’s crippled: the IPv6 really only exists to tunnel IPv4 packets to the ISP. You can’t get IPv6 addresses for devices (only the modem has one). And from what I could tell that was true for all major ISPs (Switzerland). Unless the situation is different in Germany, there’s thus no potential for a direct quic6 connection at all.

bt90 · March 11, 2021, 10:17pm

That’s strange. E.g. Vodafone offers a proper IPv6 prefix for their cable customers but only a CGNAT IPv4.

This is also the case on mobile networks provided by T-Mobile and Vodafone. O2 seems to be working on a rollout of IPv6.

Edit: just tested it on my phone: http://ipv6.icanhazip.com/ shows the IPv6 assigned by my mobile network provider. No NAT6 or anything like that.

imsodin · March 11, 2021, 10:25pm

Nice, I guess my ISP is just s**t then (actually I knew that already).

bt90 · March 11, 2021, 10:39pm

Oh, you wouldn’t want to switch with the average German ISP customer. We’ll have 100% IPv6 coverage before we’re even near 10% fiber (currently 4.7%…)

Alex · March 12, 2021, 8:47am

I tested it some time ago and it worked for IPv6 (OpenWrt on one side, router of the ISP on the other side, no exceptions configured, both dual stack). When allowing both quic4 and quic6, it even used IPv6 more often if I remember correctly, but I guess that depends on the actual firewall/NAT configurations.

bt90 · March 12, 2021, 10:25am

I think this might also depend a lot on the ISPs and routers which are involved. There’s a reason why the STUN keepalive scales its timeout down.

Ideally we would try to dial at the lower interval used for the STUN keepalive (20s). Is that already the case at the moment?

Edit: e.g. OpenWrt defaults to 60s while the conntrack default for UDP seems to be only 30s.

bt90 · March 13, 2021, 11:01am

Just did a bit of testing using my desktop and my phone:

positive: I was able to get a quic6 connection created via UDP hole punching at some point
negative: the dial delay is probably too high. At least that’s my assumption as it sometimes takes quite some time for the connection to be established. But if I start Syncthing on my phone just a few seconds after my desktop dialed, the connection is established immediately.
I tested the same thing yesterday and I swear that I was not able to get a connection. Could this be traced back to both devices using the same delays and always “missing” the open hole created by the other device?
if I stop syncthing-fork via the launch condition toggle, I can’t find any packets sent from the app to my desktop. Shouldn’t QUIC send something to notify the communication partner that the connection is getting closed?

The pauses between dialing attempts I observed in Wireshark are around 70s, which is probably too high for firewalls to keep the hole open constantly.
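
A rough way to put numbers on this, as a sketch in Go: the 70s dial interval and the 30s and 60s timeouts are the figures mentioned in this thread, while `closedPerCycle` is a hypothetical helper. It assumes a simplified model in which only outgoing packets refresh the firewall state and nothing else flows between the two devices.

```go
package main

import (
	"fmt"
	"time"
)

// Figures quoted in the thread.
const (
	dialInterval     = 70 * time.Second // pause between dial attempts seen in Wireshark
	conntrackTimeout = 30 * time.Second // conntrack default for UDP
	openwrtTimeout   = 60 * time.Second // OpenWrt default
)

// closedPerCycle returns how long the pinhole is closed between two
// outgoing packets sent interval apart, assuming the firewall drops
// idle UDP state after timeout and only outgoing packets refresh it.
func closedPerCycle(interval, timeout time.Duration) time.Duration {
	if interval <= timeout {
		return 0 // the next packet arrives before the state expires
	}
	return interval - timeout
}

func main() {
	for _, timeout := range []time.Duration{conntrackTimeout, openwrtTimeout} {
		closed := closedPerCycle(dialInterval, timeout)
		share := 100 * closed.Seconds() / dialInterval.Seconds()
		fmt.Printf("timeout %v: closed %v of every %v (%.0f%%)\n",
			timeout, closed, dialInterval, share)
	}
}
```

Under this model the hole is closed for about 40s of every 70s with a 30s timeout, and for about 10s of every 70s with a 60s timeout. Dialing at 20s would keep it open in both cases.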

Edit: I’m also starting to wonder if our test criteria for changing the STUN keepalive are really correct. We scale the delay back if our external port changes between two test runs. According to the STUN logging my router always assigns the same external port for my desktop, which is the one I configured for Syncthing (22001). And I did not enable UPnP or open it manually. So the port might actually be closed at some point between keepalive checks and our logic wouldn’t be able to detect that.

calmh · March 13, 2021, 1:08pm

I don’t think it’s intended to? STUN is about NAT traversal and detecting how the port mappings work and if/when they change. The STUN packets aren’t related to the client connections and don’t help to keep any stateful thing open for them.

AudriusButkevicius · March 13, 2021, 1:49pm

That’s not fully true.

If there is no active connection, you gotta send some packets to maintain the mapping, otherwise, how do you expect to establish the connection?

I don’t recall where I got the number from, but I think it was recommended that something is sent via that binding every 14 seconds.

Again, I wrote the code a long time ago so this might be my imagination, but one of the common practices to do that 14 second thing is to do test1 from the STUN spec, which just returns your address from the STUN server, so you also detect mapping changes (in case it does change) as a side effect of trying to keep the mapping alive.
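
For readers unfamiliar with what such a request looks like on the wire, here is a hand-rolled sketch in Go. It assumes the RFC 5389 header layout (the "test1" name comes from the older RFC 3489) and is not how Syncthing builds its STUN traffic. `newBindingRequest` and `unxorPort` are hypothetical helper names.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

const (
	bindingRequest = 0x0001     // STUN message type: Binding Request
	magicCookie    = 0x2112A442 // fixed value defined by RFC 5389
)

// newBindingRequest builds a 20-byte Binding Request without attributes:
// 2 bytes type, 2 bytes attribute length, 4 bytes magic cookie and a
// random 12-byte transaction ID.
func newBindingRequest() ([]byte, error) {
	msg := make([]byte, 20)
	binary.BigEndian.PutUint16(msg[0:2], bindingRequest)
	binary.BigEndian.PutUint16(msg[2:4], 0) // no attributes follow
	binary.BigEndian.PutUint32(msg[4:8], magicCookie)
	if _, err := rand.Read(msg[8:20]); err != nil {
		return nil, err
	}
	return msg, nil
}

// unxorPort recovers the external port from the X-Port field of an
// XOR-MAPPED-ADDRESS attribute, which is the port XORed with the upper
// 16 bits of the magic cookie.
func unxorPort(xport uint16) uint16 {
	return xport ^ uint16(magicCookie>>16)
}

func main() {
	req, err := newBindingRequest()
	if err != nil {
		fmt.Println("could not build request:", err)
		return
	}
	fmt.Printf("binding request: %x\n", req)
	// 0x74E3 is how port 22001 would appear in the X-Port field.
	fmt.Println("external port:", unxorPort(0x74E3))
}
```

Comparing the decoded port between two consecutive responses is what gives the "detect mapping changes as a side effect" behaviour described above.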

bt90 · March 13, 2021, 1:57pm

The default keepalive duration according to my config is 180s (stunKeepaliveStartS) and it scales down to 20s (stunKeepaliveMinS) if ports keep changing. At least if I understood the code correctly.

bt90 · March 13, 2021, 2:01pm

It’s probably better to use 20s by default and simply bail out if the port keeps changing regardless. This should guarantee that the hole stays open constantly.

calmh · March 13, 2021, 2:14pm

I still think we’re talking about two different things here. Yes, STUN should send often enough to keep a stable NAT mapping. We detect that by seeing if the port changed – if it did, we were too slow and we increase the rate.

But this has nothing to do with allowing connection packets in through a stateful firewall. And the “keeping the hole open” discussion refers vaguely to the combination of both things, where STUN doesn’t help with the latter at all.

bt90 · March 13, 2021, 2:32pm

@calmh your point is that even if we have a stable NAT mapping, the firewall of the router might drop incoming packets as the client has a different address than the STUN server?

calmh · March 13, 2021, 2:33pm

If it does firewalling, yes.

bt90 · March 13, 2021, 2:35pm

Good point. So we can only solve this in the dialer as we would need traffic flowing to the actual clients.
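
As a sketch of what "traffic flowing to the actual clients" could look like, the Go snippet below sends a small datagram to the peer address at a fixed interval from the same socket. Everything here is an assumption for illustration: `keepPinholeOpen`, `packetWriter`, `recorder` and `examplePeer` are hypothetical names, and a real dialer would send proper QUIC packets rather than a single byte.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// packetWriter is the part of net.PacketConn this sketch needs.
// A *net.UDPConn satisfies it.
type packetWriter interface {
	WriteTo(p []byte, addr net.Addr) (int, error)
}

// keepPinholeOpen sends count one-byte datagrams to peer, interval apart,
// so that a stateful firewall keeps seeing outgoing traffic to that peer.
func keepPinholeOpen(conn packetWriter, peer net.Addr, interval time.Duration, count int) error {
	for i := 0; i < count; i++ {
		if i > 0 {
			time.Sleep(interval)
		}
		if _, err := conn.WriteTo([]byte{0}, peer); err != nil {
			return fmt.Errorf("punch %d to %v: %w", i, peer, err)
		}
	}
	return nil
}

// recorder is a stand-in for a socket that only remembers destinations,
// so the example runs without touching the network.
type recorder struct{ sent []string }

func (r *recorder) WriteTo(p []byte, addr net.Addr) (int, error) {
	r.sent = append(r.sent, addr.String())
	return len(p), nil
}

// examplePeer uses the IPv6 documentation prefix and an example port.
var examplePeer = &net.UDPAddr{IP: net.ParseIP("2001:db8::1"), Port: 22000}

func main() {
	rec := &recorder{}
	if err := keepPinholeOpen(rec, examplePeer, 10*time.Millisecond, 3); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("sent to:", rec.sent)
}
```

The key difference to the STUN keepalive is the destination: the packets go to the peer itself, so the firewall state that gets refreshed is the one the peer's incoming packets have to match.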