Problem with Private Relay Server (persists)

This topic is closed to replies but I experience the very same behaviour.

If I start relay with -network-timeout anything larger than 2 minutes, it is ignored and I get the EOF error like the previous OP got (2024/09/12 16:08:10 listener.go:232: Closing connection N74QTL7-GVVDDQK-OPTDSXQ-DEGHCK7-ZB7VPOS-ZIKWZJ5-5WKHG3L-2SGOVAK: EOF)

If I make it less than 2 minutes, it ends the connection with a different message (2024/09/12 15:30:51 listener.go:232: Closing connection N74QTL7-GVVDDQK-OPTDSXQ-DEGHCK7-ZB7VPOS-ZIKWZJ5-5WKHG3L-2SGOVAK: read tcp 10.x.x.x:22067->10.y.y.y:39044: use of closed network connection)

tcpdump ran on the linux host running the strelay show evidence that strelaysrv is voluntarily ending the tcp session, it is not a firewall/networking root cause; also right after line 232 in listener.go there is an explicit conn.Close()

Establishing manual netcat tcp connections over the same 22067 port remain open.

It behaves as if somewhere there has remain a hardcoded value of 2 minutes that appears to ignore the value from command line.

15:52:42.357624 IP 10.y.y.y.39528 > 10.x.x.x.22067: Flags [.], ack 1, win 251, options [nop,nop,TS val 848512897 ecr 2745209779], length 0
15:52:42.357711 IP 10.x.x.x.22067 > 10.y.y.y.39528: Flags [.], ack 1, win 503, options [nop,nop,TS val 2745224997 ecr 848498016], length 0
15:52:42.395025 IP 10.x.x.x.22067 > 10.y.y.y.39528: Flags [.], ack 1, win 503, options [nop,nop,TS val 2745225034 ecr 848498016], length 0
15:52:42.676215 IP 10.y.y.y.39528 > 10.x.x.x.22067: Flags [.], ack 1, win 251, options [nop,nop,TS val 848513229 ecr 2745224997], length 0

15:52:57.483064 IP 10.y.y.y.39528 > 10.x.x.x.22067: Flags [.], ack 1, win 251, options [nop,nop,TS val 848528189 ecr 2745224997], length 0
15:52:57.483103 IP 10.x.x.x.22067 > 10.y.y.y.39528: Flags [.], ack 1, win 503, options [nop,nop,TS val 2745240122 ecr 848513229], length 0
15:52:57.754959 IP 10.x.x.x.22067 > 10.y.y.y.39528: Flags [.], ack 1, win 503, options [nop,nop,TS val 2745240394 ecr 848513229], length 0
15:52:57.794872 IP 10.y.y.y.39528 > 10.x.x.x.22067: Flags [.], ack 1, win 251, options [nop,nop,TS val 848528510 ecr 2745240122], length 0

15:53:09.937007 IP 10.y.y.y.39528 > 10.x.x.x.22067: Flags [P.], seq 1:25, ack 1, win 251, options [nop,nop,TS val 848540648 ecr 2745240122], length 24
15:53:09.937115 IP 10.x.x.x.22067 > 10.y.y.y.39528: Flags [.], ack 25, win 503, options [nop,nop,TS val 2745252576 ecr 848540648], length 0
15:53:09.937388 IP 10.x.x.x.22067 > 10.y.y.y.39528: Flags [P.], seq 1:25, ack 25, win 503, options [nop,nop,TS val 2745252577 ecr 848540648], length 24
15:53:09.937421 IP 10.x.x.x.22067 > 10.y.y.y.39528: Flags [F.], seq 25, ack 25, win 503, options [nop,nop,TS val 2745252577 ecr 848540648], length 0
15:53:09.942953 IP 10.y.y.y.39528 > 10.x.x.x.22067: Flags [F.], seq 25, ack 1, win 251, options [nop,nop,TS val 848540648 ecr 2745240122], length 0
15:53:09.943040 IP 10.x.x.x.22067 > 10.y.y.y.39528: Flags [.], ack 26, win 503, options [nop,nop,TS val 2745252582 ecr 848540648], length 0
15:53:09.948969 IP 10.y.y.y.39534 > 10.x.x.x.22067: Flags [S], seq 1876806255, win 29200, options [mss 1205,sackOK,TS val 848540649 ecr 0,nop,wscale 7], length 0
15:53:09.949085 IP 10.x.x.x.22067 > 10.y.y.y.39534: Flags [S.], seq 3420800168, ack 1876806256, win 64422, options [mss 1205,sackOK,TS val 2745252588 ecr 848540649,nop,wscale 7], length 0
15:53:09.989130 IP 10.y.y.y.39528 > 10.x.x.x.22067: Flags [R], seq 2494221642, win 0, length 0

and the same as the OP, if I lower the ping less than 2 minutes, the connection remains stable. but I need to run this over unstable cellular data links, so 2 minutes is a bit too small. anyways, my perception of the behaviour is that the argument does not perform as documented

Did you try running with -debug to see what it says?

I can’t see anything hardcoded in the relay. Syncthing itself also has timeouts, as well as the kernel/os.

debug shows the same logs the OP from the other mentioned post; after two minutes, the relay EOF the client and sends a FIN tcp packet:

  1. with ping-interval LESS than 2 minutes, everything ok (but I want way more than 2 minutes):

relay:

strelaysrv -pools="" -ext-address=rsync:22067 -ping-interval=1m50s -network-timeout=1h -status-srv="" -protocol tcp -listen rsync:22067 -per-session-rate 1024 -debug

2024/09/18 16:53:56 listener.go:48: Listener accepted connection from 10.x.x.x:42598 tls true
2024/09/18 16:53:57 listener.go:117: Message protocol.JoinRelayRequest from N74QTL7-GVVDDQK-OPTDSXQ-DEGHCK7-ZB7VPOS-ZIKWZJ5-5WKHG3L-2SGOVAK
2024/09/18 16:55:48 listener.go:117: Message protocol.Pong from N74QTL7-GVVDDQK-OPTDSXQ-DEGHCK7-ZB7VPOS-ZIKWZJ5-5WKHG3L-2SGOVAK

client:

[N74QT] 2024/09/18 16:55:47.127047 static.go:81: DEBUG: StaticClient:0xc000264e10@relay://rsync:22067/?id=QKNYIZC-RIY7HKO-GGL6TQA-F7IK6EO-JFRIMGU-NXW7TKS-54MUSD2-3IE43QZ&networkTimeout=24h0m0s&pingInterval=1h0m0s received message protocol.Ping
[N74QT] 2024/09/18 16:55:47.127165 static.go:89: DEBUG: StaticClient:0xc000264e10@relay://rsync:22067/?id=QKNYIZC-RIY7HKO-GGL6TQA-F7IK6EO-JFRIMGU-NXW7TKS-54MUSD2-3IE43QZ&networkTimeout=24h0m0s&pingInterval=1h0m0s sent pong
  1. with ping-internal any single second larger than 2 minutes, relay ends the connection after 2 minutes, as if network-timeout was still 2 minutes

relay:

$ strelaysrv -pools="" -ext-address=rsync:22067 -ping-interval=1h -network-timeout=1h -status-srv="" -protocol tcp -listen rsync:22067 -per-session-rate 1024 -debug
2024/09/18 17:04:16 main.go:141: strelaysrv v1.22.1 "Fermium Flea" (go1.19.2 linux-amd64) teamcity@build.syncthing.net 2022-11-02 06:27:53 UTC
2024/09/18 17:04:16 main.go:147: Connection limit 3276
2024/09/18 17:04:16 main.go:186: ID: QKNYIZC-RIY7HKO-GGL6TQA-F7IK6EO-JFRIMGU-NXW7TKS-54MUSD2-3IE43QZ
2024/09/18 17:04:16 main.go:259: URI: relay://10.y.y.y:22067/?id=QKNYIZC-RIY7HKO-GGL6TQA-F7IK6EO-JFRIMGU-NXW7TKS-54MUSD2-3IE43QZ&networkTimeout=1h0m0s&pingInterval=1h0m0s&sessionLimitBps=1024
2024/09/18 17:06:37 listener.go:48: Listener accepted connection from 10.10.x.x.x:42630 tls true
2024/09/18 17:06:37 listener.go:117: Message protocol.JoinRelayRequest from N74QTL7-GVVDDQK-OPTDSXQ-DEGHCK7-ZB7VPOS-ZIKWZJ5-5WKHG3L-2SGOVAK
2024/09/18 17:06:51 listener.go:232: Closing connection N74QTL7-GVVDDQK-OPTDSXQ-DEGHCK7-ZB7VPOS-ZIKWZJ5-5WKHG3L-2SGOVAK: EOF

client:

[N74QT] 2024/09/18 17:06:36.139071 static.go:121: DEBUG: StaticClient:0xc00009a730@relay://rsync:22067/?id=QKNYIZC-RIY7HKO-GGL6TQA-F7IK6EO-JFRIMGU-NXW7TKS-54MUSD2-3IE43QZ&networkTimeout=24h0m0s&pingInterval=1h0m0s timed out
[N74QT] 2024/09/18 17:06:36.139144 static.go:174: DEBUG: StaticClient:0xc00009a730@relay://rsync:22067/?id=QKNYIZC-RIY7HKO-GGL6TQA-F7IK6EO-JFRIMGU-NXW7TKS-54MUSD2-3IE43QZ&networkTimeout=24h0m0s&pingInterval=1h0m0s disconnecting
[N74QT] 2024/09/18 17:06:36.139474 relay_listen.go:65: INFO: Relay listener (relay://rsync:22067/?id=QKNYIZC-RIY7HKO-GGL6TQA-F7IK6EO-JFRIMGU-NXW7TKS-54MUSD2-3IE43QZ&networkTimeout=24h0m0s&pingInterval=1h0m0s) shutting down
[N74QT] 2024/09/18 17:06:36.139627 svcutil.go:196: INFO: listenerSupervisor@relay://rsync:22067/?id=QKNYIZC-RIY7HKO-GGL6TQA-F7IK6EO-JFRIMGU-NXW7TKS-54MUSD2-3IE43QZ&networkTimeout=24h0m0s&pingInterval=1h0m0s: service relay://rsync:22067/?id=QKNYIZC-RIY7HKO-GGL6TQA-F7IK6EO-JFRIMGU-NXW7TKS-54MUSD2-3IE43QZ&networkTimeout=24h0m0s&pingInterval=1h0m0s failed: timed out
[N74QT] 2024/09/18 17:06:36.139748 svcutil.go:199: DEBUG: listenerSupervisor@relay://rsync:22067/?id=QKNYIZC-RIY7HKO-GGL6TQA-F7IK6EO-JFRIMGU-NXW7TKS-54MUSD2-3IE43QZ&networkTimeout=24h0m0s&pingInterval=1h0m0s: Failed service 'relay://rsync:22067/?id=QKNYIZC-RIY7HKO-GGL6TQA-F7IK6EO-JFRIMGU-NXW7TKS-54MUSD2-3IE43QZ&networkTimeout=24h0m0s&pingInterval=1h0m0s' (1.000000 failures of 2.000000), restarting: true, error: timed out
[N74QT] 2024/09/18 17:06:36.139900 relay_listen.go:57: INFO: Relay listener (relay://rsync:22067/?id=QKNYIZC-RIY7HKO-GGL6TQA-F7IK6EO-JFRIMGU-NXW7TKS-54MUSD2-3IE43QZ&networkTimeout=24h0m0s&pingInterval=1h0m0s) starting

this is clearly not kernel/os but syncthing or whatever down frameworks it uses. I can run a tcp netcat session for days (by altering the default net.ipv4.tcp_keepalive_time = 7200 to something higher)

There’s a hardcoded message timeout of exactly 2 minutes in syncthing’s static relay client code, which will incur exactly the symptoms you’re seeing:

1 Like

I noticed that too before, but I (wrongly ?) assumed it means variable time.Minute multiplied by two

1 Like

Yes, that’s correct. In golang, time.Minute represents one minute in time. Multiply that by two and you get two minutes.

great. I’ll recompile with a larger value and see what happens.

1 Like

Why are you hitting that timeout? Isn’t there a keepalive that happens more often than that?

I don’t know why that timeout fires. I explicitly asked both client and strelaysrv for much higher ping interval and network timeout, but it is ignored by strelaysrv

$ strelaysrv -pools="" -ext-address=rsync:22067 -ping-interval=1h -network-timeout=1h -status-srv="" -protocol tcp -listen rsync:22067 -per-session-rate 1024 -debug

@firesnake has increased the ping interval to > 2 minutes, so there’s no ping-pong keep alive send in a two-minute interval. This then triggers the message timeout in the client’s relay code, which expects a message at least every 2 minutes.

Sure, but we assume that the syncthing internal protocol ping for connection liveness is less than 2 minutes, which I assume you’ve modified hence this falls over.

the trigger is in the relay server, not in client. the strelaysrv is the one who ends the tcp session

Not according to this log, which clearly shows the relay client disconnecting due to hitting the 2 minute message timeout:

This also confirmed by the relay logs:

EOF indicates that the other side has closed the connection, which is consistent with logs.

The ping interval is configurable within the relay server. By default it’s 1 minute

But @firesnake has set it to values like “1 hour”

which then conflicts with the hardcoded 2 minute value in the client.

1 Like

but what I am observing with tcpdump is the strelaysrv is the one sending a FIN packet over the wire to the client.

Well, the timestamps of that event don’t seem to match the timestamps from the logs, so it’s possible that we’re seeing multiple disconnects here for different reasons. All I can say is that the logs clearly show the relay client hitting its hardcoded 2 minute timeout, which is to be expected if there isn’t a ping send in 2 minutes.

What happens if you increase the hardcoded 2 minute timeout? Does the connection then stay alive for more than 2 minutes or not?

I don’t have a go build environment available right now so I cannot rebuild :frowning:

It sends a fin as the other half of the connection hung up.