Syncing where one node is on VPN not working after restart

Hi,

I’ll try to explain this in a simple way without complicating it too much.

Server A

  • VPS

Server B

  • FreeNAS running bsd jail.
  • Connected to a openvpn access server and routing all traffic through the tunnel.
  • No ports open in router for this local ip (but running another unrelated syncthing instance that works so port 22000 is open although it shouldn’t affect this as the ip is changed due to the vpn).

Server A is pushing files one-way to Server B, and it seems to be working (once it got as far as transferring 150gb before I restarted it) as long as syncthing hasn’t been restarted (due to crash/whatever) on either of the two servers. If the connection is broken, the console of server b is filled with closed: EOF or closed: ping timeout and the console of server a server a is filled with closed: ping timeout or connection reset by peer. VPN connection is not dropping, and those messages are appearing approximately every 2 minutes.

Restarting doesn’t work, but removing the config files and do a completely new install seems to be working, until I restart syncthing where the error messages above starts to appear again.

Both servers are running 0.9.0 but I’ve also tried the go build, same issues.

I’m not sure why this happens and what I can do about it. Seeing as the other, unrelated, syncthing setup is working I’m led to believe it has to do with the vpn – but then again, I really don’t know.

Thank you in advance.

Tricky. What you could do is run one side with STTRACE=net,protocol syncthing. This is going to generate a lot of data. When the situation above occurs, that log data for the time between one disconnect to the next (i.e. covering a node connecting, then timing out or getting reset and connecting again) might show what’s going on.

When the connection is up and working, does it transfer data very slowly or does it seem to be able to shuffle data over the VPN ok?

Thank you for the reply @calmh,

It maxes my connection (home is 10/10 and vps is far higher). I limit it to 5mbit/s on server a as I want to be able to do other things from home at the same time. It keeps steady at whatever I choose until restart.

I’m running STTRACE on server a right now. Is there something specific you are looking for? I found this:

  • DEBUG: ADUP5DL[…] ping skipped after rd 59.783147873s
  • DEBUG: read header {0 3 4 false} (msglen=0)
  • DEBUG: read 0 bytes
  • DEBUG: message data:
  • DEBUG: ADUP5DL[…] ping skipped after rd 29.828208419s
  • DEBUG: ADUP5DL[…] ping skipped after rd 59.8281677s
  • DEBUG: connect from VPN-IP:55822
  • INFO: Connected to already connected node (ADUP5DL[…])

After writing this, the same thing has happened again after displaying loads of hexa code, always 29s then 59s. In the middle of everything I occasionally get that its trying to push files and then it gets ping skipped.

On sever b, STTRACE gives a lot of ping timeout, but here’s a few lines that seems to happen over and over again.

  • DEBUG: dial JMPIBZE[…] SERVER-A-IP:22000
  • INFO: Established secure connection to JMPIBZE[…] at 5.5.8.5:20699-SERVER-A-IP:22000
  • DEBUG: cipher suite C02F
  • DEBUG: write compressed message; {0 1 0 true} (len=104)
  • DEBUG: wrote 112 bytes on the wire
  • DEBUG: write compressed message; {0 2 1 true} (len=829)
  • DEBUG: wrote 837 bytes on the wire
  • DEBUG: JMPIBZE[…] ping skipped after wr 30.011017658s
  • DEBUG: JMPIBZE[…] ping ->
  • DEBUG: write empty message; {0 3 4 false}
  • DEBUG: wrote 8 bytes on the wire
  • INFO: Connection to JMPIBZE[…] closed: ping timeout

Few things:

  • I tried turning off compression and restarting and it didn’t help.
  • It seems to create a connection with 5.5.8.5. That’s a local openvpn IP, right? Pinging that can’t possibly work. Could it be syncthing tries to send files over the wrong interface? Ifconfig shows that 5.5.8.5 is used by tun0 on server b. On the VPN as0t4 has 5.5.8.5 and br0 has the real external ip.
  • Is it possible that it hangs up when trying to transfer files thats already there on the receiver, albeit incomplete versions?

I’ll need the full logs, with timestamps, hex data and all - server b looks most interesting. 5.5.8.5 is the local IP of server b, apparently.

Alright, buddy! I’ll let it run for a few minutes, and then send it in a pm to you due to privacy concerns. I hope that’s okay :smile:

Thanks for the logs. It looked from those like you had built syncthing yourself; can you retry with the latest master (update both sides)?

Apparently I’m stupid to yet again use an unknown build (fresh go get from github) and while it did work at first it stopped after I set a limitation on upload speed.

Today I restarted with http://build.syncthing.net/job/syncthing/lastSuccessfulBuild/artifact/ as you suggested in PM and it started up fine and works flawlessly. Seems a lot more snappy as well.

Seeing as I’m interested, what was the cause of this issue?

Syncthing uses ping/pong packets to check if the peer is alive when there is no recent activity. “Activity” in this case is defined as us having received or sent a BEP message to/from the other node. The activity timestamp is only updated once we have received a complete message (not for every byte along the way). This isn’t a problem normally, because the message size is limited - a single message only contains a limited number of index entries, and only up to 128 KiB of file data.

However, a “limited number of index entries” was hardcoded at 1000. The index entry contains the list of block hashes for the file in question. If the files are very large, the block lists are too. So there’s a relevant difference between sending an index message describing 1000 text files and a message describing 1000 VM images. In your case, it was closer to the latter.

Combine this with a slower uplink, and you can have a message stuck in the outbox (being sent) for several minutes. The ping mechanism will kick in, but the ping is queued behind the large message, so the response isn’t received in time. The connection is closed.

The new code takes message size into account when building index messages, correctly limiting their size.

The relevant clue from your log file was

2014/08/07 16:07:11.518353 protocol.go:511: DEBUG: write uncompressed message; {0 2 1 false} (len=95055492)

That’s a 95 MB message… Enabling compression would probably be a win as well.

Hi Calmh i’m sorry to bother you but i’m kind of a nooby on this could you please explain to me to do the STTRACE=net,protocol syncthing ? THank you very much for your time =)