Server stops receiving local announcements from other devices

Hello,

I have following setup:

  • a server running 24x7 (Arch Linux, syncthing 0.11.6) – this should be the central hub for data sync
  • a desktop PC and a notebook, both with Arch Linux and latest syncthing-gtk
  • 2 android devices

On all devices, only local discovery is enabled, UPnP and global discovery are disabled. All other settings (ports etc.) aren’t changed from the default setting.

When I make a fresh start of ST, all devices can see each other and sync the specified directories. So, the network connection is working.

My problem is the server. When I start the desktop PC or the notebook the next day, they don’t connect to the ST instance running on the server. I’ve started syncthing on the server with STTRACE=beacon,discover,net and -verbose. What I see in the log file are

discover.go:423: DEBUG: discover: Removing cached address [xxxx::xxxx:xxxx:xxxx:xxx%enp3s0]:22000 - seen 36h3m43.439290342s ago

and

beacon.go:30: DEBUG: recv 56 bytes from 192.168.1.3:21025
beacon.go:39: DEBUG: dropping message

messages. Any ideas how to further debug this?

Rasu

Sounds like a bug. Any chance there’s something else interesting in the logs between start and this? It seems like the discovery stuff may have run into a problem somewhere along the line.

Looking into the logs I couldn’t find anything interesting.

I made a simple test case with 2 devices (‘server’ and ‘desktop’), 1 shared directory, only local discovery. Only bare syncthing on both devices (not syncthing-gtk). The good (and bad) news is, that it is reproducible. Here are the logs: ‘server’ device: http://pastebin.com/91uSWhyi ’desktop’ device: http://pastebin.com/kU918q3x

Beside the server and the desktop, there were only the 2 syncthing instances on the android devices running in the network (but no sharing enabled with the test case devices). No other syncthing instances on the server or desktop.

Anything interesting would be contained in the ... here:

[TOXD7] 2015/05/30 16:59:35.601583 discover.go:423: DEBUG: discover: Removing cached address [fe80::7a24:afff:fe8a:ca08%enp3s0]:22000 - seen 23m14.657754835s ago
[TOXD7] 2015/05/30 16:59:35.601648 discover.go:423: DEBUG: discover: Removing cached address [fe80::7a24:afff:fe8a:ca08%enp3s0]:22000 - seen 23m14.657833266s ago
.
.
.
[TOXD7] 2015/05/31 07:25:36.071252 discover.go:423: DEBUG: discover: Removing cached address [fe80::7a24:afff:fe8a:ca08%enp3s0]:22000 - seen 14h49m15.127419045s ago
[TOXD7] 2015/05/31 07:25:36.071317 discover.go:423: DEBUG: discover: Removing cached address [fe80::7a24:afff:fe8a:ca08%enp3s0]:22000 - seen 14h49m15.127501784s ago

(i.e. the transition from "receiving announcement"s to "dropping message"s)

Can we have the full log please?

Sure… I cut the middle section due to space restrictions on pastebin. I skimmed through this part and couldn’t find anything of interest.

gziped, uuencoded log file: http://pastebin.com/Gsxk4FRB

Thanks. Unfortunately, no smokign gun there either. Can you, when it is in this bad state, hit it with a QUIT signal (pkill -QUIT syncthing) and paste the resulting, long stack trace? This may answer what the hell it’s doing. (Technically, the routine that processes incoming announcements seems to have died or hung. This is when you stop seeing “Received local announcement from…”. Incoming announcements are buffered for a while, until the buffer is full, then dropped. This is the “dropping message” stuff.)

Here you are:

...
[TOXD7] 2015/05/31 16:57:51.920462 discover.go:423: DEBUG: discover: Removing cached address [fe80::7a24:afff:fe8a:ca08%enp3s0]:22000 - seen 1h15m11.242772982s ago
[TOXD7] 2015/05/31 16:57:51.920521 discover.go:423: DEBUG: discover: Removing cached address [fe80::7a24:afff:fe8a:ca08%enp3s0]:22000 - seen 1h15m11.242912804s ago
SIGQUIT: quit
PC=0x4ce242

goroutine 9 [syscall, 74 minutes]:
syscall.Syscall(0x0, 0x3, 0xc20800f000, 0x1000, 0x0, 0x0, 0x0)
	/usr/lib/go/src/syscall/asm_linux_amd64.s:21 +0x5 fp=0xc208017260 sp=0xc208017258
syscall.read(0x3, 0xc20800f000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/lib/go/src/syscall/zsyscall_linux_amd64.go:867 +0x6e fp=0xc2080172a8 sp=0xc208017260
syscall.Read(0x3, 0xc20800f000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/lib/go/src/syscall/syscall_unix.go:136 +0x58 fp=0xc2080172e8 sp=0xc2080172a8
os.(*File).read(0xc20802c0a0, 0xc20800f000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/lib/go/src/os/file_unix.go:191 +0x5e fp=0xc208017328 sp=0xc2080172e8
os.(*File).Read(0xc20802c0a0, 0xc20800f000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/lib/go/src/os/file.go:95 +0x91 fp=0xc208017380 sp=0xc208017328
bufio.(*Reader).fill(0xc20805e540)
	/usr/lib/go/src/bufio/bufio.go:97 +0x1ce fp=0xc208017428 sp=0xc208017380
bufio.(*Reader).ReadSlice(0xc20805e540, 0x100a, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/lib/go/src/bufio/bufio.go:295 +0x257 fp=0xc208017470 sp=0xc208017428
bufio.(*Reader).ReadBytes(0xc20805e540, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/lib/go/src/bufio/bufio.go:374 +0xd2 fp=0xc208017590 sp=0xc208017470
bufio.(*Reader).ReadString(0xc20805e540, 0xc20802c00a, 0x0, 0x0, 0x0, 0x0)
	/usr/lib/go/src/bufio/bufio.go:414 +0x58 fp=0xc2080175e8 sp=0xc208017590
main.copyStderr(0x7effdc6a9db0, 0xc20802c0a0, 0x7effdc6a9cb0, 0xc20802c008)
	/build/syncthing/src/src/github.com/syncthing/syncthing/cmd/syncthing/monitor.go:169 +0x69 fp=0xc208017790 sp=0xc2080175e8
main.func·012()
	/build/syncthing/src/src/github.com/syncthing/syncthing/cmd/syncthing/monitor.go:113 +0x75 fp=0xc2080177e0 sp=0xc208017790
runtime.goexit()
	/usr/lib/go/src/runtime/asm_amd64.s:2232 +0x1 fp=0xc2080177e8 sp=0xc2080177e0
created by main.monitorMain
	/build/syncthing/src/src/github.com/syncthing/syncthing/cmd/syncthing/monitor.go:115 +0x1231

goroutine 1 [select, 74 minutes]:
main.monitorMain()
	/build/syncthing/src/src/github.com/syncthing/syncthing/cmd/syncthing/monitor.go:130 +0x1c33
main.main()
	/build/syncthing/src/src/github.com/syncthing/syncthing/cmd/syncthing/main.go:380 +0x21d9

goroutine 5 [syscall, 76 minutes]:
os/signal.loop()
	/usr/lib/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
	/usr/lib/go/src/os/signal/signal_unix.go:27 +0x35

goroutine 7 [chan receive]:
main.trackCPUUsage()
	/build/syncthing/src/src/github.com/syncthing/syncthing/cmd/syncthing/gui_unix.go:24 +0xec
created by main.init·2
	/build/syncthing/src/src/github.com/syncthing/syncthing/cmd/syncthing/gui_unix.go:17 +0x25

goroutine 10 [syscall]:
syscall.Syscall(0x0, 0x5, 0xc208078000, 0x1000, 0xa4, 0x1000, 0x0)
	/usr/lib/go/src/syscall/asm_linux_amd64.s:21 +0x5
syscall.read(0x5, 0xc208078000, 0x1000, 0x1000, 0xa4, 0x0, 0x0)
	/usr/lib/go/src/syscall/zsyscall_linux_amd64.go:867 +0x6e
syscall.Read(0x5, 0xc208078000, 0x1000, 0x1000, 0xa4, 0x0, 0x0)
	/usr/lib/go/src/syscall/syscall_unix.go:136 +0x58
os.(*File).read(0xc20802c0b0, 0xc208078000, 0x1000, 0x1000, 0xb0, 0x0, 0x0)
	/usr/lib/go/src/os/file_unix.go:191 +0x5e
os.(*File).Read(0xc20802c0b0, 0xc208078000, 0x1000, 0x1000, 0x7effdc69caf8, 0x0, 0x0)
	/usr/lib/go/src/os/file.go:95 +0x91
bufio.(*Reader).fill(0xc20805e5a0)
	/usr/lib/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).ReadSlice(0xc20805e5a0, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/lib/go/src/bufio/bufio.go:295 +0x257
bufio.(*Reader).ReadBytes(0xc20805e5a0, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/lib/go/src/bufio/bufio.go:374 +0xd2
bufio.(*Reader).ReadString(0xc20805e5a0, 0xc20806120a, 0x0, 0x0, 0x0, 0x0)
	/usr/lib/go/src/bufio/bufio.go:414 +0x58
main.copyStdout(0x7effdc6a9db0, 0xc20802c0b0, 0x7effdc6a9cb0, 0xc20802c008)
	/build/syncthing/src/src/github.com/syncthing/syncthing/cmd/syncthing/monitor.go:210 +0x5d
main.func·013()
	/build/syncthing/src/src/github.com/syncthing/syncthing/cmd/syncthing/monitor.go:119 +0x75
created by main.monitorMain
	/build/syncthing/src/src/github.com/syncthing/syncthing/cmd/syncthing/monitor.go:121 +0x132c

goroutine 11 [semacquire, 74 minutes]:
sync.(*WaitGroup).Wait(0xc20800aac0)
	/usr/lib/go/src/sync/waitgroup.go:132 +0x169
main.func·014()
	/build/syncthing/src/src/github.com/syncthing/syncthing/cmd/syncthing/monitor.go:126 +0x4b
created by main.monitorMain
	/build/syncthing/src/src/github.com/syncthing/syncthing/cmd/syncthing/monitor.go:128 +0x1446

rax     0x0
rbx     0xc20801200c
rcx     0x4ce244
rdx     0x1000
rdi     0x3
rsi     0xc20800f000
rbp     0x3
rsp     0xc208017258
r8      0x0
r9      0x0
r10     0x0
r11     0x246
r12     0xdcb520
r13     0x0
r14     0x17
r15     0x10
rip     0x4ce242
rflags  0x246
cs      0x33
fs      0x0
gs      0x0

BTW: I’m running now syncthing 0.11.7

There are two processes, you should send sigquit to the 2nd process not the first one.

Ah. :frowning: Sorry, my instructions were bad. The above works when started with STNORESTART=1, otherwise you need to find the higher of the two pid:s belonging to syncthing processes and kill that one specifically with a -QUIT. The above is from the monitor process which is innocent in all this.

Hopefully I aimed better this time…

http://pastebin.com/KCuJK8Vg

1 Like

Yes. It’s blocked writing a packet and not reading responses. Not sure why that can happen, but we can handle it with a timeout. Thanks.

calmh,

any interest to dig deeper into this issue? If you add some additional debug facilities, I could test it.

I just tried the build system of my Linux system (Arch Build System) with the syncthing package. Shouldn’t be a problem to change the ‘source’ entry to point to another git repository with a ‘debug’ version of syncthing.

The logs cover the “how” well enough - it’ll get fixed as soon as I have some free time. You can create a github issue if you like, with the log above, as that’ll make it into the change log and give you a heads up when there’s a dev build with the fix to test.

Filed as issue #1907

1 Like

so we use separate routines for reading and writing, hence I am not sure how one can block the other?

I haven’t debugged it yet, but that’s what it looks like in the goroutine trace.

I have the same problem. Have Fedora 22 KDE x64 Linux as desktop, and FreeNAS 9.3 Syncthing plugin on my NAS server. The devices has worked couple of days, but, today, the are offline. The devices are on the same network, same LAN. i have 3 LAN’s, tested with all 3 LANS, no chance that devices come online. The LAN’s are on separate IP’s. Tryed to delete each other from list, and added it after. No way, no responds one to other. I think this is an serious problem. I have TB of data to sync, and, if they stop to discover each other, or after, if i add more devices, and they stop do discovering each other, then, i’m on big problems.

Maybe to add an force discover button, where i can forcing syncthing to discover devices? Maybe for each device separatelly, like shares scan buttons?

This needs a bugfix, not a button. That said, there’s no need for discovery once two devices are connected - they disconnected for some other reason. Maybe you had a network problem and that triggered the above issue…

Also if devices are on separate subnets, they are not on the same LAN, hence local discovery will not work.