Connection dropped in loop with "decompressing message: lz4: invalid source or destination buffer too short"

I’ve experienced this strange problem just today. Basically, after updating a few files on one device, one of the remote devices started to disconnected when trying to sync them to it. It seems to be stuck in some kind of a loop, where the connection is established, the GUI shows the folder as “preparing to sync”, and then the connection drops with no apparent reason. This cycle appears to continue with no end.

I’m attaching debug (connections,db,model) logs from the receiving device below. Since this is a real machine and my actual files, what I did first was to pause all other devices and folders, and then ignore all files in the affected folder except for the few that seem to be affected. Hopefully this way the logs will contain only relevant information without too much unrelated noise.

I’m attaching the full log below, but this is the message that took my attention.

[XZZTD] 2022/08/02 15:17:11.825260 model.go:1730: INFO: Connection to D4DZUGP at 10.0.0.56:22000-10.0.0.51:22000/tcp-client/TLS1.3-TLS_CHACHA20_POLY1305_SHA256 closed: decompressing message: lz4: invalid source or destination buffer too short

The sending device is “D4DZUG” and the receiving device is “XZZTDIY”. Both devices are set to compress “All Data”.

syncthing.log (119.0 KB)

Edit: One piece of information that may or may not be relevant is that Syncthing on the sending side “D4DZUG” was updated from 1.20.3 to 1.20.4 while the initial transfer was still going on. The device has “Send Full Index On Upgrade” enabled, so there was a full index exchange after the upgrade. The receiving side “XZZTDIY” was still on 1.20.3 all the time. I tried to upgrade it to 1.20.4 later, but it didn’t make any difference. At the moment, both devices are running 1.20.4, and the debug log comes from this version too.

I’ve never seen it before. My initial guess would be that the connection was interrupted or tampered with (antivirus, proxies, bad drivers, bad memory, etc).

Not sure if that is possible without causing TLS errors in case of proxies/antivirus.

Bad memory or drivers are valid points though.

Thank you for the replies (and possible culprits)! There’s no antivirus running on these devices, so that’s out of the equation. When it comes to memory, there’ve never been any issues that would indicate it being problematic, so I’ve got doubts regarding this one (and the RAM itself is soldered to the motherboard on this device anyway).

However, it’s actually true that the WLAN driver is quite funky. It’s a built-in solution from Broadcom, and to be honest, it’s never been 100% stable. I think it currently runs the old driver that’s installed by default through Windows Update. I did try a few newer versions of the driver in the past, but from my memory, they actually led to the OS freezing with a bluescreen, so I ended up reverting to the default one.

I’ll try to reproduce the issue in my test environment, but if that doesn’t work, I’ll probably try installing different/newer Wi-Fi drivers on the device again. At the moment, the error prevents any synchronisation from happening between the two devices. I’ve paused the folder for now, so that everything else can work as usual.

This isn’t about the driver. I’ve managed to reproduce the problem in the test environment.

The issue seems to be related to using the i386 release of Syncthing. In the very same environment, it doesn’t occur when switching to the amd64 release. It also occurs only when using the “All Data” compression. I’ve actually narrowed it down to syncing a specific file.

I’ve opened a new issue on GitHub at https://github.com/syncthing/syncthing/issues/8484. Could this be a bug in the i386 version of the lz4 library? Do you think it could be worthwhile to open an issue on their GitHub tracker?

1 Like

Ugh, i386. I guess try to reproduce it with just the lz4 package, and if so it’s a bug there. Given the circumstances it sounds like it might be triggered by (de)compressing a large block, perhaps.

1 Like

I’ll see what comes up from testing the lz4 package separately. However, what I’ve just tried was to revert the most recent commit related to it (https://github.com/syncthing/syncthing/pull/8397), and without it the syncing works! So, it seems it’s not the whole package that’s broken but probably rather a very recent change.

Maybe related: https://github.com/pierrec/lz4/issues/189?

Edit:

I compiled lz4c from https://github.com/pierrec/lz4 locally (to be exact, I cross-compiled the i386 version under Windows x64).

I then compressed and decompressed the very same file using

.\lz4c.exe compress .\libsyncthing.so
.\libsyncthing.so.lz4 151.24%
.\lz4c.exe uncompress .\libsyncthing.so.lz4
.\libsyncthing.so.lz4 25262912

but everything went fine. Should I enable any of the subcommands for lz4c to make it run the same as what’s built into Syncthing?

Usage of lz4c:
  -version
        print the program version

Subcommands:
Compress the given files or from stdin to stdout.
compress [arguments] [<file name> ...]
  -bc
        enable block checksum
  -l int
        compression level (0=fastest)
  -sc
        disable stream checksum
  -size string
        block max size [64K,256K,1M,4M] (default "4M")

Uncompress the given files or from stdin to stdout.
uncompress [arguments] [<file name> ...]
1 Like

I’m not sure what the tool does. We do this:

There’s no simple way to test this outside of Syncthing, is it? At least I’m out of ideas on how to it (without building my own Go program for this very purpose or somehow modifying the lz4 source code).

There are only 24 commits between https://github.com/pierrec/lz4/compare/v4.1.13…v4.1.15, so it must be one of them that causes the issue. Maybe someone more skillful will be able to find the actual culprit and fix it.

I’ve now simply reverted the v4.1.15 lz4 package update and compiled Syncthing with v4.1.13 in my own builds, so at least I hope I don’t stumble on the issue again, as it happening means a complete breakage of synchronisation between the two devices.

Here’s what I would do to debug it. If the error is on decompression, add some code there to trigger on the error and write the message it’s trying to decompress to disk. Eyeball it to see if there’s something odd about it. Write a test case that just decompresses that data and see if it passes or fails on 64 or 32 bit.

For what it’s worth, compressing/decompressing large messages (32 MiB, which is larger than we ever generate) doesn’t fail on linux-386, based on this slightly modified protocol test:

func TestLZ4Compression(t *testing.T) {
	for i := 0; i < 10; i++ {
		dataLen := 32<<20 + rand.Intn(150)
		data := make([]byte, dataLen)
		for i := 0; i < dataLen-1024; i += 1024 {
			if _, err := io.ReadFull(rand.Reader, data[i:i+768]); err != nil {
				t.Fatal(err)
			}
		}

		comp := make([]byte, lz4.CompressBlockBound(dataLen))
		compLen, err := lz4Compress(data, comp)
		if err != nil {
			t.Errorf("compressing %d bytes: %v", dataLen, err)
			continue
		}

		res, err := lz4Decompress(comp[:compLen])
		if err != nil {
			t.Errorf("decompressing %d bytes to %d: %v", len(comp), dataLen, err)
			continue
		}
		if len(res) != len(data) {
			t.Errorf("Incorrect len %d != expected %d", len(res), len(data))
		}
		if !bytes.Equal(data, res) {
			t.Error("Incorrect decompressed data")
		}
		t.Logf("OK #%d, %d -> %d -> %d", i, dataLen, compLen, dataLen)
	}
}

I will try to do more testing using the code from above in the next few days. I’m still very curious why it fails only with the specific files, while everything else syncs correctly. In the test environment, I specifically used that file only (which is linked in the GitHub issue), and it failed with it immediately.

For the record, on the real device, there’re 130 GB worth of data, and yet only trying to sync the specific four files failed like that.

With almost nobody using i386, almost nobody using compress-always, and almost no files affected, it seem quite rare :slight_smile:

1 Like

Does this also affect other 32bit platforms like arm?

Yeah, that’s true, but the problem is that just one file failing like that causes the whole synchronisation to go basically kaput. I do want to give Syncthing props for still trying to sync other files in the short period when the two devices are connected before the next drop.

At the moment, no-one knows probably. The file that causes the breakage is available, so feel free to test if you’ve got access to such platforms. I can possibly test ARM32 on Android later on.

Ah, let me give it a spin. On first read of your issue I thought the libsyncthing.so was you special version of Syncthing that had the problem… :facepalm:

Oops, I’m sorry if the issue description wasn’t clear enough :crazy_face:.

libsyncthing.so is just a file that I was trying to sync, and then everything broke down. I can reproduce using the current main, so this is by no means related to my custom version :wink:.

1 Like

Yeah, it was easy to reproduce in the lz4 package, I filed a pull request with a test case. There were also a couple of other tests that failed on linux-386…

I’ve managed to narrow down the issue to this commit: