Share appears to cause other devices to panic

Hello. I have used Syncthing for a few years now, but I am having a great deal of trouble with a new share I am trying to handle. I looked through the logs and saw references to corruption of certain data blocks, so I assumed the database had somehow been corrupted and needed to be deleted and rebuilt. This is something I have done before on occasion (usually after power failures), but this time the problem went deeper than I anticipated: even after deleting the database, the exact same issue resurfaced after synchronizing for a few days. I deleted the database multiple times to no avail; the corruption always came back.

After searching these forums and GitHub, I saw mentions that repeated corruption could point to a hardware failure (e.g. dying RAM or HDDs or something similar). I therefore swapped out both and tried again. Same problem. Because this seemed so odd, I added additional devices (at first only two had been synchronizing this particular share). Each device I add synchronizes this share for a few days without problems, then eventually starts panicking and ends up in an endless restart loop. So far this has happened on three different devices (admittedly all running old 32-bit Linux versions).

The share in question is large (between 1 and 1½ TB) and contains a significant number of hard and soft links (on files and folders/junctions). There is one master device in send-only mode which holds the full copy; the other devices are all only partially synchronized, and all enter restart loops at some point in the process.

Now, given that the hardware is all old (apart from the hard drives, which are new) and the replacement hard drives came from a similar batch, there is a slight possibility that this is all down to the size of the share and that I am experiencing hardware failure on multiple devices, but I am hoping it might be something else. I am attaching a partial log here; if you need more, please let me know. I have censored a few identifying names and identifiers.

Error.log (315.0 KB)

Ah, and sorry, I used the wrong terminology: by ‘share’ I mean ‘folder’. I was trying to translate the words in question and got tripped up.

panic: leveldb/table: corruption on data-block (pos=256219): checksum mismatch, want=0x39a5fc3b got=0x115b64e8 [file=037534.ldb]

The index database is toast. Can’t say whether due to hardware error, bugs, or bad luck. But you probably need to delete it and start over.
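If you want to confirm the corruption independently of Syncthing before nuking it, below is a minimal sketch that walks the index database read-only with goleveldb, the same library Syncthing embeds, and reports the first checksum error it hits. The path is only an example; point it at wherever your Syncthing home directory actually lives.

// scanindex.go: walk a Syncthing index database read-only and report
// the first corruption error encountered. Uses goleveldb, the library
// Syncthing embeds. The path below is only an example.
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

func main() {
	// Example path; older Syncthing versions keep the index under
	// ~/.config/syncthing/index-v0.14.0.db -- adjust as needed.
	dbPath := os.ExpandEnv("$HOME/.config/syncthing/index-v0.14.0.db")

	db, err := leveldb.OpenFile(dbPath, &opt.Options{
		ErrorIfMissing: true,
		ReadOnly:       true,
		Strict:         opt.StrictAll, // surface checksum mismatches as errors
	})
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer db.Close()

	// Iterate over every key/value pair; a corrupt data block shows up
	// as an iterator error rather than a panic.
	iter := db.NewIterator(nil, nil)
	n := 0
	for iter.Next() {
		n++
	}
	iter.Release()
	if err := iter.Error(); err != nil {
		log.Fatalf("corruption after %d records: %v", n, err)
	}
	fmt.Printf("scanned %d records, no corruption detected\n", n)
}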

That the database needs rebuilding is something I had sort of guessed, but the problem is that no matter how many times I nuke the database and start over, the corruption recurs within a few days. This only happens with this particular large folder, however. Every device other than the master holding the full folder contents synchronizes this folder for a few days and then becomes corrupted, which means in effect that I cannot synchronize this folder with any other device. If it helps with debugging, I can nuke it again and send another log once it re-corrupts itself.

Honestly, this smells like a hardware error, e.g. bad memory. Checksum errors in the database don’t have a known bug cause.

This is why I tried adding more devices; my line of thinking was that it would be unlikely for multiple boxes to have bad memory. In addition to nuking the database on the device whose log I sent you, I will add another, more modern device with completely different hardware (it would also run 64-bit software instead of the 32-bit builds the others are running). If this more modern device also fails in the same way, I think the possibility of a hardware failure could be discarded. If, unlike the other devices, it synchronizes to completion, I will update this post accordingly.

Right, I only saw the one panic. Is it exactly the same on other devices?

Unfortunately I nuked all the logs on the other devices. I think I can dig up the hard drive of the first device that started causing problems, as I replaced it without wiping it first; I should have that for you tomorrow, most likely. I can also run another old device that was partially synchronized before this folder was removed from it, to get an additional data point. I will post these logs as I obtain them over the next couple of days.

Assuming the log you gave to calmh is the server one, did you try temporarily moving the 037534.ldb file and then adding it back? Is this file the culprit in all the logs on the server?

Thanks for the reply, cosas. I did not try removing it and adding it back, and the log is from one server with an incomplete copy trying to synchronize with another server that holds the full copy. Unfortunately, I cannot try this technique at the moment: per Jakob’s suggestion I nuked the database, and I am currently waiting for this client server to rebuild its database and attempt to continue synchronization. I will update this thread once the corruption reappears, or once I obtain the logs from the hard drive (of a different client server) that was replaced while I was trying to identify the problem.

You can’t just remove part of the database and expect things to work.

The client server whose database I nuked a few hours ago finished rebuilding the database and then panicked. I will post the full log in a few hours once I have had a chance to censor a few bits, but the panic lines are as follows:

May  9 12:28:53 servername syncthing[15711]: [V6GRV] INFO: Connection to 0000005-0000005-0000005-0000005-0000005-0000005-0000005-0000005 at 172.16.1.3:33966-192.168.1.9:22000/tcp-client closed: <nil>
May  9 12:28:53 servername syncthing[15711]: panic: leveldb/table: corruption on data-block (pos=692709): checksum mismatch, want=0xc8c1d707 got=0x9f47ea81 [file=000883.ldb]
May  9 12:28:53 servername syncthing[15711]: #011panic: leveldb/table: corruption on data-block (pos=692709): checksum mismatch, want=0xc8c1d707 got=0x9f47ea81 [file=000883.ldb]
May  9 12:28:53 servername syncthing[15711]: #011panic: runtime error: invalid memory address or nil pointer dereference
May  9 12:28:53 servername syncthing[15711]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x848b332]

The only real differences I see between this and the previous log are that this time there were no connections to other devices before the initial folder scan completed, and that the problematic checksum is different. I am still waiting on the log files from the additional client servers; for now this is the same device as in the previous log file.

So you either have a corrupt binary, or bad hardware, as this is like 3 different crashes within the same second.

Isn’t it more likely that, rather than three “different” crashes happening instantaneously, a single issue is causing these messages to appear in sequence like this? The same errors appeared in the same order in the previous log as well.

Potentially, but in reality it doesn’t matter as something is utterly screwed in your setup regardless.

Well, I am awaiting logs from independent devices with different hardware (with binaries downloaded over different networks, even), so I will update you all once I obtain them.

Meanwhile, knowing that I have already tried this on unrelated devices, I am pondering what, if anything, could be different about this shared folder compared with all the others I’ve synchronized without problems in the past. The size is one factor, but perhaps my use of linking could be an issue? Would you happen to know how Syncthing deals with symbolic links that point to locations not present on the current system? For instance, suppose Device_1 has a symbolic link A* that points to file A, and this link A* is copied to Device_2 by direct cloning (not using Syncthing), so Device_2 ends up with A* but no A (which is actually what I want). If Device_2 then synchronizes with Device_3 using Syncthing, will this be handled correctly, and will Device_3 receive A* (which may or may not point to an A on its local filesystem)? I am shooting in the dark here, as I don’t really know how Syncthing operates internally; I can only guess at potential issues that are unique to this shared folder compared with my other folders.
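To make the question concrete, here is a tiny sketch of the two behaviours I am imagining for a link A* pointing at A: hashing only the link’s target string, versus following the link and hashing the file it points to. The paths are made up purely for illustration; I have no idea which (if either) matches what Syncthing actually does.

// Sketch of the two behaviours I am imagining for a symlink A* -> A.
// Paths are invented for illustration only.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"
)

// hashTarget hashes only the link's target string, i.e. the link is
// treated as a piece of data in its own right (one possibility).
func hashTarget(link string) (string, error) {
	target, err := os.Readlink(link)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", sha256.Sum256([]byte(target))), nil
}

// hashReferenced follows the link and hashes the file it points to,
// which would fail (or differ) on a device where A does not exist.
func hashReferenced(link string) (string, error) {
	f, err := os.Open(link) // os.Open follows symlinks
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	link := "/data/share/A-star" // hypothetical symlink pointing at /data/share/A
	if sum, err := hashTarget(link); err == nil {
		fmt.Println("hash of target string:  ", sum)
	} else {
		log.Println("readlink:", err)
	}
	if sum, err := hashReferenced(link); err == nil {
		fmt.Println("hash of referenced file:", sum)
	} else {
		log.Println("open/follow:", err)
	}
}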

I think you should take off your detective hat, and just re-read what was mentioned above.

It literally crashes with memory corruption in the Go runtime, nothing to do with Syncthing’s code.

Your issue is either a corrupt binary or hardware that causes corruption.

I only mention it because I thought that if A* was drilled down into and hashed on Device_2, it would produce one hash, whereas the same operation on Device_3 could produce a different hash if A were hashed rather than A*. But as you say, I am not familiar with Syncthing, so these are just guesses; I am only trying to put out ideas that might be helpful, since you know more about this than I do.

Well, if you think the only possible causes are a corrupt binary or failing hardware, I will be sure to send you the logs from the devices with different hardware as soon as I can, but it may take another day or two. Additionally, I may run stress tests on the drives and memory to see whether such software can find faults in the hardware. Cheers, and thanks for the help.
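In case it is useful to anyone following along, below is a crude sketch of the kind of RAM smoke test I have in mind: fill a large buffer with a known pattern and re-verify it in a loop, complaining on the first mismatch. A dedicated tool like memtest86+ is obviously far more thorough; this is just a quick sanity check I could run on the affected boxes.

// Crude memory smoke test: fill a large buffer with a pattern derived
// from the index, then re-verify it repeatedly. A mismatch suggests
// flaky RAM (or CPU/cache). Not a substitute for memtest86+.
package main

import (
	"fmt"
	"log"
)

func main() {
	const size = 256 << 20 // 256 MiB; adjust to the machine's free RAM
	const passes = 10

	buf := make([]byte, size)
	for i := range buf {
		buf[i] = byte(i*31 + 7) // simple deterministic pattern
	}

	for p := 1; p <= passes; p++ {
		for i := range buf {
			if buf[i] != byte(i*31+7) {
				log.Fatalf("pass %d: mismatch at offset %d: got %#x", p, i, buf[i])
			}
		}
		fmt.Printf("pass %d/%d OK\n", p, passes)
	}
	fmt.Println("no errors detected (which proves little; a failure would prove a lot)")
}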

I would start by redownloading the binary from github on the device you’ve been seeing crashes so far.

Well, as I mentioned in the initial post, I’ve seen crashes for this folder on three different devices, but I will try your suggestion and download Syncthing again on the device that is currently turned off because it is stuck in a boot loop (the one whose logs I posted). Do you think redownloading via “http://apt.syncthing.net/ syncthing release” (in my sources list on Debian) would be any more problematic than manually going to GitHub? I typically just download from there.
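In the meantime, to check for a corrupt binary regardless of where it came from, I suppose I can compare the SHA-256 of the installed executable against the checksum list that, as far as I know, is published alongside each GitHub release. A minimal sketch (the binary path is just an example for the Debian package install):

// Print the SHA-256 of a binary so it can be compared against the
// checksum list shipped with the release. The path is only an example.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	path := "/usr/bin/syncthing" // example path for the Debian package install
	f, err := os.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%x  %s\n", h.Sum(nil), path)
}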