Panic: leveldb/table possibly due to ssd caching?

May2002 · May 7, 2015, 8:25pm

Hi,

Two days ago I added an SSD cache to my laptop and everything seems to be fine… except Syncthing, which stopped working. The first time, rebooting the computer was enough, but then I kept getting panic: leveldb/table: corruption on data-block (…): checksum mismatch…

I did what is advised in this case: I deleted the index and restarted Syncthing (/SyncTrayzor). It began to index everything again, and close to the end it crashed. I tried this several times, including with Syncthing 0.11.0 instead of 0.11.2, since the upgrade from 0.10 had been OK and the synced files had not changed at all in the meantime.

Then, since Audrius said that “the fact that it happens immediately after removing it is suspicious, as it should only happen if some writes are failing” (issue 1297), and since the laptop is only a few weeks old and CrystalDiskInfo said everything was OK, I tried disabling the caching and guess what… Syncthing works again.

I know that this is not due to a Syncthing bug, but I don’t know what conclusion I should draw from it. Is the caching software (HybriDisk) bad? I thought it was simply built around Intel’s RST driver… Or do these things normally happen? Is it safe to put the cache back on?

As an average user I find this a little bit frightening.

Thanks for any hints you can give…

calmh · May 7, 2015, 9:39pm

That’s very interesting (and scary). I’m not aware that the database layer does anything more magic than usual, but it’s not “in house”… I’ve previously seen similar reports for users on btrfs (which has a habit of doing weird stuff and corrupting files, not sure why people keep using it).

One thing that can corrupt databases is if writes are being reordered across synchronization points - that’s not supposed to happen, but maybe the caching layer does something funky…

May2002 · May 7, 2015, 10:03pm

Thanks for your answer, calmh, I have to check if I understood what you said:

By it’s not “in house” you mean that Syncthing uses libraries that are out of your control, right? And they would be the source of the issue (together with the caching routine, though in “enhanced mode”, i.e. read cache only)?

By reordered across synchronization points do you mean moved from one .ldb file to another because of a changed queue order? (maybe I don’t need to really understand this point…)

Can I assume that “the caching layer” operates as such layers normally do, driven by IRST, or do you think that the issue is specific to my hardware/software combination?

AudriusButkevicius · May 8, 2015, 6:12am

I don’t think any reasonable caching is possible at the software level (à la “I installed PC disk booster” or whatever). It has to be either at the driver/kernel level or in a userspace filesystem, if you are using one; the rest is just potentially harmful magic.

If it’s done at the driver/kernel/fuse level, then there is usually no need to install additional software to make it work, or vendor tools are already available.

calmh · May 8, 2015, 7:16am

Yes. Of course this is the case for lots of things, I just meant that I don’t fully know or understand what it does at a low level.

There’s a system call, fsync(), to make sure that things that have been written to a file are actually safely stored on disk. It’s fairly common in database kinds of things that you write a bunch of stuff to disk, then change some pointer somewhere to point to the new data, then sync. If things are reordered somewhere so that the pointer is changed to point to the new data, but the new data isn’t written yet, that’s bad. This should of course never happen; it’s just an example of things that could possibly be screwed up by some broken middle layer doing weird things to writes going to disk.
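To make that ordering concrete, here is a minimal sketch in Python of one common variant of the pattern: sync the new data first, and only then switch the pointer. This is not how Syncthing or its database library actually does it; the function names, the file name POINTER and the .tmp suffix are all made up for illustration.

```python
import os
import tempfile


def write_then_publish(directory, data_name, payload, pointer_name="POINTER"):
    """Write payload durably, then make the (hypothetical) pointer file name it."""
    data_path = os.path.join(directory, data_name)
    with open(data_path, "wb") as f:
        f.write(payload)
        f.flush()             # push Python's own buffer down to the OS
        os.fsync(f.fileno())  # ask the OS to push the data to stable storage

    # Only after the data is synced do we touch the pointer. It is written
    # to a temporary file first and then renamed over the old pointer.
    tmp_path = os.path.join(directory, pointer_name + ".tmp")
    with open(tmp_path, "w") as f:
        f.write(data_name + "\n")
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, os.path.join(directory, pointer_name))
    return data_path


def read_current(directory, pointer_name="POINTER"):
    """Follow the pointer file and return the data it refers to."""
    with open(os.path.join(directory, pointer_name)) as f:
        name = f.read().strip()
    with open(os.path.join(directory, name), "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        write_then_publish(d, "000001.dat", b"first version")
        write_then_publish(d, "000002.dat", b"second version")
        print(read_current(d))
```

The whole scheme relies on every layer below the program honouring the order "data is on disk before the pointer changes". A layer that acknowledges the sync but then persists the pointer ahead of the data would leave a pointer to data that is not there after a crash.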

The hybrid spinning-rust/SSD disks I’ve heard of before handle this in hardware, so not sure what it is you have there?

May2002 · May 8, 2015, 10:14am

Thank you for your detailed answers.

The laptop came with a regular 7200 rpm HDD, but in other countries it was sold with a 5200 rpm HDD + SSD cache. It happens to be done with an mSATA 24 GB Sandisk U100 SSD (mSATA mini, a VERY small thing…), and there was a slot for this on the motherboard (though not easily reachable). Sandisk sells these only to OEMs, but some other manufacturers are more open-minded. The one I bought features a true SSD controller, SATA 3 connectivity… and bundled software so that even computer illiterates can switch their SATA to RAID mode (required by the Intel RST driver) without knowing it, set the cache size they want (for now I chose the whole 128 GB) and get it working. But the cache options are those given by the Intel RST driver. (MyDigitalSSD supercache 2 on amazon) It is the same software they provide with their bigger mSATA SSDs. Other, more capable people do this without the manufacturer’s software, directly with BIOS-related tweaks and Intel’s driver (Intel’s RAID drivers page).

As it seems to work very well (except for Syncthing), I am inclined to think that it really works at the level it should. I will try asking the manufacturer just in case.

Maybe my problem is not related to the SSD at all. That would be a very surprising series of coincidences but that can happen, too.

Thanks again for your support…

jpjp · May 8, 2015, 10:49am

“I’ve previously seen similar reports for users on btrfs (which has a habit of doing weird stuff and corrupting files, not sure why people keep using it).”

For the data block checksums! Yikes. Was this in old kernels or recent kernels?

calmh · May 8, 2015, 11:04am

No idea. It just came up recently and reminded me of another similar thing.

(Checksums are awesome; never leave home without them. ZFS does this without any brokenness, and not just for data blocks. ;)
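For anyone wondering what a per-block checksum buys you, here is a rough sketch in Python. It is only an illustration: it uses zlib.crc32 as a stand-in and a made-up layout (payload followed by a 4-byte checksum), the helper names are hypothetical, and the real leveldb table format differs in its details.

```python
import struct
import zlib


def seal_block(payload: bytes) -> bytes:
    """Append a little-endian CRC32 of the payload (illustrative layout only)."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def open_block(block: bytes) -> bytes:
    """Return the payload, or raise if the stored checksum does not match."""
    if len(block) < 4:
        raise ValueError("block too short")
    payload = block[:-4]
    (stored,) = struct.unpack("<I", block[-4:])
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch")
    return payload


if __name__ == "__main__":
    block = seal_block(b"some index data")
    print(open_block(block))

    # Flip one bit, as a misbehaving layer between program and disk might.
    damaged = bytearray(block)
    damaged[0] ^= 0x01
    try:
        open_block(bytes(damaged))
    except ValueError as err:
        print("detected:", err)
```

The checksum cannot tell you which layer damaged the block, only that what was read back is not what was written. Stopping loudly at that point is still better than silently using bad data.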

Tom · May 8, 2015, 2:03pm

Could there be common ground with this: Panic: leveldb/table: corruption on data-block? That happened to me, also with an SSD…

May2002 · May 9, 2015, 2:48pm

Hi,

So I switched the caching on again and everything was fine. I pushed the rescan all button and it kept working fine. I stopped and restarted: it was OK. I upgraded to 0.11.2: it was OK. I shut down the computer and started it again: it was OK.

Then I stopped Syncthing/Synctrayzor, removed the index directory, and started Syncthing again. It crashed with the now usual error message.

The good news would be that it is reproducible, and that it doesn’t prevent Syncthing from working as long as it does not have to rebuild the database. I will definitely keep an eye on the caching thing!

Again, thank you a lot for your work on this great software.

@Tom: I’m not sure that it makes sense, but maybe you can try building the database on another device and see what happens. If the folders are the same, you should be able to copy-paste it onto your SSD without problems. It is a trick for making it work on “weak” devices, but check that out on the forum first!!!

May2002 · May 10, 2015, 9:28pm

End of my story (if anyone gets into this kind of thing…):

The cache software only did it… the soft way. Actually Acronis didn’t like it either, so I uninstalled it (HybriDisk, not Acronis). Everything is fine and slow again.

Though in the end I found somewhere on my hard drive the Intel RST driver and the manufacturer’s layer to do things seriously, I don’t have the guts to do what it takes to switch SATA from AHCI to RAID. So no SSD cache for me, but for now a nice way to accelerate my favorite stuff… until I buy a big SSD and make it my system drive (… and most probably I won’t do it myself).

After what I have read on how quickly caching burns out the SSD, and since mine would be so hard to replace, I’m not even sorry about it not working…