Crash due to checksum mismatch

collo · February 16, 2023, 4:42pm

My Syncthing crashes on start. The last log messages are:

[6WOWP] 17:02:33 INFO: Device ...................
[6WOWP] 17:02:34 INFO: Detected 1 NAT service
[6WOWP] 17:02:34 INFO: Relay listener (dynamic+https://relays.syncthing.net/endpoint) shutting down
[6WOWP] 17:02:34 INFO: QUIC listener ([::]:22000) shutting down
[6WOWP] 17:02:34 INFO: TCP listener ([::]:22000) shutting down
[6WOWP] 17:02:39 INFO: Failed to send failure report: Post "https://crash.syncthing.net/newcrash/failure": context deadline exceeded
[6WOWP] 17:02:39 INFO: Exiting
[6WOWP] 17:02:39 WARNING: Syncthing stopped with error: adding "..................." (_____-_____): recalculating metadata: leveldb/table: corruption on data-block (pos=3092118): checksum mismatch, want=0x598593f8 got=0x4a6cad20 [file=000081.ldb]
[monitor] 17:02:39 INFO: Syncthing exited: exit status 1
[monitor] 17:02:40 WARNING: 4 restarts in 35.853625791s; not retrying further

What I tried so far:

- Removed ~/.config/syncthing (and manually set the folder ID so that I would not have to synchronize everything again; this is the same directory where the error occurs). This led to the same problem.
- Checked whether something is wrong with my hard disk:
  - fsck → no problems detected
  - smartctl → “No Errors Logged”

I would appreciate any help in figuring out what causes the problem…

tomasz86 · February 16, 2023, 5:44pm

These kinds of errors usually happen due to faulty hardware. In addition, I’ve also had the database corrupted, e.g. after the OS had frozen.

Do you mean that even after removing the database and starting from scratch you still get the same error? If yes, then it would probably be beneficial to track down the exact file(s) responsible for the corruption… but only if you’re 100% sure that the hardware (and specifically the disk) is fine.

Can you provide a full SMART log (e.g. something like what’s shown in https://superuser.com/questions/1171760/how-to-determine-how-dead-a-hdd-is-from-smartctl-report)?

collo · February 16, 2023, 8:35pm

Thank you for your answer.

Yes, I removed the whole database (I suppose everything is stored in ~/.config/syncthing). However, I kept using the same synchronized folders as before. How could I track down the file responsible?

I am not sure whether it’s a hardware failure, but what I tested didn’t confirm it either.

The output of smartctl -a /dev/sda is:

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.10.0-18-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5 (SMR)
Device Model:     ST2000DM008-2FR102
Serial Number:    ----------
LU WWN Device Id: ----------
Firmware Version: 0001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
TRIM Command:     Available
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Feb 16 21:28:47 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 204) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x30a5)	SCT Status supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   076   051   006    Pre-fail  Always       -       243799542
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       790
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   045    Pre-fail  Always       -       456920846
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5894h+32m+39.203s
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       1017
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       0 3 3
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   055   040    Old_age   Always       -       33 (Min/Max 19/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       71
193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       4595
194 Temperature_Celsius     0x0022   033   045   000    Old_age   Always       -       33 (0 16 0 0 0)
195 Hardware_ECC_Recovered  0x001a   084   064   000    Old_age   Always       -       243799542
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       5796h+52m+00.880s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       7102046949
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       15313037684

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5621         -
# 2  Extended offline    Interrupted (host reset)      00%      5615         -
# 3  Short offline       Completed without error       00%      5613         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

gadget · February 16, 2023, 9:57pm

collo:

The output of smartctl -a /dev/sda is:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   076   051   006    Pre-fail  Always       -       243799542

Raw_Read_Error_Rate seems unusually high given fewer than 6,000 power-on hours, but it could just be due to the drive using SMR (Shingled Magnetic Recording).

Besides fsck and smartctl, it’s also a good idea to run a memory test (Memtest86+ is bundled with many Linux install discs – https://memtest.org/).

calmh · February 17, 2023, 6:37am

It’s not likely to be a particular file that causes this, but rather some sort of system issue. As gadget says, it could be bad RAM, or in some cases we’ve seen issues with device drivers (on Windows). Syncthing often provokes these kinds of issues, possibly because it sometimes generates a fair amount of load, and because it’s quite picky about checksumming everything, which is not always the norm.

tomasz86 · February 17, 2023, 7:40am

Just for the record, the crazy numbers in Raw_Read_Error_Rate and Seek_Error_Rate are normal for Seagate drives (see https://serverfault.com/questions/313649/how-to-interpret-this-smartctl-smartmon-data/495259#495259). The SMART data does indeed look fine in this case.
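
To illustrate why those large raw values need not be alarming, here is a minimal Python sketch. It assumes the commonly described Seagate layout for attributes 1 and 7, in which the 48-bit raw value holds an error count in the upper 16 bits and an operation count in the lower 32 bits. That layout is not documented in this thread, and the helper name `split_seagate_raw` is made up for illustration.

```python
def split_seagate_raw(raw: int) -> tuple[int, int]:
    """Split a 48-bit SMART raw value into (error_count, operation_count).

    Assumption: upper 16 bits = number of errors,
    lower 32 bits = number of operations (reads or seeks).
    """
    errors = (raw >> 32) & 0xFFFF
    operations = raw & 0xFFFFFFFF
    return errors, operations


# Raw values taken from the smartctl output posted above.
for name, raw in [
    ("Raw_Read_Error_Rate", 243799542),
    ("Seek_Error_Rate", 456920846),
]:
    errors, operations = split_seagate_raw(raw)
    print(f"{name}: {errors} errors in {operations} operations")
```

Under that assumption, both values from the posted report decode to zero errors, because neither one exceeds 32 bits.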

Of course, you could still do some kind of full disk surface scan to be extra sure that the whole drive is really readable (as drives can also fail with no previously reported SMART errors).
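
As a rough illustration of what a read-only scan does, the following Python sketch reads a file or block device sequentially and records the offsets at which the operating system reports a read error. It is only a sketch: the function name `read_scan` and the chunk size are arbitrary choices, it relies on `os.pread` (available on Unix-like systems), and reads may be served from the page cache. A dedicated tool or the drive's extended self-test is the better choice for a real check.

```python
import os


def read_scan(path, chunk_size=1024 * 1024):
    """Read `path` sequentially in chunks.

    Returns (end_offset, bad_offsets), where bad_offsets lists the start
    offsets of chunks for which the read raised an OSError.
    """
    bad_offsets = []
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError:
                # Unreadable region: remember it and skip to the next chunk.
                bad_offsets.append(offset)
                offset += chunk_size
                continue
            if not data:
                break  # end of file or device reached
            offset += len(data)
    finally:
        os.close(fd)
    return end_offset_of(offset), bad_offsets


def end_offset_of(offset):
    # Trivial helper kept separate so the return value is clearly named.
    return offset
```

Scanning a whole device (for example `read_scan("/dev/sda")`) would need root privileges and can take hours on a 2 TB drive.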

collo · February 18, 2023, 2:25pm

I ran sudo smartctl -t offline /dev/sda a few hours ago, but I don’t know whether it’s finished. smartctl -a /dev/sda | grep -i error does not show an error:

					without error or no self-test has ever 
Error logging capability:        (0x01)	Error logging supported.
  1 Raw_Read_Error_Rate     0x000f   070   051   006    Pre-fail  Always       -       10570225
  7 Seek_Error_Rate         0x000f   087   060   045    Pre-fail  Always       -       458372132
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5621         -
# 3  Short offline       Completed without error       00%      5613         -

I also ran memtester 7Gi 10. The first two loops were OK; the third one produced a huge amount of output, which starts with:

Loop 3/10:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : ok         
  Walking Ones        : testing 120FAILURE: 0xffffffffffffff7f != 0xffffff7fffffff7f at offset 0x0000000070998020.
FAILURE: 0xffffffffffffff7f != 0xfffffff7ffffff7f at offset 0x0000000070998030.
FAILURE: 0xffffffffffffff7f != 0xffffffbfffffff7f at offset 0x0000000070998170.

(There are ~7000 more lines.)

I guess this really means my RAM is broken? If so, could you help me find out which module it is (I have four modules with 4 GiB each)?
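
One way to read the FAILURE lines above is to XOR the two values and see which bits differ. The Python sketch below does that for the three lines quoted in this post. The regular expression is based only on the output format shown here, and the function name `differing_bits` is made up for illustration.

```python
import re

# Pattern derived from the memtester output quoted in this thread.
FAILURE_RE = re.compile(
    r"FAILURE: (0x[0-9a-f]+) != (0x[0-9a-f]+) at offset (0x[0-9a-f]+)",
    re.IGNORECASE,
)


def differing_bits(line):
    """Return (offset, [bit positions that differ]) or None if no match."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    first, second, offset = (int(group, 16) for group in match.groups())
    diff = first ^ second  # a set bit marks a position where the values disagree
    bits = [i for i in range(64) if (diff >> i) & 1]
    return offset, bits


lines = [
    "Walking Ones        : testing 120FAILURE: 0xffffffffffffff7f != 0xffffff7fffffff7f at offset 0x0000000070998020.",
    "FAILURE: 0xffffffffffffff7f != 0xfffffff7ffffff7f at offset 0x0000000070998030.",
    "FAILURE: 0xffffffffffffff7f != 0xffffffbfffffff7f at offset 0x0000000070998170.",
]

for line in lines:
    offset, bits = differing_bits(line)
    print(f"offset {offset:#x}: differing bits {bits}")
```

For the three quoted lines, each pair differs in exactly one bit (bits 39, 35 and 38). Single-bit differences at nearby offsets fit a hardware memory fault, although this output alone does not identify the failing module.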

gadget · February 18, 2023, 3:14pm

If you run a lower level test (sans operating system) using Memtest86+ (https://memtest.org/) it’ll tell you which memory module(s).

If any of the tests fail, pop out the memory modules, reinsert them into the same slots, and repeat the round of tests.

If any of the tests still fail, and it’s only a single memory module, swap it with one of the other memory modules and rerun the tests.

If the same memory module – now in a different slot – still fails any of the tests, replace it.

collo · February 27, 2023, 5:26pm

Thank you very much for your help, it seems to really have been an issue with the RAM.

I installed memtest86+, ran sudo update-grub, ran the test with different combinations of memory slots filled, and found one module to be broken. I booted using only the others, removed the old ~/.config/syncthing, manually rebuilt my old setup, and everything seems to work fine now.

Again, thank you all for your help!