Epic mixup of files after disk restoration

[Note: This was originally submitted as a bug report, but Audrius was quick to close it and suggested that the support forum would be a better place for it. Thus this post. However, please note that I have already resolved the mess that Syncthing created, so I’m not actually seeking any help with my setup. I’m just posting this as a record of how things can go wrong – very wrong indeed – after the replacement of a disk containing Syncthing’s database and part of its synced dataset. The replacement and the restoration from backup completed without any reported errors, and I have not noticed any problems with any other services or files on my server, so I see no reason to believe that Syncthing’s database somehow got corrupted during the restoration. Rather, it seems that the change of disks itself confused it mightily, or perhaps the database was in an incomplete state when it got backed up. If anyone has some better theories, I’m all ears. Thanks.]

This is a report of how Syncthing managed to mix up all the synced files on one of my servers and spread them to every other folder to every other instance that it was syncing with. Yes, really!

The main SSD of Mega, my Mac mini server, failed on me a few weeks ago. Fortunately, it was under warranty, so I recently received a replacement SSD and restored the contents from backup. After the restoration was complete and the server was rebooted, Syncthing started up on Mega and quickly ran out of open files. Strange, I thought, it had never done that before. Still, I just upped the maximum number of open files from 256 to 1024 and let it continue.
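As an aside, the per-process open-file limit can also be inspected and raised programmatically. A minimal Python sketch (illustrative only – this is not how Syncthing or I did it, I just raised the limit in the shell before launching it):

```python
import resource

# Inspect the current per-process open-file limits (soft and hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit to 1024, capped by the hard limit; only a
# privileged process may raise the hard limit itself.
new_soft = 1024 if hard == resource.RLIM_INFINITY else min(1024, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

print(resource.getrlimit(resource.RLIMIT_NOFILE))
```

Note that this only affects the current process and its children; a limit set this way does not persist across reboots.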

The next day, I picked up Neo, my MacBook Pro laptop, and noticed that it was out of disk space. Huh, what had happened? A closer inspection showed that every folder that was being synced with Mega was full of files – in fact, every synced folder seemed to have a copy of all the files from all the folders that Mega was syncing with any other host!

That is, if Mega had folders A, B, and C, and Neo was syncing A and B with Mega, Neo’s A and B folders now both appeared to contain all the files from A, B, and C all mixed together. The same was true for the A folders on all the other hosts that were syncing with Mega and/or Neo.

Not only has this created a mess of epic proportions, it’s also a giant security hole, since the private files in B have now been distributed to a whole slew of public hosts that were only supposed to have the A files.

And as if that wasn’t bad enough, when I frantically started deleting the misplaced files and folders, Syncthing started putting them back again!

That is, when I started deleting the B files from the A folder, Syncthing promptly recreated them again! At first I thought I had made a mistake when deleting them, but when I deleted them again and listed the files in the directory, they were clearly gone – only to reappear a few seconds later.

To be fair, it seems to mostly(? only?) be the subdirectories in which the files reside, not necessarily the files themselves – I wasn’t paying full attention when it first happened, and now when I’m trying to reproduce it, I’m only seeing (empty) subdirectories being recreated. Looking at syncthing.log, I see a lot of:

[SNPSJ] 12:24:12 INFO: Puller (folder A, item “bsubdir/bfile”): no connected device has the required version of this file

This seems to cause the recreation of bsubdir.

The only solution I’ve been able to come up with is to delete all Syncthing configurations on all hosts and recreate them from scratch as just rescanning the folders seems to make no change. I’ve put aside a copy of the config files on Mega together with syncthing.log. I had a quick look at syncthing.log from after the restoration, but I couldn’t find anything obvious in it except for a whole bunch of “too many open files” errors. I can make them available if it would help, but I would first need to anonymize the contents.

Syncthing Version v1.1.4 running under macOS 10.14.5 + Ubuntu 16.04.5 LTS & 18.04.2 LTS.

I didn’t read the rest fully, and I certainly don’t know what happened to you, precisely. I just want to note for posterity that restoring the database to some older point in time isn’t something you can expect to work, in general.

To begin with you need to be very sure that the dataset you restored matches the corresponding database state at the same time, precisely. If it does not, the changes from the database-described state can only be assumed to have happened since last scan. That is, if you restore the database from time T and the files from time T-1 and a file was created in that interval (and recorded in some later database version), it will look like it was deleted. (It’s present in the database but not the files on disk.)
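That mismatch can be sketched as a toy model (the function and names here are illustrative, not Syncthing’s actual scan logic): the restored database “knows” about a file that is no longer on the restored disk, so a rescan can only conclude that the file was deleted locally.

```python
def rescan(db_index: set, files_on_disk: set):
    """Compare database state against disk and report apparent changes."""
    apparently_deleted = db_index - files_on_disk  # in db, not on disk
    apparently_new = files_on_disk - db_index      # on disk, not in db
    return apparently_deleted, apparently_new

# Database restored from time T, files restored from time T-1; one file
# was created (and recorded) in the interval between the two.
db_at_T = {"a.txt", "b.txt", "created_in_the_interval.txt"}
disk_at_T_minus_1 = {"a.txt", "b.txt"}

deleted, new = rescan(db_at_T, disk_at_T_minus_1)
print(deleted)  # the file created in the interval looks locally deleted
```

The scan has no way to distinguish “this file was deleted by the user” from “the disk is older than the database”, so the apparent deletion gets announced to peers as a real one.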

Even if your files and your database match exactly from the same point in time … the second issue is that Syncthing doesn’t send the whole database every time it connects to another device. Instead we only send changes that have happened since last connect, based on an in-database sequence number. If you rewind that number there are now a bunch of changes you’ve sent to other devices but are no longer aware of yourself. Those peers will think you have a different set of files than you actually do. Changes you make now will also get assigned “old” sequence numbers and might not get seen by peers, because they already have the corresponding updates (they think). (There is code in place to handle this situation somewhat, but it’s not foolproof, and can’t easily be.)
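The rewind problem can be sketched with a toy model too (again, illustrative names only, not Syncthing’s actual protocol code): a peer remembers the highest sequence number it has received, so changes recorded after a rewind reuse “old” numbers and are never requested.

```python
class Device:
    """Toy device that hands out an increasing sequence number per change."""

    def __init__(self):
        self.seq = 0
        self.changes = []  # list of (sequence, filename)

    def record_change(self, name):
        self.seq += 1
        self.changes.append((self.seq, name))

    def changes_since(self, peer_seq):
        # A peer only asks for changes newer than what it already has.
        return [c for c in self.changes if c[0] > peer_seq]

dev = Device()
dev.record_change("file1")
dev.record_change("file2")
peer_last_seen = 2  # the peer has received everything up to sequence 2

# Database restored from an older backup: counter rewinds, history shrinks.
dev.seq = 0
dev.changes = []

dev.record_change("file3")  # new change gets the "old" sequence number 1
print(dev.changes_since(peer_last_seen))  # [] - the peer never asks for it
```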

This is also a situation we don’t test for. At all. So there might be further dragons lurking.

What you described I would classify as impossible. The only way Syncthing would merge the folders is if you fat-fingered the paths in the config, fat-fingered the IDs in the config, swapped the physical location of the folders on disk, or the path now suddenly points at a parent folder which holds A, B and C and you are accidentally syncing that to everyone (but that would actually change the structure and duplicate files).

Sure, I don’t doubt you that this happened, and it sounds terrible, but I am not sure where we go from here.

I guess logs from all sides during this would be a start, yet I suspect we might need the databases to make more sense of it, which I don’t think you can anonymise.

But in general what @calmh said is right, restoring a database is much more risky than simply rebuilding it from scratch. This is not an advised action, so I am not sure why you decided it’s a good idea to do that.

Thank you for your response, Jakob.

The backup should have been reasonably fresh, as they’re made automatically every hour on my server courtesy of Time Machine. During the (up to an) hour between the last backup and the disk failure, it is possible that a small number of files got changed in one of the synced folders, but that folder was on the same disk that failed, so I think I can state for a fact that that folder’s state was brought back consistent with Syncthing’s database. The other synced folders should all have been stable, as I believe no changes were made to them during this time period. If I’m wrong and there were changes, they would have been very few and minor, so I find it hard to connect this with the massive mixup that resulted after the restoration.

I don’t know what Syncthing’s database looks like internally, but if you store things like disk UUIDs, then that would definitely have changed after the restore. Could that possibly have caused this confusion?

You may classify it as impossible, Audrius, yet it happened.

Let me assure you that I made no changes to Syncthing’s configuration, either immediately before the disk failure or after the new disk was put into service. Nothing happened to the other disks during the server’s downtime either.

The faulty disk in question simply stopped working at one point, and since it was the root disk for the server, the server panicked and stopped.

When I got the replacement disk several weeks later, I restored the files and rebooted the server with all disks at the same locations as before. Then I left it alone for the next ~24 hours doing other things before I discovered the complete mess that had been created in my absence.

It took me a good day or two to sort the mess out putting all the right files back in the right folders. Fortunately, I don’t think there had been any unexpected collisions or deletions to speak of.

So was the disk restored as a whole (including config, database and data) from the broken disk, or was the data restored from the disk and the config from Time Machine? What’s the relation between the two?

I was under the assumption that the disk image was fully restored, with everything resident on it.

The replacement disk was restored from the backup. The broken disk was unreadable and could not be used for anything.

The original and the replacement disks had the same size and manufacturer, and the complete disk contents were restored, so the layout and contents of the file system should have been the same before and after the restoration.

However, I’m pretty sure that the restoration was done on the file level, not block level, so it’s possible that things like inode numbers could have changed. For sure the disk UUID must have changed and quite possibly the file system fsid too, I’m not sure.

We do not use disk UUIDs, so it’s not that. I still don’t understand where Time Machine comes in here.

Time Machine aka backupd is the name of the macOS backup service.

It runs at regular intervals and makes full file system backups onto an external disk. With APFS on the source disk (which I was using), an atomic snapshot of the file system is made before the backup runs so that the backup’s state is guaranteed to be internally consistent.
