Recovering from near catastrophe


(John Statler) #1

I have two questions related to a recent near catastrophe at our office. First, the story of the catastrophe.

Most of the office’s working files are on a Linux server under the directory “Files”. That directory contains sub-folders that are synced with Syncthing to home computers. Those synced sub-folders hold approximately 120,000 items, totaling about 500 GB.

Someone, let us hope accidentally, moved “Files” into another folder.

That move, as you can imagine, caused havoc not only in the office (“Where did all the files go?”) but also all along the Syncthing chain.

Moving the “Files” folder back where it belonged took only seconds once we found out what had happened to it. Diagnosing the problem took about an hour. During that hour I had started a restore of “Files” from a backup disk, and you can probably imagine how many hours that was going to take.

Unfortunately, the syncing process had begun on the home computers when “Files” was moved, then the files from the disk backup came into play, then an hour later the home computers saw “Files” back where it had been moved from and started doing whatever Syncthing saw as appropriate when faced with that mess.

And what a mess. The home computers started changing the files on the Linux server. I don’t even know what all happened but I do know that files disappeared from the “Files” folder that had been moved back (I guess because they were missing on the home computers?).

Once the office workers saw files in the “Files” folder that had been moved back into place, I thought to turn off Syncthing on the Linux server. Whew, that stopped the flood of deletions, and I was able to use the 10-minute scheduled server backups to recover the missing files.

So now Syncthing is turned off at the server, and the home workers have to use mapped drives directly to the office server, which is tedious when working with 100 MB PDF files. But they are toughing it out.

I have two questions:

What should I have done when the “Files” directory was found to have been moved?

What should I do now to get the home computers in sync with the office server?

Thanks in advance for any advice.

Great product. John in Oregon.


(Jakob Borg) #2

I’m guessing that the delete plus the ongoing restore and the simultaneous move back caused a bunch of conflicts etc., and that some of this is cumbersome to resolve because the files are large and operations take a long time. Let it sync, restore the files in one place, let that sync, and you should be good again.


(Simon) #3

Edit: It’s more or less what Jakob said: At that point no action was yet required, as Syncthing had detected that the subdirectories in “Files” were missing and thus stopped syncing (i.e. nothing was deleted from the home computers).
However, before starting to recover from backups, you should have stopped Syncthing. Because once the Syncthing root directories and their “.stfolder” marker directories were recovered from backup and Syncthing started scanning, it saw the only partially restored folders -> deletes started propagating.

Restore on the server. Then, as the mess has already started propagating, start Syncthing and, once it has finished syncing, stop Syncthing again. Then restore the server again, just to be sure to get rid of conflicts and recreate anything that was deleted during the sync.
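Since the deletions came from Syncthing scanning a half-restored tree, it may help to confirm the restore is complete before starting Syncthing again. Below is a minimal sketch of such a check, not anything Syncthing provides: it lists files that exist in the backup but not yet in the live tree. The function names and both paths are hypothetical, and it compares names only, not contents.

```python
import os


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def missing_from_live(backup_root, live_root):
    """List files present in the backup but absent from the live tree."""
    return sorted(relative_files(backup_root) - relative_files(live_root))


if __name__ == "__main__":
    # Hypothetical locations; substitute the real backup and "Files" paths.
    backup = "/mnt/backup/Files"
    live = "/srv/Files"
    if os.path.isdir(backup) and os.path.isdir(live):
        for path in missing_from_live(backup, live):
            print("still missing:", path)
```

An empty result only says every backed-up file name is back in place; it would be one signal, alongside the restore job's own exit status, that it is reasonable to start Syncthing.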


(uok) #4

a) what @calmh said
b) what @imsodin said
c) rethink your backup system - if the data on the server is valuable for the office but a single moved folder causes all home workers to be out of work then you should use snapshots, etc. on the server for quick restore.


(jHeScs7PRmAmJNnw) #5

Daily, or even better, hourly snapshots are a must for every filesystem hosting important live data, especially one where files are managed by a tool such as Syncthing. Snapper is your friend. However, this does NOT replace taking regular backups.


#6

One other thing I’d suggest is setting a known good copy (the server?) to “Send only” ahead of resuming sync and forcing overwrite of remote conflicts. This is of course with the assumption that the remote users would not have updated the files in the meantime.


#7

What would be nice here is if Syncthing were able to recover files from the remote clients’ trash, if one exists. That would use almost no network bandwidth. Also, on the server side, a second folder mirroring the working one, but shared with nobody, would do this job quickly if the remotes have a trash: when files are restored from the trash on the remotes, the server would pick the files locally from the mirrored folder to repopulate the working share.


(John Statler) #8

A couple of takeaways and some further questions

  • Snapshots – the Linux server is Ext4. I have to update the OS from the current Linux Mint 17.1 Rebecca, released around June 2014. I know that’s not a server edition but it works for us, though it is getting long in the tooth. After I update I might as well change from Ext4 to Btrfs and install Snapper as suggested. I will keep my backups as well.

  • Syncthing on home computers – I’m tempted right now to have the home computers brought into the office so I can basically start from scratch on them, rather than risk any further disruption for the staff. A few of the home computers have slow Internet connections because they are out in the countryside. – Unless you think that making the server a master would clean up the remote computers?

  • Preventing it from happening again – I will be locking down the primary folders so only root or particular admins can move/delete them while leaving internal files/folders manageable by users.
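
For the lock-down point, a rough sketch of how the check could look, under the usual POSIX rule that renaming or deleting a directory depends on write permission on its parent, not on the directory itself. If the parent has the sticky bit set, only the entry's owner, the parent's owner, or root can move it, so “Files” would also need to be owned by root. The function names and the path are hypothetical, and ACLs and the execute bit are ignored here.

```python
import os
import stat


def others_can_move_children(parent_mode):
    """True if non-owners could rename or delete entries in the parent.

    Simplified: looks only at group/other write bits and the sticky bit.
    """
    writable = bool(parent_mode & (stat.S_IWGRP | stat.S_IWOTH))
    sticky = bool(parent_mode & stat.S_ISVTX)
    return writable and not sticky


def folder_is_exposed(path):
    """Check the parent directory of path on the real filesystem."""
    parent = os.path.dirname(os.path.abspath(path))
    return others_can_move_children(os.stat(parent).st_mode)


if __name__ == "__main__":
    # Hypothetical location of the shared folder.
    target = "/srv/Files"
    if os.path.exists(target):
        print(target, "can be moved by non-owners:", folder_is_exposed(target))
```

Making the parent root-owned with mode 755 while leaving the sub-folders group-writable would keep the contents manageable by users but stop anyone except root from moving the top-level folder.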

How does that sound? Any additional comments on what I am planning?

Thanks again. John in Oregon


(Brian) #9

I don’t think anybody else named it, so I’ll toss this into the ring. I use rsnapshot in parallel with the “server” copy of the Files directory. The rsync toolchain has been around forever and lets me keep a copy from any filesystem onto different volumes or drives. The only limit is that the destination needs to support hard links (e.g. FAT32 can’t be used, because it doesn’t do hard links).
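
To show why hard links matter here, below is a small sketch of the idea behind hard-link snapshots. It is not how rsnapshot itself is implemented (that relies on rsync), and the function names are made up for illustration. A file that looks unchanged since the previous snapshot is hard-linked instead of copied, so it takes no extra space; a changed file is copied fresh. "Unchanged" is assumed to mean same size and same modification time.

```python
import os
import shutil


def unchanged(src, old):
    """Treat a file as unchanged if size and whole-second mtime match."""
    if not os.path.isfile(old):
        return False
    a, b = os.stat(src), os.stat(old)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)


def take_snapshot(source, dest, previous=None):
    """Copy source into dest, hard-linking unchanged files from previous."""
    for dirpath, _dirnames, filenames in os.walk(source):
        rel_dir = os.path.relpath(dirpath, source)
        os.makedirs(os.path.normpath(os.path.join(dest, rel_dir)), exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            new = os.path.normpath(os.path.join(dest, rel_dir, name))
            old = os.path.join(previous, rel_dir, name) if previous else None
            if old and unchanged(src, old):
                os.link(old, new)       # same inode, no extra disk space
            else:
                shutil.copy2(src, new)  # fresh copy, keeps the mtime
```

Each snapshot directory then looks like a full copy of the tree, while only changed files cost disk space. The `os.link` call is the step that fails on a filesystem without hard-link support.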


(system) #10

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.