Oh no! :( big whoops

My first REAL issue with Syncthing since November when we began using it.

Today we had a user whose device hadn’t connected to any other device since January and was still running v0.10.13. He came into the office and I took a look at his machine. The database was corrupt, so I deleted the index folder, and after re-indexing I performed an upgrade.

During the few minutes that passed before I clicked the upgrade button, it was silently replacing nearly every single file in the Global list with his outdated versions that were OLDER, two months older in fact (which ties in with when all the other devices last saw this device).

Caused a massive headache, but I managed to get all files restored back up to yesterday evening’s state.

I know there are no bug fixes that would have solved this issue, hence this report.

Any ideas, @calmh?

There is a bug about conflict handling which might relate to this, but to be honest, I am not exactly sure why this happened.

If you scrapped the index, all files should have been version 0, and that is definitely older than the last version in the global list?

I guess as a future precaution, you can temporarily enable the master folder option on one of the online nodes.

Hm… I think this is a good reminder that we are still in the experimental phase, so don’t rely on Syncthing as your one and only backup.

Rewt0r – A similar crash happened to me as well, but it occurred when I had Syncthing writing to a laggy WebDAV share through a FUSE driver of questionable quality… Versioning was not enabled. Basically I think it had to do with the network “going out underneath” Syncthing without it terminating the TCP connection, which also froze the mount. I’m not really sure what happened. Anyway, it’s probably a good idea to have a cron script, rsync, or Duplicati take a snapshot of your Syncthing data once a day.

Audrius, is there any log file he can post that tells what happened? Or maybe there should be an error-reporting feature that traps exceptions and sends them (and all internal state information) to syncthing.net. Kind of like “Program has crashed, send info to Firefox?” Just a thought anyway. Probably more important things to do.

I don’t think any exception happened, as there are none in Go. I would be interested in seeing an integration test that reproduces this, or at least a set of steps which reproduces this with 100% probability.

Yeah, I know what you mean. Last place I worked I eventually put in a button for QA to press when they found a bug (the button took a .sql and memory snapshot and sent it to me).

We’ve just hit this again today, luckily I’ve put an extensive backup procedure in place now.

Another user had a corrupt index, with Syncthing complaining about an incorrect CURRENT file. He was running v0.10.26. Upon deleting the index directory it began re-indexing and replacing files with versions from 02/04/2015, even where the existing copies were newer (08/04/2015).

Can this be investigated further, @calmh @AudriusButkevicius?

The steps seem simple, as they were identical both times: corrupt the CURRENT file through a non-standard shutdown or other means, leave the machine offline for a while but keep modifying files on it, delete the index folder, and wait…

So the beta has vector clocks, which should solve conflicts as such. Can you actually reproduce it manually so that we can verify that it’s fixed by vector clocks? I don’t think you need to corrupt the index to reproduce this; just make modifications while the device is offline, then zap the index.

Would this be https://github.com/syncthing/syncthing/issues/1022 ?

Sounds like it…

I will test over the weekend @AudriusButkevicius

Probably yes. Up to v0.10 the versioning uses a simple change counter, kept in the index. If two files differ, the highest change number wins. When you erase the index, you reset the change counter back to zero. So your new, reindexed node will have “older” versions of all files than the rest of the cluster, regardless of their modification time.
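For illustration, here is a minimal Go sketch of that counter-based model. The fileEntry type and newerWins function are made up for this example; they just encode the “highest change number wins” rule described above, not Syncthing’s actual code:

```go
package main

import "fmt"

// fileEntry is a hypothetical, minimal model of a pre-v0.11 index entry:
// a single change counter rather than a per-device version vector.
type fileEntry struct {
	name    string
	version uint64 // global change counter; the highest value wins
}

// newerWins resolves a difference the way v0.10 did: whichever copy carries
// the higher change number is kept. Modification time is never consulted.
func newerWins(a, b fileEntry) fileEntry {
	if a.version >= b.version {
		return a
	}
	return b
}

func main() {
	// The cluster's copy has accumulated many changes over two months.
	clusterCopy := fileEntry{name: "report.txt", version: 42}

	// After the index is deleted the counter restarts, so the freshly
	// rescanned local copy reappears with a very low version number.
	rescannedCopy := fileEntry{name: "report.txt", version: 1}

	winner := newerWins(clusterCopy, rescannedCopy)
	fmt.Printf("kept %s at version %d\n", winner.name, winner.version)
}
```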

I’m not sure the version vectors in v0.11 will make a difference here though. They work like vector clocks so they detect conflicts, but resetting the index kind of takes that out of action.

I.e. assume you have a two-device cluster, with devices A and B. A given file might have a version vector {A: 10, B: 20}. If each device makes a change to that file in parallel, A will get the vector {A: 11, B: 20} and B will get {A: 10, B: 21}. These are in conflict, which we now detect and handle. Nice!

But if you reset the index on B and rescan, you will have A with {A: 11, B: 20} (the current version from above) and B with {A: 0, B: 1} (a new file with no history, discovered on B). There is no conflict here, A simply has a higher version, so the file on B will be overwritten.
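To make the comparison rule concrete, here is a toy Go sketch of a version vector and the dominates/conflict check described above. The Vector type and Compare function are simplified stand-ins, not Syncthing’s real implementation:

```go
package main

import "fmt"

// Vector is a toy version vector: one change counter per device ID.
type Vector map[string]uint64

type Ordering int

const (
	Equal      Ordering = iota // identical vectors
	Greater                    // a strictly dominates b
	Lesser                     // b strictly dominates a
	Concurrent                 // neither dominates: a genuine conflict
)

var orderingNames = [...]string{"Equal", "Greater", "Lesser", "Concurrent"}

// Compare decides whether a dominates b, b dominates a, or they conflict.
// Missing device IDs count as zero.
func Compare(a, b Vector) Ordering {
	ids := map[string]bool{}
	for id := range a {
		ids[id] = true
	}
	for id := range b {
		ids[id] = true
	}
	aBigger, bBigger := false, false
	for id := range ids {
		switch {
		case a[id] > b[id]:
			aBigger = true
		case a[id] < b[id]:
			bBigger = true
		}
	}
	switch {
	case aBigger && bBigger:
		return Concurrent
	case aBigger:
		return Greater
	case bBigger:
		return Lesser
	default:
		return Equal
	}
}

func main() {
	// Parallel edits on A and B: neither vector dominates, so it's a conflict.
	fmt.Println(orderingNames[Compare(
		Vector{"A": 11, "B": 20}, Vector{"A": 10, "B": 21})]) // Concurrent

	// Index reset on B: the rescanned file's vector is strictly smaller,
	// so no conflict is raised and B's copy is silently overwritten.
	fmt.Println(orderingNames[Compare(
		Vector{"A": 11, "B": 20}, Vector{"A": 0, "B": 1})]) // Greater
}
```

Running it prints Concurrent for the parallel-edit case and Greater for the index-reset case, matching the two examples above.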

Resetting the index on a device that is not in sync with the cluster is just inherently unsafe…

The alternative would be to let the cluster know that the index on B has been reset. That would make A change their vector from {A: 11, B: 20} to {A: 11, B: 0}. When B then announces {A: 0, B: 1} there is a conflict and it’ll be handled.
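A rough sketch of that bookkeeping, reusing the same toy Vector type as before; forgetDevice is a hypothetical helper, nothing that exists in Syncthing today:

```go
package main

import "fmt"

// Vector is the same toy version vector as in the previous sketch.
type Vector map[string]uint64

// forgetDevice models the proposal above: when the cluster learns that a
// device's index was reset, every other device drops that device's component,
// effectively treating it as zero from then on.
func forgetDevice(v Vector, id string) Vector {
	out := Vector{}
	for k, n := range v {
		if k != id {
			out[k] = n
		}
	}
	return out
}

func main() {
	onA := Vector{"A": 11, "B": 20}
	onA = forgetDevice(onA, "B") // now {A: 11}, i.e. B counts as 0

	// When B later announces {A: 0, B: 1}, each side is ahead on its own
	// counter, so the vectors conflict and the usual conflict handling
	// kicks in instead of silently overwriting B's files.
	fmt.Println("A has", onA, "and B announces", Vector{"A": 0, "B": 1})
}
```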

An easy but potentially annoying and labour-intensive way of accomplishing this is to erase the keys on B when the index is reset, so it gets a new device ID… We could maybe also add something so that other nodes detect that the index announced by B is different from the B we knew before (an index creation time, say, or a unique index ID generated at creation time), so that all existing information about B (in the version vectors) is discarded… Or the actual ID used for B in the version vectors could be derived at index creation time…

(You may notice from the above that whenever a device joins a cluster and already has files on disk, every file will be in conflict with the cluster. This is why you’ll see it as “Syncing” for a while thereafter, even if the contents are identical. The “conflicts” are handled without creating copies of the files though, as long as the conflict is just in metadata - we don’t create conflict copies of files if just the timestamp differs for example.)

I have an idea about solving this.

@calmh Can you remind us why the file with the newest modification time doesn’t win?

  • Either you reset:
  • Or you start from scratch; then both have B: 0 and (at the moment) there is no modification-time comparison between them.

And as for why the newest modification time doesn’t win: because it’s valid for it to move backwards, for example if I overwrite a file by unpacking a previous version from a zip file or something.

There’s some code now on that issue to make index resets somewhat safer.

The only thing I don’t understand in this case is how, after resetting the index, the device with a now-empty index manages to get a higher version of the file than the one in the global state?

Yeah, that’s odd. It could happen if the files in the rest of the cluster weren’t changed too much, and some new files were added on the broken-index device. If so, it’s possible the scenario was something like:

Cluster state (file: version):

file1: 5
file2: 6
file3: 7

Broken dude’s file list:

a1 a2 a3 a4 a5 file1 file2 file3

Reset and initial scan gives them the following version numbers (as the version is taken from the “global” lamport clock, which is reset with the index…):

a1: 1
a2: 2
a3: 3
a4: 4
a5: 5
file1: 6
file2: 7
file3: 8

Hey look, each of our file1, file2 and file3 is newer than its counterpart in the cluster! Overwrite! Maybe a more likely scenario is something like:

dirWith1000files: …
file1: 1001
file2: 1002
file3: 1003
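Here’s a small Go sketch of that arithmetic, assuming a single Lamport-style counter that lives in the index and therefore restarts at zero when the index is deleted; the names and numbers are just the ones from the example above:

```go
package main

import "fmt"

// lamportClock models the single global change counter used up to v0.10.
// It is stored in the index, so deleting the index resets it to zero.
type lamportClock struct{ now uint64 }

func (c *lamportClock) tick() uint64 {
	c.now++
	return c.now
}

func main() {
	// Versions the rest of the cluster agrees on.
	clusterVersions := map[string]uint64{"file1": 5, "file2": 6, "file3": 7}

	// Files on the broken-index device: five extra files plus the three
	// shared ones. After the reset, the clock starts counting from zero.
	localFiles := []string{"a1", "a2", "a3", "a4", "a5", "file1", "file2", "file3"}

	clock := &lamportClock{}
	rescanned := map[string]uint64{}
	for _, name := range localFiles {
		rescanned[name] = clock.tick()
	}

	// file1..file3 now carry versions 6..8, beating the cluster's 5..7, so
	// the stale local copies look "newer" and win. With a big directory in
	// front of them (dirWith1000files above) the gap only gets wider.
	for _, name := range []string{"file1", "file2", "file3"} {
		fmt.Printf("%s: local version %d vs cluster version %d\n",
			name, rescanned[name], clusterVersions[name])
	}
}
```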

Yes, this is stupid.

Yes, that’s why it doesn’t work like that any more.


Just want to say thank you @calmh @AudriusButkevicius for working on this problem, and fixing it!! Glad to see the progress.
