Syncing files with same name and modification time

I haven’t dipped into the code yet, but I thought I would bring up an issue I ran into. I’m not sure if this is a known issue, or intended behavior.

If I create two files with the same name at the same time on two machines, they never get synced. On machine A I create a file named x with the contents “a”, and on machine B I create a file named x with the contents “b”. The modification time is the same to the minute when I do an ls -l. When I sync, neither file gets copied to the other machine; they both just keep their original contents (a and b). When I increase the modification time of one file to be more than a minute later, that file wins out and syncs across both machines. I would expect Syncthing to always choose a winner to prevent files from getting out of sync.

Mmmyeah this is a bug. Syncthing isn’t super great at merging existing, different file trees, in that it’s nondeterministic which copy of a given file wins. Thing is, syncing is version based, where each modification to a file increases its version number, which is kept in global sync. In this case, with just one file in each repo discovered independently on each node, you end up in the special situation that both nodes have a file named x with version=1. So they’re in sync, nothing to do!

Ideally the initial sync/scan should be a bit smarter than this… :anguished:

I think that the only solution is to compare full timestamps on first sync.

Or use a hash as the version identifier or appended to the version perhaps?

I agree that using a hash would be necessary for 100% reliability.

Related: What is the resolution of the time stamps? OP mentions ‘to the minute’ but I assume ctime/mtime resolution of 1 second is used?

Yes, it’s one second resolution.

With a hash alone, it would be hard to determine which is newer. It should be something like X.Y, where X is the current-style version number and Y is the hash.

When syncing: if Y is equal, it’s the same file. Else, if X1 > X2, and the machine with X1 used to have a file with the same hash as X2, then sync X1 on top of X2 (because it’s an updated file). Else, if X1 > X2, but the machine with X1 never had a file with the same hash as X2, then these files diverged, and there’s a conflict.

Of course, to do this, machines need to keep a history of the hashes of synced files.
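A minimal Python sketch of the X.Y idea above. The names `FileVersion`, `make_version` and `compare` are hypothetical and do not come from Syncthing. The hash history is modelled as a plain set per machine, and treating "equal X but different Y" as a conflict is my assumption, since that case isn't spelled out above.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class FileVersion:
    counter: int  # X: the current-style version number
    digest: str   # Y: hash of the file contents


def make_version(counter: int, content: bytes) -> FileVersion:
    return FileVersion(counter, hashlib.sha256(content).hexdigest())


def compare(v1: FileVersion, history1: set, v2: FileVersion, history2: set) -> str:
    """Compare two copies of the same file name.

    history1 and history2 are the sets of hashes each machine has
    previously held for this file (the history mentioned above).
    """
    if v1.digest == v2.digest:
        return "same"
    if v1.counter > v2.counter and v2.digest in history1:
        return "take 1"  # copy 1 is an update of copy 2
    if v2.counter > v1.counter and v1.digest in history2:
        return "take 2"  # copy 2 is an update of copy 1
    return "conflict"    # diverged (assumption: also covers equal counters)


# The case from the original report: same name, same counter, different content.
a = make_version(1, b"a")
b = make_version(1, b"b")
print(compare(a, {a.digest}, b, {b.digest}))  # conflict
```

With this shape, the two independently created files no longer look identical, at the cost of storing a hash history per file.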

I always thought it was a Lamport clock…

It is. Problem is, that’s not really good enough to handle an initial sync between two devices that have the same files (name wise) and have not communicated before. Basically, what happens is

  • Device 1 starts up. Empty index, clock value = 0.

  • Device 1 sees someFile and adds it to the index. Tick the clock, the version of the file is 1.

  • Device 2 starts up. Empty index, clock value = 0.

  • Device 2 sees someFile and adds it to the index. Tick the clock, the version of the file is 1.

  • Device 1 and device 2 talk to each other. Both have someFile with version 1, so it’s all good, everything in sync. Never mind if the files are actually completely different in every aspect other than name…
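The steps above can be reproduced with a toy model. This is only a sketch: the `Device` class and `needs_sync` are hypothetical names, and the version is a bare integer as described in this post.

```python
class Device:
    def __init__(self, name):
        self.name = name
        self.clock = 0   # empty index, clock value = 0
        self.index = {}  # file name -> integer version

    def scan(self, filename):
        # Tick the clock and record the new value as the file's version.
        self.clock += 1
        self.index[filename] = self.clock


def needs_sync(d1, d2, filename):
    # Only the version numbers are compared, never the contents.
    return d1.index[filename] != d2.index[filename]


d1 = Device("device1")
d2 = Device("device2")
d1.scan("someFile")
d2.scan("someFile")
print(d1.index["someFile"], d2.index["someFile"])  # 1 1
print(needs_sync(d1, d2, "someFile"))              # False
```

Both devices arrive at version 1 on their own, so the comparison reports nothing to do even if the contents differ.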

A vector clock wouldn’t suffer from this, but it’s somewhat of a pain to manage when the length of the vector (the number of participating devices) varies, and it means a bunch more data to shuffle per index entry. But it’s worth looking into at some point.

Sorry, I know very little about this, but I always assumed that is not how Lamport clocks work?

  • Device 1 starts up. Empty index, clock value = 0.
  • Device 1 sees someFile and adds it to the index. Tick the clock, the version of the file is 1.
  • Device 2 starts up. Empty index, clock value = 0.
  • Device 2 sees someFile and adds it to the index. Tick the clock, the version of the file is 1.
  • Device 1 and device 2 talk to each other, Device 1 has version 1-device1id, Device 2 has version 1-device2id, versions don’t match, AND we have a conflict.
  • Let’s assume 1-device1id > 1-device2id.
  • Device 2 copies its 1-device2id version to filename.conflict, because the versions have the same counter but different device IDs.
  • Device 2 pulls file from Device 1 due to 1-device1id > 1-device2id, perhaps peeking at version 1-device2id for shared blocks.
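A sketch of the tie-break in the list above, with versions as `(counter, device_id)` tuples. The function name `resolve` and the IDs `"id-A"` and `"id-B"` are made up for illustration; the IDs are chosen so that Device 1's ID sorts higher, matching the "let's assume" step.

```python
def resolve(local, remote):
    """Decide what the local device does; versions are (counter, device_id)."""
    if local == remote:
        return "in sync"
    same_counter = local[0] == remote[0]
    if remote > local:  # tuple comparison: counter first, then device ID
        if same_counter:
            # Same counter, different device: independent changes.
            return "save conflict copy, then pull"
        return "pull"
    return "keep local"


device1_version = (1, "id-B")  # hypothetical ID that sorts higher
device2_version = (1, "id-A")

# Seen from Device 2: the remote version wins the tie.
print(resolve(device2_version, device1_version))  # save conflict copy, then pull
# Seen from Device 1: its own version wins.
print(resolve(device1_version, device2_version))  # keep local
```

Because the tuples are totally ordered, both devices agree on the winner without any further negotiation.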

That’s more of a vector clock.

I think it’s still a Lamport clock, just with total ordering? I remember reading http://web.stanford.edu/class/cs240/readings/lamport.pdf a few years ago, which is why I am making these assumptions.

A vector clock is an array of Lamport clocks, one per node, where each node is aware of every other node’s version. But since we have a clock per file, and changes to one file do not affect another file, it’s probably already vector-ish?

You’re most likely entirely right. That aspect of the Lamport clock hasn’t survived in this implementation at least; it would need to be a {device ID, integer} pair rather than just an integer as currently. In fact, that might be good enough for syncthing’s purpose in that it detects this situation and automatically resolves (but does not detect) editing conflicts.

How do you automatically resolve this situation? If you mean by copying a file to .conflict, then it partially solves the issue with editing too, doesn’t it? I guess you could have conflicting data loss if you increment the clock twice between the index exchanges… But it’s no worse (probably better, as the file actually becomes synced and a conflict file is produced) than silently assuming that the files are in sync when in reality they are in conflict.

The “resolving” in this case is only as any other sync; we have a full ordering so one file will necessarily win over the others. But there’s no way to detect it as a conflict as far as I can see, without going to full vector clocks.
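For contrast, a sketch of how a full vector clock separates "newer" from "concurrent". This is the textbook comparison, not anything from Syncthing's code; `compare_vectors` and the device names are hypothetical, and a clock is modelled as a dict from device ID to counter, with missing entries read as 0.

```python
def compare_vectors(a, b):
    """Compare two vector clocks given as {device_id: counter} dicts."""
    devices = set(a) | set(b)
    a_ahead = any(a.get(d, 0) > b.get(d, 0) for d in devices)
    b_ahead = any(b.get(d, 0) > a.get(d, 0) for d in devices)
    if a_ahead and b_ahead:
        return "concurrent"  # neither has seen the other's change: a conflict
    if a_ahead:
        return "a newer"
    if b_ahead:
        return "b newer"
    return "equal"


# Two devices that each changed the file without talking to each other.
print(compare_vectors({"device1": 1}, {"device2": 1}))                # concurrent
# Device 2 edited after having seen device 1's change.
print(compare_vectors({"device1": 1}, {"device1": 1, "device2": 1}))  # b newer
```

A totally ordered pair always names a winner, whereas the vector comparison can return "concurrent", which is the detection step that is missing.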

Actually, I think it detects editing conflicts too, as you cannot increment the clock twice between the rescans? And then if the index exchanges are done on every rescan, it should work? Maybe?

I guess it’s a bit too much for me to get my head around to work out why it wouldn’t work, but I’d welcome an example to understand the issue better.

The clock, currently, is global, and ticks on changes and receives. So when a file is changed, its version goes from whatever it had before to the current clock value, so like from 42 to 86247 (and then the clock is ticked). But going to a {clock_tick, device_id} tuple for the version field is an easy change and a bit better, as it at least does not allow simultaneous changes on two nodes to look like something that is already in sync.
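A rough sketch of that tuple version, assuming a hypothetical `GlobalClock` class. The stamp-then-tick order follows the description above. The receive rule shown is the standard Lamport one (jump past the largest value seen); that Syncthing does exactly this is an assumption on my part.

```python
class GlobalClock:
    def __init__(self, device_id):
        self.device_id = device_id
        self.value = 0

    def stamp(self):
        """Version for a local change: the current clock value, then tick."""
        version = (self.value, self.device_id)
        self.value += 1
        return version

    def receive(self, remote_counter):
        """Assumed Lamport rule on receive: move past anything seen so far."""
        self.value = max(self.value, remote_counter) + 1


a = GlobalClock("device-a")
b = GlobalClock("device-b")
va = a.stamp()
vb = b.stamp()
print(va, vb)    # (0, 'device-a') (0, 'device-b')
print(va == vb)  # False: simultaneous changes no longer look in sync

a.receive(86246)  # the clock jumps after hearing about a remote version
print(a.stamp())  # (86247, 'device-a')
```

The counters can still collide, but the device ID keeps two independent changes from comparing equal.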

How reliable is the clock as far as version comparison goes?

Well, this is an extremely complex problem and I have seen its effects with BTSync.

Basically, you can not rely upon a local clock on some node, simply because his clock is not necessarily synced to the global atomic clocks. Plus, you have to consider a time zone issue. I have seen up to 30% of all nodes differing in time by +/- one hour. I have also seen the nodes with different dates, not only hours, and nodes within +/- 10 minutes are quite common.

So, assume some node modifies some file. If file mtime is stamped in UTC format and its clock is synced to the atomic clock, then it is reliable. Otherwise, its file may look in the past or future, and, if no file hash is calculated to make sure that it is indeed a different file, then file mtime can not be trusted. So, the primary method of discovering the difference between the files on different nodes is the file hash.

Basically, unless you have some global reference of either clock or version, I do not see how you could resolve this conflict in a reliable way. And to have a global reference, you need to make sure that the reference node is always on line, 100% of the time, no matter what. Otherwise, the strategy may be to simply postpone the distribution and/or update of some specific file.

The problem with sync, especially in general public situations, is that you have no idea which nodes are on or off line, and even in private networks some of your devices may be off line, at least for some periods of time.

The way BTSync does it, for example, is that it only syncs files if the time difference between the nodes is less than 10 minutes (a default that can be changed). Otherwise, it simply refuses to sync ANY files, regardless of anything else. I do not know their exact logic, but I know for a fact that it has some bizarre and even destructive effects.

For example, I want to screw up your repo/share. ALL I have to do is to fully download it and then replace the contents of any files in the share with anything I want, including the truncation of the file to some one line bizarre string, and after that simply run a single line command to set the mtime on all the files somewhere in the future. As soon as I do that, I can destroy any cluster with total garbage, and that could be an immense amount of information, and it isn’t even clear how do you restore the healthy state of the cluster. Which node and file version do you trust?

That is why I proposed a while back a reference node and a master node concept. In case of doubt, only the reference node can be trusted to resolve the conflicts. The master node means that even if the reference node is off line, the master node is guaranteed to have an exact copy of ALL the files. The whole node is exactly the same as the reference node.

Furthermore, when you want some or any of nodes to be able to update others, behaving like master nodes and having the write permission to other nodes, then things get even hairier than that.

And it is much easier to assure the presence of the master nodes because there may be several of them on line. The question becomes how you pass the master node status for some file from the reference node, or whether you assign a permanent master node status regardless of any specific files. This may also turn out to be less simple than it looks once you consider all sorts of issues.

So, basically, the r/o nodes should not be allowed to update the master nodes or the reference nodes under ANY circumstances whatsoever. Otherwise, you do not have full control over the repo and things may and are likely to go wacky at some point.

You misunderstood: when we say clock, we mean a logical Lamport clock, not a physical clock.

God, have mercy on me!!! :smile:

Well, what can I say? You can just remove my message if you think it is from a different joke. At this point, I’d be curious to see how a Lamport clock relates to Syncthing.

The first issue is a notion and definition of an event.

As far as sync process goes, the events are: file update, delete, rename, resurrect. Am I missing something here?

The most critical of those is file update. Is this right?

OK. So, I have the entire repo/share on my box and, while I am off line, I update some file in any way I please.

Then I come back on line and all the nodes that are on line do what with what I have?

Which version is “before” which? Which file is more “valid” than which? Which action takes priority over which? WHO is in “control” of the situation?

Some Lamport clock?

On a purely intuitive level, it seems to me that you can attach not only the PID, but a coconut to it, and it still won’t change the logical consequences of a file operation one bit. Because, as far as I can see, the events in a Lamport clock are considered to be something “objective”, something “valid”. But is that the issue with file versions and updates?

Sorry, after all I have done today, I am not quite “up to snuff” to deal with this issue. I am afraid it will crack my skull.

All I can say at the moment is that, yes, I do tend to think of a version as an incremental update rather than something absolute, like a time stamp. But my brain at the moment can not comprehend the mapping between the events and their LOGICAL meaning and significance.

In other words, this particular issue is too big for the size of my skull, at least at this juncture. Sorry to interrupt your line of thought here.