Proposal for Revision of Filesystem Transfer and Sync

I really think the solution here is to change the way the filesystem is modeled and exchanged. I understand this is a pain in the neck and a lot of work, but I think something along the lines of what I’m describing is the best long-term solution, one that is ‘future proof’.

Realistically it would take a few months to overhaul, so I’m not sure whether this is a realistic expectation. I’m going to describe it anyway, even though I’m not clear on whether it’s feasible given resources and constraints. But I do think it would solve a bunch of ‘structural’ issues in the long term.

Forgive me if some of this is already implemented – I’m Go-illiterate at the moment, so I can’t tell what’s currently in production.


So a filesystem is a tree, right? It can be expressed as a tree abstract data type where each node is either a file or a folder, and each node can carry a variety of attributes/metadata if necessary.

I’m really just proposing a Merkle tree or Tiger tree, something along those lines, customized for our own needs.

Hash trees are used in the ZFS filesystem, the BitTorrent protocol, the Apache Wave protocol, the Git distributed revision control system, the Tahoe-LAFS backup system, the Bitcoin peer-to-peer network, and a number of NoSQL systems like Apache Cassandra and Riak.

Tiger tree hashes are used in the Gnutella, Gnutella2, and Direct Connect P2P file sharing protocols and in file sharing applications such as Phex, BearShare, LimeWire, Shareaza, DC++ and Valknut.

Anyway – a folder might contain its name as metadata, at the most basic level.


Now here’s what I propose:

(1) Model any sync’d filesystem as a custom tree with files and folders.

Example of Filesystem as Tree (Files and Folders)


(2) Ensure that every node in the tree has a hash value which is dependent upon its own attributes and the attributes of all its children to an arbitrary depth. In other words, a Merkle-tree-like data structure.


(2.0) Permit efficient serialization and de-serialization of the tree for network transport.

**Serialization of Filesystem Tree to a Serialized Object Suitable for Network Transport**


(2a). Elaboration (file as a leaf node):

- Let Fi be a File.
- Let Fi.data be the file’s data.
- Let Fi.len be the file’s length in bytes.
- Let Fi.name be the filename.
- Let Fi.modified be a file modification timestamp.
- Let HASH(Fi) be a hash of a File for security and versioning purposes.
- Let HASH be (minimally) a CRC64 checksum or other fast hashing algorithm, but preferably a cryptographically secure hash of the form SHA1, SHA2, SHA3, RIPEMD, or otherwise.
- Let HASH preferably be a fast calculation with collision resistance.
- Let HASH be SHA1 or SHA512 as the best compromise between speed and security.
- Let HASH(Fi) be a message digest of the file and (optionally) its metadata properties.
- For example, HASH(Fi) = SHA1(Fi.data + Fi.len + Fi.name + Fi.modified), where ‘+’ is a concatenation operation of any form, or optionally and less preferably an XOR operation.


We can re-run the hash ONLY when the modified timestamp Fi.modified changes and/or when Fi.name changes, so we don’t chew up the CPU monitoring the filesystem.

Okay, so the above ensures that each file gets a hash value that reflects its state at a given time. This lets us detect changes.
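
As a rough Java sketch of (2a), purely illustrative (the `FileLeafHash` class and method names are hypothetical, not existing code), the leaf hash might be computed like this:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical leaf-node hash, per (2a):
// HASH(Fi) = SHA1(Fi.data + Fi.len + Fi.name + Fi.modified)
public class FileLeafHash {

    public static byte[] hashFile(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");

        byte[] data = Files.readAllBytes(file);                      // Fi.data (a real implementation would stream this)
        long len = Files.size(file);                                 // Fi.len
        String name = file.getFileName().toString();                 // Fi.name
        long modified = Files.getLastModifiedTime(file).toMillis();  // Fi.modified

        sha1.update(data);
        sha1.update(Long.toString(len).getBytes(StandardCharsets.UTF_8));
        sha1.update(name.getBytes(StandardCharsets.UTF_8));
        sha1.update(Long.toString(modified).getBytes(StandardCharsets.UTF_8));
        return sha1.digest();
    }
}
```

Per the note above, the resulting digest would be cached per node and only recomputed when Fi.modified or Fi.name changes.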


Next, we do the exact same operation as above, but where an internal node of type Folder contains children of type File or Folder. In this case, we can define…

(2b). Elaboration (folder as internal node):

- Let Fo be a Folder with 1 or more child nodes.
- Let Fo.data be undefined.
- Let Fo.len be undefined.
- Let Fo.name be the folder name.
- Let Fo.num_children be the number of child nodes of the folder (whether file or folder type).
- Let Fo.modified be a folder modification timestamp.
- Let HASH(Fo) be a hash of a Folder for security and versioning purposes.
- Let HASH be (minimally) a CRC64 checksum or other fast hashing algorithm, but preferably a cryptographically secure hash of the form SHA1, SHA2, SHA3, RIPEMD, or otherwise.
- Let HASH preferably be a fast calculation with collision resistance.
- Let HASH be SHA1 or SHA512 as the best compromise between speed and security.
- Let HASH(Fo) be a message digest of the folder, all its children’s hashes, and (optionally) its metadata properties.
- For example, HASH(Fo) = SHA1(Fo.name + Fo.num_children + Fo.modified + HASH(Fo.child[0]) + … + HASH(Fo.child[n-1])), where ‘+’ is a concatenation operation of any form, or optionally and less preferably an XOR operation. Let n be the number of direct child nodes of the folder; the folder hash is then dependent on the hashes of all its children to any tree depth.
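
Continuing the sketch under the same assumptions (hypothetical names, SHA1 as the digest), the internal-node hash folds in each child’s already-computed hash, which is what makes a change anywhere below propagate up to the root:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical internal-node hash, per (2b):
// HASH(Fo) = SHA1(Fo.name + Fo.num_children + Fo.modified + HASH(child[0]) + ... + HASH(child[n-1]))
public class FolderHash {

    public static byte[] hashFolder(String name, long modified, List<byte[]> childHashes)
            throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(name.getBytes(StandardCharsets.UTF_8));
        sha1.update(Integer.toString(childHashes.size()).getBytes(StandardCharsets.UTF_8)); // Fo.num_children
        sha1.update(Long.toString(modified).getBytes(StandardCharsets.UTF_8));              // Fo.modified
        for (byte[] childHash : childHashes) {
            sha1.update(childHash); // each child already carries its own (recursively computed) hash
        }
        return sha1.digest();
    }
}
```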


Anyway, so that solves the issue on a local level… we represent the filesystem as a tree, and every node gets a hash that is dependent upon its file data and/or all of the nodes beneath it (whether they are files or folders).

This way, a change to a file at arbitrary depth propagates up the tree to the root node, which in turn triggers an explicit protocol exchange message of type ‘Filesystem Sync’ or something like that, with the packet data containing either (1) the hash value of the filesystem root node for comparison purposes, and/or (2) the serialized filesystem tree in its entirety for more detailed analysis by a peer of ‘what’s changed in my peer’s FS relative to what I have locally in my FS’.

Here is where the controversial / non-obvious part comes in… and I don’t claim this is the best solution either, but it seems the most straightforward one to me…

INSTEAD of implicitly defining the tree in the BEP protocol, where the tree is shared as a sort of ‘nested array’ of arbitrary depth (the way it works now, the tree is essentially compressed into a FAT or vFAT style array, which carries a lot of redundancy and doesn’t explicitly define hashes for EVERYTHING)…

…we EXPLICITLY share the serialized tree representation of the filesystem when it’s necessary to do so. This exchange now becomes its own message or series of messages which are specific ONLY to sharing the filesystem model among the p2p network for purposes of synchronization.

You’ll see in a moment that I think we should have some form of hashes shared all the way down to the block / chunk level. But at the very least, we need them down to the node level, for all files and folders we are sync’ing.


So instead, we make an entirely new BEP message or messages, which EXPLICITLY shares the serialized tree under the following conditions:

(1) At the start of a session.
(2) Whenever a change is detected in the tree (meaning a hash value changes and propagates up to the root node) via a filesystem watcher, FUSE event trap, or other mechanism (implementation not important here).
(3) Optionally, at periodic intervals…

So that means we have an EXPLICIT message / packet in BEP which contains the serialized tree (along with all of the hash values). To save space / traffic, we can start by sharing only the hash value of the root node, which implicitly tells any and all peers whether anything has changed in the nodes below.
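
A hypothetical shape for such a message (this is not an existing BEP message; every name below is an assumption) might be:

```java
import java.util.List;

// Hypothetical 'Filesystem Sync' message: a cheap root-hash announcement,
// optionally followed by the full serialized tree when a peer's root hash differs.
public class FilesystemSyncMessage {

    public final byte[] rootHash;           // always present: cheap "has anything changed?" check
    public final List<SerializedNode> tree; // null unless a full tree exchange was requested

    public FilesystemSyncMessage(byte[] rootHash, List<SerializedNode> tree) {
        this.rootHash = rootHash;
        this.tree = tree;
    }

    // One serialized node: enough to rebuild the tree and its hashes on the other side.
    public static class SerializedNode {
        public final String path;    // path relative to the synced folder root
        public final byte[] hash;    // Merkle-style hash of the node and everything below it
        public final boolean isFile;
        public final long modified;

        public SerializedNode(String path, byte[] hash, boolean isFile, long modified) {
            this.path = path;
            this.hash = hash;
            this.isFile = isFile;
            this.modified = modified;
        }
    }
}
```

The cheap case sends only the root hash; the serialized tree travels only when a peer reports a differing root hash.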

If changes have occurred, the nodes then exchange serialized versions of the filesystem tree, and each independently evaluates and processes what has changed… Each peer can determine whether it has the most current version of the FS once it has received ‘tree models’ from all other peers (this is done by evaluating the in-memory model of its own filesystem against the serialized ‘update’ packet which has arrived over the wire from one or more other peers).

The serialized filesystem packet is de-serialized and the in-memory structures are compared – not by name, but ONLY by hash. Here there can be some optimizations made via the traversal method, where we can safely ignore certain child sub-trees if the hashes match.

This property makes traversal fast, even for an extremely large filesystem with many small files. The only slow part (arguably) is the initial evaluation and exchange of FS tree models.
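
To make the comparison step concrete, here is a hedged sketch (the `TreeDiff` and `Node` types are hypothetical, not anything in the current codebase): root hashes are compared first, and only subtrees whose hashes differ are descended into. Children are paired by name purely to align the two trees; equality itself is decided only by hash.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical diff of two deserialized filesystem trees, comparing ONLY hashes.
// Subtrees whose hashes match are pruned, so traversal cost tracks the size of
// the changed region rather than the whole tree.
public class TreeDiff {

    public static class Node {
        public final String name;
        public final byte[] hash;                      // Merkle-style hash of this node and all children
        public final List<Node> children = new ArrayList<>();

        public Node(String name, byte[] hash) {
            this.name = name;
            this.hash = hash;
        }
    }

    // Returns the paths of remote leaves that differ from (or are missing in) the local tree.
    public static List<String> changedPaths(Node local, Node remote) {
        List<String> changed = new ArrayList<>();
        diff(local, remote, remote.name, changed);
        return changed;
    }

    private static void diff(Node local, Node remote, String path, List<String> changed) {
        if (local != null && Arrays.equals(local.hash, remote.hash)) {
            return; // identical subtree: nothing below can differ, skip it entirely
        }
        if (remote.children.isEmpty()) {
            changed.add(path); // changed or new leaf: queue block-level requests for it
            return;
        }
        Map<String, Node> localChildren = new HashMap<>();
        if (local != null) {
            for (Node c : local.children) {
                localChildren.put(c.name, c);
            }
        }
        for (Node remoteChild : remote.children) {
            diff(localChildren.get(remoteChild.name), remoteChild,
                    path + "/" + remoteChild.name, changed);
        }
    }
}
```

(Deletions, i.e. nodes present locally but absent remotely, would need a symmetric pass; that is left out of the sketch.)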

In any case – an update message occurs, the tree is traversed, and any changes at the leaf-node (file) level are queued so that updated versions of those files can be requested from a peer. The file is divided into blocks, and requested as normal in BitTorrent style.

The only issue I see here is defining who is the ‘authoritative’ source if the file changes simultaneously on two nodes, resulting in confusion as to which is the ‘official’ or ‘current’ version, since we are just comparing hashes.


I think this defines things well enough to get the idea of what I propose… There is one additional update, but this part is optional.

(3) Share block-level checksums or hashes of each file via explicit protocol messages, either as part of the tree serialization or as specific messages for files.

This lets clients request only the areas of the file which have actually changed… If chunks of the file remain the same, then there will be no exchange of data packets (unless another node explicitly requests a chunk).
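
A minimal sketch of that idea, assuming each peer advertises an ordered list of per-block hashes for a file (the `BlockDiff` class is hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical block-level comparison: given the block hash lists two peers
// advertise for the same file, return the indices of blocks that must be
// re-fetched. Unchanged blocks generate no data traffic at all.
public class BlockDiff {

    public static List<Integer> staleBlocks(List<byte[]> localBlockHashes, List<byte[]> remoteBlockHashes) {
        List<Integer> stale = new ArrayList<>();
        for (int i = 0; i < remoteBlockHashes.size(); i++) {
            if (i >= localBlockHashes.size()
                    || !Arrays.equals(localBlockHashes.get(i), remoteBlockHashes.get(i))) {
                stale.add(i); // block is missing locally or its content differs: request it
            }
        }
        return stale;
    }
}
```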

I guess the overall idea means moving towards modelling files based on their hash values, which form their ‘identifier’ for purposes of data exchange and synchronization.

This process opens the door to convergent encryption, as a side effect.

**IMPLEMENTATION IN JAVA OF A FILESYSTEM TREE**

public class MXMTree {

    MXMNode root;
    MXMNode commonRoot;

    public MXMTree(MXMNode root) {
        this.root = root;
        commonRoot = null;
    }

    public void addElement(String elementValue) {
        String[] list = elementValue.split("/");
        // The last element of the list is the filename.extension
        root.addElement(root.incrementalPath, list);
    }

    public void printTree() {
        // Print from the common root, skipping initial folders that have
        // only a single child (and no leaf).
        getCommonRoot();
        commonRoot.printNode(0);
    }

    public MXMNode getCommonRoot() {
        if (commonRoot != null) {
            return commonRoot;
        } else {
            MXMNode current = root;
            while (current.leafs.size() <= 0) {
                current = current.childs.get(0);
            }
            commonRoot = current;
            return commonRoot;
        }
    }
}


import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MXMNode {

    List<MXMNode> childs;
    List<MXMNode> leafs;
    String data;
    String incrementalPath;

    public MXMNode(String nodeValue, String incrementalPath) {
        childs = new ArrayList<MXMNode>();
        leafs = new ArrayList<MXMNode>();
        data = nodeValue;
        this.incrementalPath = incrementalPath;
    }

    public boolean isLeaf() {
        return childs.isEmpty() && leafs.isEmpty();
    }

    public void addElement(String currentPath, String[] list) {
        // Skip a leading empty element, which appears when splitting a path
        // that starts with a slash, such as "/sd/card".
        while (list[0] == null || list[0].equals("")) {
            list = Arrays.copyOfRange(list, 1, list.length);
        }

        MXMNode currentChild = new MXMNode(list[0], currentPath + "/" + list[0]);
        if (list.length == 1) {
            // Last path element: this is the file itself.
            leafs.add(currentChild);
        } else {
            int index = childs.indexOf(currentChild);
            if (index == -1) {
                childs.add(currentChild);
                currentChild.addElement(currentChild.incrementalPath, Arrays.copyOfRange(list, 1, list.length));
            } else {
                MXMNode nextChild = childs.get(index);
                nextChild.addElement(currentChild.incrementalPath, Arrays.copyOfRange(list, 1, list.length));
            }
        }
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof MXMNode)) {
            return false;
        }
        MXMNode cmpObj = (MXMNode) obj;
        return incrementalPath.equals(cmpObj.incrementalPath) && data.equals(cmpObj.data);
    }

    @Override
    public int hashCode() {
        // Keep hashCode consistent with equals (the original overrode only equals).
        return 31 * incrementalPath.hashCode() + data.hashCode();
    }

    public void printNode(int increment) {
        for (int i = 0; i < increment; i++) {
            System.out.print(" ");
        }
        System.out.println(incrementalPath + (isLeaf() ? " -> " + data : ""));
        for (MXMNode n : childs) {
            n.printNode(increment + 2);
        }
        for (MXMNode n : leafs) {
            n.printNode(increment + 2);
        }
    }

    @Override
    public String toString() {
        return data;
    }
}


public class Test {

    public static void main(String[] args) {
        String[] slist = new String[] {
                "/mnt/sdcard/folder1/a/b/file1.file",
                "/mnt/sdcard/folder1/a/b/file2.file",
                "/mnt/sdcard/folder1/a/b/file3.file",
                "/mnt/sdcard/folder1/a/b/file4.file",
                "/mnt/sdcard/folder1/a/b/file5.file",
                "/mnt/sdcard/folder1/e/c/file6.file",
                "/mnt/sdcard/folder2/d/file7.file",
                "/mnt/sdcard/folder2/d/file8.file",
                "/mnt/sdcard/file9.file"
        };

        MXMTree tree = new MXMTree(new MXMNode("root", "root"));
        for (String data : slist) {
            tree.addElement(data);
        }
        tree.printTree();
    }
}


REFERENCES

[A] Plots of Hash Algorithm Speed vs Type for files

[B] Discussion and Benchmarks of Speed vs Security of Hashing algorithms

[C] Java Tree to represent File / Folder Filesystem Source Code

I think you should read the code and understand how it works and what the existing problems are before suggesting something like this. Having hash trees will require multiple roundtrips just to work out what has changed, and doesn’t solve any of the problems we have (unless I am missing something). We could have a general index hash for cheap comparison, but given we have persistent indexes, I am not sure how much benefit this provides as we already have versions to check if something is the same.

Syncthing already works similarly to what you said, whereby it only exchanges diffs rather than whole trees after the initial exchange, and only hashes files whose mtime has changed. Storing the index will save you some space (as filepaths will be shorter) but I feel that is insignificant compared to content hashes.

I am confident this would solve quite a few issues, while making new nightmares of course… I don’t dispute it would also be a big pain in the neck. And I don’t expect anyone else to work on a pet project that may be unclear in LOE and impact.

I don’t see this as a magic solution, but rather it is what the ‘state of the art’ in distributed cryptographic filesystems is moving towards, especially peer-to-peer… especially considering the excitement around Tahoe-LAFS and ZFS etc. Perhaps the above idea enables a move towards streaming ‘on-demand’ filesystems and block-level exchange, and enables the use of convergent encryption (perhaps eventually allowing bridging into larger networks).

Let me look more carefully at the codebase, to see what exactly would change, and if I’ve overlooked something important.

I think this is a solid idea. But I could be wrong, so I’ll have to do some more digging. Maybe I’m missing the point – that happens for sure.

Having hash trees will require multiple roundtrips just to work out what has changed,

I don’t think this is true in the most literal sense – my argument is ‘what is the alternative?’… I think there’d be worst-case overhead on FS changes where the traffic is proportional to O(log n), where n is the number of filesystem nodes… But since it’s metadata, I don’t see there being a huge problem.

Okay, let’s suppose the following… I sync my /etc directory, which has 3295 files comprising 15M total. This is an example highly skewed towards many tiny files – so it represents the most ‘dangerous’ extreme in terms of protocol overhead.

Each file or folder in /etc contributes its name, hash, and metadata (256 bytes for the name + ~1024 bytes for all other metadata). Let’s say each node takes up on average 1.3k (about 1024 + 256 bytes) or thereabouts.

Full exchange = 1.3k * 3295 folders/files = 4.2 megs. Under these worst-case assumptions the metadata makes up roughly 30% of the total data being ‘sync’d’ in this unusual example…

My WLAN upload rate is somewhere around 250k/s or so personally, so transmitting the full tree for /etc without any compression, and with generous metadata, takes about 17 seconds. I don’t see a huge problem there for 3200 files, but maybe the assumptions I’m making are wrong…

I think the main idea is that we ought to be sharing hashes or CRCs of all files and folders… Is this something that’s already happening? I thought this was not the case… The BEP spec looks more like the FS is modeled as a nested array, and I didn’t see any hashes in the exchange.

Anyway don’t worry, there’s no expectation for others to implement this. The purpose is for brainstorming and perhaps implementation depending on what holds promise.

This implies to me that you haven’t read or understood how BEP works. The two lines read:

BEP is used between two or more devices thus forming a cluster. Each device has one or more folders of files described by the local model, containing metadata and block hashes.

File = FileInfo
FileInfo contains 0 or more BlockInfo
BlockInfo contains size, hash.

Hashes over folders have no meaning, as folders don’t really have content as such; they are containers for something that has content. We don’t have a single hash over the whole file; we have a hash over each block of the file (hence Block Exchange Protocol).

Okay, well that’s good news. Perhaps I was looking at an older revision of BEP.

The advantage to giving folders hashes is that they reflect all child nodes.

Let me do this… I’ll do a TLS sniff of a non-DH session and let’s re-evaluate this post after I sniff the BEP traffic.

From what you are saying, a good portion of what I wrote about may already be done.

Thanks for the insights, Audrius, and excellent work on what you guys have done. I’m on my smartphone with low battery. TTYL and thanks.

There is only one revision, as far as I am aware, and it was always BEP and it always had hashes. Having hashes over folders is meaningless as our stuff is sent in a flat hierarchy, and each item in an array has the location of the file in the hierarchy defined by its path. You can reconstruct a tree from that, but in reality there is no need, as we only sync files, which means we only care about the children, and we don’t really need to know where the children are or whether that caused some folder above to change something.

At the point some file changes, we know what has changed and we only send that file; this way we skip the whole tree business and only send what’s relevant.

I suggest you really read the spec and understand how it works before making suggestions.

Okay I got too much other stuff to do, I’ll get back with a reply asap.

On BEPv1: I’ve read it in depth at least a half dozen times. I don’t understand why the message ID is 12 bits and relies on packet ordering at the transport or session level to enforce a state reflected in the application.

The 12-bit message id is far too small a number, and the state (in that we want to ensure a peer is not missing packets) should be enforced via CRC32, CRC64, or HASH, not by a sequence counter modulo 2^12.

Alternative: the message IDs are random, much larger than 12 bits, generated and logged by the sender. They are added to a ‘digest’ of whatever form, then sent by Peer1 to Peer2. Next, Peer2 receives the message and sends back the message ID along with an ACK – the ACK being the peer’s current ‘digest’ of all message IDs seen on that TCP channel for the session.

The same principle of filesystem consistency via hashes can be applied to packets. And even if it isn’t, 4096 is too small a space to avoid problems like collisions.

In the solution above, if someone ‘misses’ a packet it immediately shows up as a hash discrepancy, rather than having to count the odometer rolling over every 4096 packets (or having birthday collisions with substantial frequency).
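
To illustrate the idea only (this is not how BEP works today, and every name below is hypothetical), a per-channel ledger could fold random message IDs into a running hash chain that the ACKs carry back:

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical sketch: every message carries a random 64-bit ID, and each side
// folds the IDs it has sent/received into a hash chain. The receiver's ACK
// carries its current chain value; if the two chains disagree, a message was
// lost or duplicated on that channel.
public class MessageLedger {

    private final SecureRandom random = new SecureRandom();
    private byte[] chain = new byte[32]; // running digest, starts at all zeros

    // Sender side: pick a random ID for an outgoing message and fold it in.
    public long nextMessageId() {
        long id = random.nextLong();
        fold(id);
        return id;
    }

    // Receiver side: fold in the ID of every message received on this channel.
    public void recordReceived(long messageId) {
        fold(messageId);
    }

    // Value to put in the ACK so the peer can compare it against its own chain.
    public byte[] currentDigest() {
        return chain.clone();
    }

    public boolean matches(byte[] peerDigest) {
        return Arrays.equals(chain, peerDigest);
    }

    // chain = SHA-256(previous chain || message ID); order-dependent, which is
    // fine over TCP where both sides observe the same message order.
    private void fold(long id) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            sha256.update(chain);
            sha256.update(ByteBuffer.allocate(Long.BYTES).putLong(id).array());
            chain = sha256.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available on the JVM
        }
    }
}
```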

Full post when I have time. It all relates to the idea of ‘let’s make this scalable’, and everything should be a CRC/HASH. Checksums and hashes solve problems in a sync application – the more the better, up to a point. Good luck to all coding.

Also, now re-reading my morning post I do sound a bit sarcastic, sorry if I offended you in any way. All I wanted to say is, when in doubt, check with the spec.

So the reason it relies on order is that it sort of relies on the order the events happen in… files being added, deleted, etc. To be honest, the reason we use TCP rather than UDP is because TLS requires reliable transport, which UDP is not. @calmh has written DST, which provides a reliable transport over UDP, which eventually should be used in syncthing.

To be honest, given we use TCP, we don’t really need the packet ID at all. The only reason it’s there is to be able to match a request with its response as they travel around async. You cannot miss a packet with TCP, nor can it arrive out of order, and I doubt you will ever have more than 4096 BEP packets (not TCP packets) in flight at one time. The only sensible time when this could happen is when you transfer data, and these are usually 128kb in size, hence 4096 is more than enough.

Okay, thanks. No worries… Frankly, UDP is not an important feature to me at the moment… it’s just important if you want to scale big. It would be too early to add it back in now.

Right, agreed… any UDP approach requires built-in TCP-like functions. Not necessarily sequencing and window adjustment… but UDP would definitely need ACK and RESEND packets for file transfer.

Three TCP-like UDP implementations are already out there… UDT is one. Another is UFTP. I think the last is Tsunami UDP.


If we do stick with TCP, I’d actually suggest keeping that field in there for now regardless. You can always re-use it for bit flags and so forth.

TCP does guarantee delivery, but there are some circumstances where this won’t work. My house is a perfect example. When I log into the city wifi, my internet randomly cuts out once every 10 to 20 minutes, and I have to manually disconnect and reconnect to the AP.

This disconnect / reconnect process screws up my (1) TCP NFS mounts, and (2) any mounted /dev/nbd (network block devices). Ironically, BitTorrent is often the only protocol that doesn’t seem to care about intermittent wifi connectivity.

A poor connection (disconnecting / reconnecting wifi) could probably mess up the Syncthing protocol if there is no tracking method (or timeout), because even though I’m offline, to you it might just look like a laggy TCP stream (since I never send back either an ACK or an RST). Could this cause issues? I’m not sure.


I looked at the Wireshark traffic earlier – the Angular code looks great. Really solid. So does the web service code.

The only thing I’m unclear on is the details of the file exchange, but I’m hoping to drop the TLS from the picture pretty soon. Right now I can sniff the two VMware instances very easily. I can’t decrypt the TLS part yet, but I can see the web service calls.

Two questions…

QUESTION #1:

What does Syncthing do if the two nodes are both on internal IPs behind NAT, but cannot reach each other via UPnP? Does the ‘tracker’ act as a TCP bridge? What happens in this situation?

QUESTION #2

The question I do have is this… Is there a limit to filesystem tree depth and breadth? Or is this simply parsed on the relative path string separator?


Anyway, for the message ID in BEP, I don’t think it’s a big deal or needs to be modified unless either (1) you NEED to be able to audit that peers didn’t drop any traffic, and/or (2) there is a plan to start implementing UDP in the near term.

What I’d suggest doing is leave the 12-bit ID in there, or perhaps expand it to a random 32-bit message ID value, to keep the option of a CRC32, TCP-like sequence number, or checksum later. Even if it’s just a placeholder.

If you need ‘sequence’ to operate for reasons other than verifying ACK’d packets, then just increment a randomly picked 32-bit or greater message ID – or just steal the sequence number from the TCP packet header, because I think that does something similar.

The other thing is I’ve been reading papers on variable block sizes etc… For TCP, efficient block sizes are much larger (64k up to megs); for UDP, efficient block sizes are smaller.

This makes TCP easier to implement and debug, which is a benefit. Anyway, my 2c is do one of two things:

1. Leave the 12-bit message ID alone if it’s deprecated and unused, and repurpose it later if needed.
2. Change to a larger message ID, populate it with a random number, and have any reply to a packet contain an ACK with that number.

1. So it tries to perform UPnP to get an external port (and we should do NAT-PMP); if that fails, it just hopes that the user has mapped external port 22000 to internal port 22000. Once we have UDP in place, we can tackle that with UDP punch-through via the discovery server. There is currently no support for relays, but with 10 lines changed in syncthing and 40 lines of Go for the relay part, you could easily do this.

2. So the protocol limits the filepath to 8192 characters. On Linux the limit is 4096, and on Windows it’s 255, or longer if you resort to hacks. The filepath in the protocol is relative to the root of the folder, rather than the root of the filesystem, obviously.

I ask because broadcast was disabled in VMware, but the two internal IPs on two virtual machines running Syncthing still figured out how to talk to each other (both had internet through double NAT). Maybe the Syncthing announcer told them their internal IPs? A bug in VMware? Not sure, heh.

Yeah, I think one advantage of UDP would be the whole hole-punching and firewall bypass ability. It is a bit of a pain to write and debug though.

Okay, cool. I think Windows programs limit this to well below 8k anyway, even though they officially support paths up to 32K long. For Linux I think the full path has to be 8k or less. So I think 8k is a reasonable assumption to cover 99.99% of cases.