Yeah, sorry about that. Those posts were absolutely unreadable, thanks to a repeatedly disconnecting wifi adapter and broken keyboard. I went back and edited for clarity.
Okay, I’ve written up the main idea of what I’m proposing here…
https://forum.syncthing.net/t/proposal-for-revision-of-filesystem-transfer-and-sync/2056/1
Click the above link for the details as far as I’ve thought things out anyway.
But to summarize…
I think some adaptation of a new way of modelling the filesystem and exchanging FS-related messages makes the most sense as a sort of ‘must have’ feature… Each peer would hold a model of its filesystem as an in-memory object, and every node in the FS tree would have a hash. On top of this, the peers would have a specific set of packets to ‘share’ their in-memory FS models, as well as to efficiently share changes to those models.
I know it would suck to write this in, so I don’t blame anyone for putting it on the back burner, because it’s a big project. And maybe part of it’s already done, or maybe there’s a better way to accomplish the same goals…
But I like the idea of an FS-tree with ubiquitous hashing of all files, folders, and nodes… I like that from the standpoint of ‘future proofing’ … It also seems like doing this would squash a number of bugs and feature requests simultaneously.
I think this sort of technology really opens doors to what we ‘could’ do… For example, it would solve the feature requested in this thread, among a few others… Also, I think there is probably a middle ground between what I’ve described in that link and what is currently implemented.
The ideal way to do this is via incremental change, as that’s much easier to debug than dispensing with the existing infrastructure entirely.
To boil it down, I suppose the most important points would be as follows…
(1) Model each peer’s filesystem in memory as a tree, where each node is a file or folder, and where each node has a hash value dependent upon (A) the state of its children and/or (B) its data and attributes.
(2) Most important point: Everything gets a hash! Files get hashes which act as their identifiers AND as change-detection methods. Folders get hashes which act as identifiers AND as change-detection methods.
(3) If we want to get fancy: Even file-chunks (blocks) each get their own hash, even if it’s only a CRC32 checksum per 64k block. This lets us exchange only the parts of files that have changed.
(4) Have an efficient method of serialization / de-serialization of the FS-tree object (conversion of the in-memory FS tree object to a byte-level output) that can be shared among peers and is language independent, byte-efficient, etc. The FS serialization can even involve tree compression so it’s less chatty.
(5) Take these new FS-tree implementations and put them into new explicit protocol-level messages which have the sole purpose of sharing models (or ‘peer-views’) of their filesystems… This way peers can compare and efficiently determine ‘what’s changed’.
(6) By using hash values which automatically ‘propagate’ towards the tree root, we can quickly detect if any file or folder changed (and where the change is in the FS tree!). Any leaf node change implicitly changes that node’s hash, as well as the hashes of all nodes ‘above’ the leaf node… This allows efficient detection of changes because a peer may ignore entire subtrees (perhaps containing thousands of files) if the hashes match, which in turn allows efficient traversal of arbitrarily large FS trees.
(7) Clever implementation can limit traffic overhead by exploiting hashes as both ‘handles’ (for nodes) and ‘versioning’ (for entire subtrees)… i.e. we don’t exchange the entire FS tree except at the start of a session; after that, perhaps we only share the subtrees that changed… This would allow incredible scalability, especially in conjunction with block-level / chunk-level hashes or CRCs.
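To make points (1), (2) and (6) a bit more concrete, here is a rough Python sketch of the hashed tree and the "skip matching subtrees" comparison. Everything in it is my own illustration, not anything Syncthing currently does: the names `file_node`, `dir_node` and `diff` are made up, SHA-256 is just an assumed choice of hash, and file hashes cover only the data (attributes are left out to keep it short).

```python
import hashlib

def file_node(data: bytes) -> dict:
    # Leaf node: the hash depends only on the file's data (attributes omitted here).
    return {"kind": "file",
            "hash": hashlib.sha256(b"file\0" + data).hexdigest(),
            "children": {}}

def dir_node(children: dict) -> dict:
    # Folder node: the hash depends on the names and hashes of its children,
    # so any change below it propagates up towards the root.
    h = hashlib.sha256(b"dir\0")
    for name in sorted(children):
        h.update(name.encode("utf-8") + b"\0"
                 + children[name]["hash"].encode("ascii") + b"\n")
    return {"kind": "dir", "hash": h.hexdigest(), "children": children}

def diff(local, remote, path=""):
    """Return paths that differ, never descending into subtrees whose hashes match."""
    if local is not None and remote is not None and local["hash"] == remote["hash"]:
        return []  # identical subtree: skip it entirely
    if (local is None or remote is None
            or local["kind"] != "dir" or remote["kind"] != "dir"):
        return [path or "/"]  # added, removed, or changed file (or type change)
    changed = []
    names = sorted(set(local["children"]) | set(remote["children"]))
    for name in names:
        changed += diff(local["children"].get(name),
                        remote["children"].get(name),
                        path + "/" + name)
    return changed

# Two peers whose trees differ in a single file.
peer_a = dir_node({
    "docs": dir_node({"a.txt": file_node(b"one"), "b.txt": file_node(b"two")}),
    "pics": dir_node({"x.png": file_node(b"img")}),
})
peer_b = dir_node({
    "docs": dir_node({"a.txt": file_node(b"one"), "b.txt": file_node(b"TWO")}),
    "pics": dir_node({"x.png": file_node(b"img")}),
})
print(diff(peer_a, peer_b))  # ['/docs/b.txt']
```

Note that the `pics` subtree is never walked, because its hash is the same on both sides. That is the whole trick behind ignoring thousands of unchanged files.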
Interestingly, moving down this path enables the possibility of convergent encryption.
Why? Well, a file’s hash is its symmetric crypto key in convergent encryption, so there is some major overlap here. Alternatively, in the more secure version of convergent encryption, we have a global shared secret Global_FS_key, where a file’s symmetric crypto key is HMAC_SHA1(Global_FS_key, file.data)…
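As a quick sketch of the two key derivations, using only Python's standard `hashlib` and `hmac` modules: the function names are hypothetical, and SHA-256 for the plain variant is my assumption, since I haven't picked a hash for it. This only derives keys; it doesn't do the actual encryption.

```python
import hashlib
import hmac

def plain_convergent_key(data: bytes) -> bytes:
    # Plain variant: the key is simply a hash of the file contents
    # (SHA-256 is an assumed choice here).
    return hashlib.sha256(data).digest()

def keyed_convergent_key(global_fs_key: bytes, data: bytes) -> bytes:
    # Keyed variant: HMAC_SHA1(Global_FS_key, file.data), as described above.
    return hmac.new(global_fs_key, data, hashlib.sha1).digest()

secret = b"example shared secret"  # stand-in for Global_FS_key
key = keyed_convergent_key(secret, b"file contents")
print(len(key))  # 20 (SHA-1 digests are 20 bytes)
```

Identical files give identical keys on every peer that knows the secret, which is what lets encrypted copies still be deduplicated and compared by hash.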
This direction also enables the possibility of mounting a ‘streaming’ Syncthing filesystem (if we implement the idea of block-level file checksums which act as chunk ‘handles’ or ‘identifiers’, coupled with file-hashes and sharing of full FS-tree models among peers).
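For the block-level checksums, something like the following is what I have in mind. It's a sketch only: `block_checksums` and `changed_blocks` are made-up names, and the 64k block size with CRC32 is taken from point (3) above. CRC32 is only a weak 32-bit check, so a real implementation would probably want a stronger hash if the checksum doubles as an identifier.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64k blocks, as in point (3)

def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    # One CRC32 per fixed-size block; the list index acts as the block 'handle'.
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def changed_blocks(old: list, new: list) -> list:
    # Indices whose checksum differs, plus any blocks present on only one side.
    longest = max(len(old), len(new))
    return [i for i in range(longest)
            if i >= len(old) or i >= len(new) or old[i] != new[i]]

old_data = bytes(200_000)        # 200,000 zero bytes -> 4 blocks
new_data = bytearray(old_data)
new_data[70_000] = 1             # flip one byte inside block 1
print(changed_blocks(block_checksums(old_data),
                     block_checksums(bytes(new_data))))  # [1]
```

Only block 1 would need to cross the wire here, instead of the whole file.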
In the latter case, to stream an avi, I can go to a brand new computer (or smartphone)… Then the software requests Block 0 from Peer 1, Block 1 from Peer 2, Block 2 from Peer 1, and so on… (like RAID striping)…
This allows me to stream the avi without downloading it first, and is made possible by block-level fetching of data from redundant peers… Here I am limited only by (A) the TCP download bandwidth of the destination device, or (B) the combined TCP upload bandwidth of all peers with a copy of the file… whichever value is lower.
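The striping and the bandwidth limit can be written down in a few lines. Again this is just an illustration with hypothetical names (`plan_blocks`, `max_throughput`), and a plain round-robin assignment is an assumption; a real scheduler would presumably weight peers by their speed.

```python
def plan_blocks(n_blocks: int, peers: list) -> list:
    # Round-robin assignment of block indices to peers, like RAID striping.
    return [(i, peers[i % len(peers)]) for i in range(n_blocks)]

def max_throughput(download_bw: float, peer_upload_bws: list) -> float:
    # The stream is capped by the lower of (A) the destination's download
    # bandwidth and (B) the combined upload bandwidth of all source peers.
    return min(download_bw, sum(peer_upload_bws))

print(plan_blocks(3, ["Peer 1", "Peer 2"]))
# [(0, 'Peer 1'), (1, 'Peer 2'), (2, 'Peer 1')]
print(max_throughput(50, [10, 15]))  # 25
```

In the second call the two peers can only upload 25 units combined, so the 50-unit download link is not the bottleneck.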
The big ‘what’s new’ is that since the protocol has shared the structure of the filesystem prior to exchanging data, we have the option of selectively or globally streaming files by only downloading them upon a call to open()… Probably though we’d need FUSE (or kernel drivers) to implement Bitcasa-like streaming / NFS-share functionality, since for this feature we need to intercept the call to open() and initiate a download… We’d also need to intercept calls to fseek to determine which blocks to fetch…
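Once a seek and read have been intercepted, working out which blocks to fetch is simple arithmetic. Here's a sketch, assuming the same fixed 64k blocks as before; `blocks_for_read` is a hypothetical helper and not part of FUSE or Syncthing.

```python
BLOCK_SIZE = 64 * 1024  # assumed fixed block size

def blocks_for_read(offset: int, length: int, file_size: int) -> list:
    # Map a byte range [offset, offset + length) onto the block indices
    # that must be fetched before the read can be satisfied.
    if length <= 0 or offset >= file_size:
        return []
    end = min(offset + length, file_size)   # clamp the read to the file's end
    first = offset // BLOCK_SIZE
    last = (end - 1) // BLOCK_SIZE
    return list(range(first, last + 1))

# A 2-byte read that straddles the boundary between block 0 and block 1.
print(blocks_for_read(65535, 2, 200_000))  # [0, 1]
```

So a player seeking to the middle of an avi would only trigger downloads for the handful of blocks around that offset, not for everything before it.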
So the idea of a ‘streaming’ file system probably needs the FS model overhaul coupled with either OS-specific kernel-space drivers, or else FUSE drivers. I don’t really see another way to intercept OS-level system calls… But this would be a cool feature because it’d essentially make Syncthing an open-source version of Bitcasa, in addition to an open-source version of Dropbox, Rsync, and Bit-Sync.
On top of that, the idea that we share the filesystem ‘snapshot’ or structure prior to sharing any of the files’ data also enables this idea of an NFS-style / Bitcasa-style ‘streaming’ torrent filesystem. Anyway, those are my thoughts at the moment… Subject to change if I’m misunderstanding something important, haha.