Implementing public shares

alberto101 · October 18, 2014, 9:11am

I’d like to ask if any developers, actual or potential, would be interested in looking at ways to implement the public shares/folders/collections functionality. Yes, there are some issues to discuss, but there may be no need to attack this thing head-on with some radical redesign, unless it looks like the end benefits are worth the effort.

Also, it seems to me that the scalability and efficiency may be improved, and I would not be surprised if it could be quite a significant improvement.

From my rough estimate, just by quickly looking around the code, it could be that syncthing/pulse will start showing performance degradation and load issues at about 50 nodes online, and it might even choke at 100. Sure, this needs a much more detailed analysis, but still, I doubt it will be able to perform even in medium-scale installations.

In particular, keeping permanent TCP connections seems to be one of the issues that needs a further look.

The same goes for the necessity to download the entire index. What does it buy you and why is it needed?

Thirdly, there is the issue of needing to know too much information about nodes which you are either already in sync with, or which, for a variety of reasons, have nothing to contribute to you, nor you to them.

That could be a start.

Otherwise, how could you even compare syncthing/pulse to BTSync? For example, one of their reps specifically addressed this issue by saying: well, syncthing isn’t even in the same league, because it can’t do what BTSync can do. Secondly, BTSync, at least in versions before the disastrous 1.4.nn update, had much better reporting of transfer dynamics in the GUI: the Transfer tab shows you exactly which files are being transferred, to and from where, and at what speed. That is the feature I find one of the most desirable. Without it, you do not really see the picture of what is going on, especially during active periods when you could be transferring things to and from several nodes simultaneously.

And there are other features and abilities of BTSync that are not currently present or supported by syncthing, and we can discuss the specifics if anyone is interested. I know BTSync quite well from the user standpoint and have been working with it pretty actively for at least half a year and did some analysis of various things, limitations and issues.

So…

Hope it helps.

alberto101 · October 18, 2014, 12:56pm

How to make Syncthing/Pulse comparable to BTSync?

Well, what about the following scheme?

There is an additional concept of a share hash. The share hash may be constructed from the share name and represents the combination of share ID and access mode (r/w or r/o).
The share hash is constructed in a way similar to BTSync: 1 character for the access mode (A for r/w, B for r/o) followed by 32 characters of a share ID hash in base32 encoding, whose alphabet is 26 capital letters plus 6 digits.
When a share is created for the first time, the Add Share/Folder/Collection dialog has a “Public share” checkbox. If that box is checked, the share becomes public (and maybe the node that created it automatically gains the status of an Introducer node).
When some other node adds that share hash then, after it discovers at least one other node with that share, it also receives the information that the share is public. In terms of config parameters, this automatically marks the share as public, sets its mode to r/w or r/o according to the hash, and sets all other config parameters to default values.
The public state of a share is permanent, and the share cannot be made private in the future. For that, a new share needs to be created following the usual syncthing procedure.
Any node that attempts to discover or connect to some other node on a public share is automatically granted the corresponding access and can start downloading the data without any user interaction.
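To make the key format above concrete, here is a minimal Python sketch of building and parsing such a share hash. The A/B prefixes and the 32 base32 characters come from the scheme as described; the choice of SHA-1 (whose 20-byte digest encodes to exactly 32 base32 characters) and the function names are assumptions made purely for illustration.

```python
import base64
import hashlib
from typing import Tuple

ACCESS_PREFIX = {"rw": "A", "ro": "B"}  # prefixes as proposed in the scheme
PREFIX_ACCESS = {v: k for k, v in ACCESS_PREFIX.items()}


def make_share_key(share_name: str, access: str) -> str:
    """Build a 33-character key: 1 access char + 32 base32 chars."""
    # SHA-1 is an assumption; any 20-byte digest gives exactly 32 base32 chars.
    digest = hashlib.sha1(share_name.encode("utf-8")).digest()
    share_id = base64.b32encode(digest).decode("ascii")
    return ACCESS_PREFIX[access] + share_id


def parse_share_key(key: str) -> Tuple[str, str]:
    """Split a key into (access mode, share ID), rejecting malformed input."""
    if len(key) != 33 or key[0] not in PREFIX_ACCESS:
        raise ValueError("malformed share key")
    share_id = key[1:]
    base64.b32decode(share_id)  # raises on characters outside A-Z and 2-7
    return PREFIX_ACCESS[key[0]], share_id
```

Note that in this sketch the r/w and r/o keys for one share differ only in the first character, so the prefix works as a label rather than a secret; whatever actually enforces read-only access would have to live elsewhere.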

What kind of problems or issues do you see with this scheme so far?

Alex · October 18, 2014, 12:56pm

This is also something that I hate. From what I know, it was the easiest way to compare what is different in the index, and I think there was also a plan to improve that in the future.

Maybe save what you already sent on the last connect and only send what was changed since then? But here the problem could be that the other node deleted the index (deleted by the user) and maybe other cases.

A probably easier solution would be that both nodes send a hash of the index of one folder, and only if it’s different do they send the whole index of that folder. For me this would be enough, since I have some big repos that don’t change very often but have a big index, and some small repos that change often, but there the index is also small and I don’t care if it’s always sent.
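A rough Python sketch of that idea, under assumptions: the index is modelled here as a plain dict from file path to a (version, content hash) pair, which is only a stand-in and not syncthing’s actual index format, and the function names are made up.

```python
import hashlib
from typing import Dict, Optional, Tuple

# Hypothetical index shape: file path -> (version, content hash).
Index = Dict[str, Tuple[int, str]]


def index_digest(index: Index) -> str:
    """Hash the index in sorted path order so equal indexes give equal digests."""
    h = hashlib.sha256()
    for path in sorted(index):
        version, content_hash = index[path]
        h.update(f"{path}\0{version}\0{content_hash}\n".encode("utf-8"))
    return h.hexdigest()


def index_to_send(local: Index, remote_digest: str) -> Optional[Index]:
    """Return the full index only when the peer's digest differs from ours."""
    if index_digest(local) == remote_digest:
        return None  # digests match: nothing to exchange for this folder
    return local
```

With this, a big repo that rarely changes costs one short digest per connect, and the full index only goes over the wire when something actually differs.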

This would be nice to have for syncthing and is maybe not too hard to implement. You could maybe open an issue on GitHub about that.

For the other part, about performance with many connections, I can’t say that much, but from just using syncthing the only problem that I can think of with many connections is the index exchange at the beginning, which should really be improved like I already said before.

Alex · October 18, 2014, 1:07pm

At least for me how you described it fits more or less into how syncthing works (and not only something like “i want public shares but i have no idea how this fits into syncthing” - which we already saw in other threads)

This will still need many things to be implemented. The first thing that would be useful on its own is a read-only flag that you set for other nodes in the config on the repo (near the “Shared With” setting, an additional setting “Do not accept changes from this node”).

bigbear2nd · October 18, 2014, 1:37pm

It is already, at least inside another issue: #292

AudriusButkevicius · October 18, 2014, 2:18pm

It’s no more than a day’s worth of work if anyone is up for it. The more problematic thing is making it fit in with the UI.

alberto101 · October 18, 2014, 3:12pm

Yes. But the benefits?

And the benefits are that you would get LOTS of users abandoning BTSync for various reasons. Quite a few have already expressed the desire to do so right this very moment, and quite a few of them are going back to the 1.3.nn version because of the new web browser based UI, which is simply horrible to me.

For example, security and encryption arguments by BTSync are nothing more than words. Because it can not be verified and what kinds of backdoors they have in their code only God knows, well, and the NSA and other agencies, of course.

Secondly, they are closed source and you are an open source. Nobody can fix tons of their bugs, even if they could. And everybody can fix your bugs if they can.

Open source accumulates knowledge and creativity, and the closed source self-destructs eventually, because in the modern world it is not that easy to keep up with “the bleeding edge”. Simple as that.

The problem with the browser based UI is that browsers are not designed for it, and HTML and CSS are not dynamic in nature. So, you need either Javascript or PHP, and that is where the problems start, but not END, by ANY means.

My estimate is that it should take a couple of days to get the first version rolling to begin testing the major aspects of it.

The main issue I see is node discovery based on share ID hash, even though the present discovery mechanism should work, except you need a translation from the share ID to node ID, which might be generated by another or existing hashing/node generation algorithm.

Yes. That is what I was concerned with mostly. But it looks like it should not represent much of a problem.

Scratch that share/folder name hashing to generate the hash ID/access. That is not such a great idea because you might easily get the duplicate share names, and that would generate the duplicate share IDs, which MUST be unique.
Instead, do the same thing as BTSync, and that is, you MAY enter the share name, but it is nothing more than a label, something that might make sense to the users. But the MAIN ID should be generated by simply pushing the Generate button next to the share ID edit field. Pushing this button generates a new and unique share ID/access key. That will guarantee that share/folder IDs will be globally unique.

Yes, that is still, at least theoretically, not guaranteed to be globally unique. But if you detect that you are a duplicate, you simply use an additional signature, which is a plain ordinary GUID. And if THAT combination is not unique enough, then you have another option: jump from the Golden Gate bridge in San Francisco.

Actually, there are two buttons, one - for the r/w access and the other one for the r/o access. Both have the corresponding edit box next to them where the results of generated key are displayed.
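One possible way to back those two buttons, sketched in Python under assumptions: the r/w ID is 160 random bits, and the r/o ID is derived from it with a one-way hash so that holding the r/o key does not reveal the r/w key. That derivation, the use of SHA-1, and the function names are illustrative choices and not part of the proposal itself; the A/B prefixes are the ones suggested earlier in the thread.

```python
import base64
import hashlib
import secrets
from typing import Tuple


def generate_key_pair() -> Tuple[str, str]:
    """Return (rw_key, ro_key) for a freshly generated share."""
    rw_id = secrets.token_bytes(20)       # 160 random bits -> 32 base32 chars
    ro_id = hashlib.sha1(rw_id).digest()  # one-way: r/o ID does not reveal r/w ID
    rw_key = "A" + base64.b32encode(rw_id).decode("ascii")
    ro_key = "B" + base64.b32encode(ro_id).decode("ascii")
    return rw_key, ro_key


def ro_key_from_rw(rw_key: str) -> str:
    """Recompute the r/o key from an r/w key, so r/w holders can hand out r/o access."""
    rw_id = base64.b32decode(rw_key[1:])
    ro_id = hashlib.sha1(rw_id).digest()
    return "B" + base64.b32encode(ro_id).decode("ascii")
```

With 160 random bits an accidental duplicate is not a practical concern, so the GUID fallback mentioned above would be a belt-and-braces measure at most.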

Add the ability for node discovery via “Predefined Host”. A predefined host is any host name/IP:port that knows about this share. It does not even need to hold the physical data for the share (though it may, if so desired), in case you want to use that fixed IP:port host as some “global tracker”, even on a private network.

a) But the ability to behave purely as a tracker is highly desirable. Because you can use it for MANY shares and all sorts of nodes. It becomes your own universal tracker. Once some share info is added to it, from then on, ANY node globally may use it just like a plain ordinary torrent tracker.

b) The “Predefined hosts” is actually a list, not just one particular host. You can add as many of those hosts as you wish. The more, the better and the higher chance of node discovery. And the beauty of it is that any host with a fixed host name or IP and a well known port will behave like a predefined host and a tracker at the same time.

And it can be changed, edited, removed or you name it any time you want. As long as all other actual or potential participants know it.

If you add a checkbox in share/folder edit dialog that says: “Use as a tracker only” when you create that share, that means that this node does not actually contain the data for the share. It is merely a tracker that knows about ALL the nodes on this share.

a) But the tracker functionality is on a per share/folder basis and not per node. There is no such notion as a global per-node tracker. It simply does not make sense logically, even though it is possible to use some nodes as global suppliers of any public information they know about. But this is the “next step”, if it makes sense at all.

Furthermore, ANY node should be able to behave as a general purpose tracker in respect to the shares it knows about. As soon as the 1st node that actually has at least some data for this particular share is on line, any other node contacting it ever since, regardless of how it learned about it, will get the list of ALL the known nodes for this share, including the contacting nodes themselves. They simply become the members of the pool, regardless of whether they do have any data, or are empty at the moment. But there are some details to consider here.

a) Any node can behave like a tracker for as long as you discover it by any means available, such as DHT, which is a pretty good and efficient general purpose mechanism of node discovery and mapping the share ID hash to the list of nodes that have that share.
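The tracker behaviour described above boils down to a map from share ID to the set of nodes that have announced it. A minimal in-memory Python sketch follows; the class and method names are hypothetical, and persistence, expiry of stale nodes and the network protocol are all left out.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ShareTracker:
    """In-memory map from share ID to the addresses of nodes announcing it."""

    def __init__(self) -> None:
        self._nodes: Dict[str, Set[str]] = defaultdict(set)

    def announce(self, share_id: str, node_addr: str) -> List[str]:
        """Register node_addr for share_id and return every known node, caller included."""
        self._nodes[share_id].add(node_addr)
        return sorted(self._nodes[share_id])

    def forget(self, share_id: str, node_addr: str) -> None:
        """Drop a node that has left the share or gone offline."""
        self._nodes[share_id].discard(node_addr)
```

Because the map is keyed by share ID, one such tracker can serve many shares at once while still keeping the node lists separate per share, and it never needs to hold any share data itself.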

DHT code is publicly available, and there is even a GitHub repository for it, if I recall. There are versions at least in C++ and Java, that is for sure.

DHT is a “must have” mechanism as far as I can see. There is no better and more universal and more efficient global key discovery mechanism other than DHT, unless you have a reliable and guaranteed “Predefined host” known to all and is always on line.

Alternative solution for GUI design

There is an alternative solution to the GUI design issue, which will reduce the amount of extra checkboxes, edit fields and lists in the main share/folder edit dialog.

You simply provide the “Public Share” button. When that button is clicked, it opens up a dialog that is specifically designed to deal with all the necessary information for the public shares. That way, you do not overload the perception of plain ordinary users, who are interested exclusively in private mode sharing, with unrelated information. So, there is nothing much to change even from the GUI aspect of it.

jpjp · October 18, 2014, 5:40pm

Any chance you could write a couple of use cases?

alberto101 · October 18, 2014, 6:00pm

In case you are talking to me: Use cases of what? Of moving perfectly “on topic posts” to some hole?

Or “proving” the need for a concept of a public share?

I am sorry, but I do not understand the very concept of a “use case”, at least not without it being defined who uses it and for what purpose.

In case you are talking about public shares, the “use case” would be the distribution of any kind of information globally to any users who are interested in it, just like any torrent.

So, the question would reduce to:

What is a “use case” for torrents?

The only difference is that torrents are static and sync is dynamic, meaning that if you are a supplier of information and that information is changed, updated or extended, the sync approach would guarantee the delivery of the “latest and greatest” version, while torrents could not do that. Because torrents, once created, can not be modified, updated and extended.

alberto101 · October 19, 2014, 12:39am

OK, we have done our part in what we consider to be one of the most significant issues related to public shares.

If there are any issues or problems you find in what has been proposed, we would certainly allocate some time for it. But, since we do not do coding nowadays, it has to be implemented and tested by other developers.

Yes, if and when the code will be well documented, who knows what might happen.

One of the most critical issues of concern at this juncture is the issue of “control”. In an open source approach any notion of “control” becomes problematic, as the process is inherently undemocratic in its very nature. Yes, there is a need to come to some solution that is beneficial to the project and does not cause instability or impact the consistency of data across the shares, or performance and so on.

But since programmers are not exactly some “street crowd” and are creative and are logicians, then exerting “control” over them, like with some “crowd”, could impact the natural growth of the project.

It is our opinion that the need for public share functionality is an issue of critical grade. So, considering that the set of issues does not appear to represent some overwhelming technical challenge and could be implemented in a few days, at least in a preliminary version ready to be tested, we hope that this issue will be considered with more attention than it has been afforded to this day.

So, we are done with this thread, unless some specific technical issues will come up.

P.S. What does “Dev” in a black box at the top of the page mean? This thread does not show up in the Support category, so it is not clear how it can be visible to the general public visiting these forums. If this happens to be some “private” place for the devs, this would be highly undesirable in our opinion, at least if there is no access to it from the top level hierarchy of the forums.

Otherwise, it may become an exercise in futility, or some “private club” for the “elite”, which is a bad idea in this case. All the messages posted by this author were written for general public and not for some “elite” or “undesirables”.

Good luck now.

Cyphase · October 19, 2014, 3:08am

If you hover over the Dev tag, you will see:

This category is for topics related to hacking on syncthing: submitting pull requests, configuring development environments, coding conventions, and so forth.

Topics in Dev aren’t treated any differently AFAIK. They appear on the front page; I just came here from the front page, where 3 of the top 5 non-pinned posts are tagged Dev.

alberto101 · October 19, 2014, 3:11am

Hey, thanx,

I just came from there. I hope the Java port will be completed at some point, then we can look at doing some coding. Well, maybe.

hobarrera · October 20, 2014, 5:05am

Regarding the scalability issue mentioned in the first post (referring to Pulse having issues with ~50 nodes), this can sort of be avoided by having clients work as introducer nodes:

Let’s assume we have an already connected and synced client called Bob:

Cap the clients we sync to for a single repo at 50.
When a new client (e.g. Alice) tries to connect, introduce her to 1 (or more) of the previous clients. Let’s assume we introduce her to Bob.
Alice will sync to Bob. Bob is synced to our server, so they have the same data (synchronization is a transitive property).
Changes are pushed to client Bob and he, in turn, will push them to Alice.
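The steps above can be sketched as a small Python model. It assumes every node applies the same cap and that a full node hands newcomers on to its existing peers, breadth first; the Node class and accept method are made-up names for illustration, not anything in syncthing.

```python
from collections import deque
from typing import List


class Node:
    """A node that accepts at most max_peers direct downstream connections."""

    def __init__(self, name: str, max_peers: int = 50) -> None:
        self.name = name
        self.max_peers = max_peers
        self.peers: List["Node"] = []

    def accept(self, newcomer: "Node") -> "Node":
        """Attach newcomer to the nearest node with spare capacity and return that node."""
        queue = deque([self])
        while queue:
            candidate = queue.popleft()
            if len(candidate.peers) < candidate.max_peers:
                candidate.peers.append(newcomer)
                return candidate
            queue.extend(candidate.peers)  # full: introduce to its existing peers
        # Only reached when no node in the tree allows any peers (max_peers of 0).
        raise RuntimeError("no node with spare capacity")
```

For example, with a cap of 2 the server takes Bob and Carol directly, and Alice ends up attached to Bob, so changes reach her through him while the server never holds more than 2 connections.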

I don’t think that this dramatically alters the design of syncthing anyway, we’re just capping a few things. This would help, even for non-public nodes.