Evaluating Idea: Syncthing for Serverless

You can disable the web UI?

Running Syncthing on a server (like Ubuntu Server) does not need, and cannot have, a desktop GUI - of course, it is still possible to connect to the web UI if its port is exposed to the outside.

There is documentation on how to use the REST API: https://docs.syncthing.net/dev/rest.html.

Yes, it is possible to do so using the configuration file.
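For reference, a minimal sketch of the relevant fragment of config.xml, written from memory - attribute names and defaults may differ between Syncthing versions, so verify against your own config:

```xml
<!-- illustrative only: setting enabled="false" on the gui element turns the web UI off -->
<gui enabled="false" tls="false">
    <address>127.0.0.1:8384</address>
</gui>
```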

How are you going to manage it in that case?

On each node, a program written in Go configures and runs Syncthing as an external process, and then communicates with it via the REST API, or simply works with files inside the shared directory.
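To make that concrete, here is a minimal sketch of what such a wrapper could look like - the binary path, home directory, API key, listen address, and flag spellings are assumptions to check against your Syncthing version:

```go
// Sketch of a wrapper that starts Syncthing headless and checks it over the
// REST API. All paths, the API key, and the flags are placeholders.
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

func main() {
	// Run Syncthing as an external process with a dedicated home directory.
	cmd := exec.Command("syncthing", "--no-browser", "--home=/var/lib/node-sync")
	if err := cmd.Start(); err != nil {
		log.Fatalf("starting syncthing: %v", err)
	}

	// Give it a moment to come up, then ping the REST API.
	time.Sleep(5 * time.Second)
	req, _ := http.NewRequest("GET", "http://127.0.0.1:8384/rest/system/ping", nil)
	req.Header.Set("X-API-Key", "PLACEHOLDER-API-KEY") // taken from config.xml in a real setup
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatalf("pinging syncthing: %v", err)
	}
	defer resp.Body.Close()
	log.Println("syncthing REST API status:", resp.Status)

	// From here the wrapper can watch the shared directory, drop data into it, etc.
	_ = cmd.Wait()
}
```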

Currently I am studying IPFS, which seems to be a better fit.

But I like Syncthing much better!

Take Syncthing, I wanna know about a 40k cluster :)


Right, I am still not sure why rsync on a 2hr cron doesn’t cut it?

It is two minutes, not two hours, and I am investigating IPFS currently.

Anyway:

Can rsync help with:

  1. Syncing large files too (10 to a few hundred MB)?
  2. Syncing from 40,000 clients to three or more supervisor servers (without manual load balancing), and keeping those supervisor servers in sync too?
  3. Upgrading the clients themselves by sending them the new version (when they get the new version, they just restart themselves with it)? Or other communication from supervisor nodes to regular nodes?
  4. Doing all that on crappy internet connections? With too many disconnections? And sometimes very low bandwidth? With many nodes on the move, connecting via GPRS modems (or 3G)?
  5. Doing all of this without “creative in-house” reinventing-the-wheel code (which systems like Syncthing provide built-in)?

If the answer is yes, then I have to relearn rsync more deeply.

I don’t think IPFS is really meant as a file sync solution. From what I know, it is missing a lot of features, like encryption, adding files automatically, etc. So you would have to code all that.

I think Syncthing would be a good solution, if you can figure out a good network topology that limits the number of connections per device. But the problem of file sync itself is already solved for you.

  1. Yes
  2. Yes
  3. Rsync has been stable for about the last 10 years; there are barely any new versions ever released.
  4. Rsync uses TCP, same as Syncthing, so I am not sure what advantage Syncthing would have here.
  5. How is setting up a cron job creative and reinventing the wheel?

Meaning of number 3: two-way transmission of data from regular nodes to supervisor nodes and vice versa. I intend to put the upgrades for the nodes themselves inside a shared directory - I am talking about the program that I wrote in Go that runs on those nodes, not about rsync or Syncthing.
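For illustration, a rough sketch of how that self-upgrade could work in the Go node program, assuming the sync layer drops the new binary at a known path - the path, the polling interval, and the Unix-only re-exec are all assumptions:

```go
// Sketch of the self-upgrade idea: poll the synced directory for a newer
// binary and re-exec into it. Paths and interval are placeholders.
package main

import (
	"log"
	"os"
	"syscall"
	"time"
)

// newBinary is where the sync layer is assumed to drop the upgraded agent.
const newBinary = "/srv/shared/upgrades/nodeagent"

func maybeUpgrade() {
	delivered, err := os.Stat(newBinary)
	if err != nil {
		return // nothing delivered yet
	}
	self, err := os.Executable()
	if err != nil || self == newBinary {
		return // already running the delivered binary
	}
	running, err := os.Stat(self)
	if err == nil && !delivered.ModTime().After(running.ModTime()) {
		return // delivered binary is not newer than what is running
	}
	log.Println("new version synced, re-executing into it")
	// Replace the current process with the new binary (Unix only; the Windows
	// nodes would need a spawn-and-exit variant instead).
	if err := syscall.Exec(newBinary, os.Args, os.Environ()); err != nil {
		log.Printf("exec failed: %v", err)
	}
}

func main() {
	for {
		maybeUpgrade()
		time.Sleep(2 * time.Minute) // roughly the sync cadence mentioned above
	}
}
```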

About number 4: Syncthing syncs data in chunks and keeps track of which chunks are already synced. So if a 10 MB file is synced up to its 9th MB and then the internet connection drops, Syncthing will not start from zero.

About number 5: keeping track of which files are synced and which are not is one of the things Syncthing does for you for free. Also, rsync needs an explicit IP address; with Syncthing, the supervisor nodes do not even have to have valid public IPs. New supervisor nodes can be added at any time if needed, and clients will learn about them through the introducer feature, set on the supervisor nodes. And some of the nodes are not Linux-based but Windows nodes.

rsync does not provide any of this, needs an explicit valid IP address or URL, and I do not know about its Windows support. All in all, to me, rsync is a fancy cp - and if that were not true, then why did Syncthing come into existence? And why is Syncthing so popular?

Maybe you should introduce rsync more extensively to the community here and to developers like me who do not know rsync well enough (a blog post or something; that would be awesome!).

Good point!

The problem is not with the devices, since they are meant to connect to only one supervisor node.

The problem is with the supervisor nodes. As far as I understand, that high number of clients would consume a huge amount of resources on the supervisor nodes/servers.

You can work around that by organizing the devices into different tiers. The supervisors are tier 1, each connected to one or more tier 2 servers. The tier 2 servers are connected to the actual devices. There are a bunch of topics about this on the forum:

https://forum.syncthing.net/search?q=topology
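As a sketch of that tiered idea, each data-gathering device could be assigned deterministically to one tier 2 server, for example by hashing its device ID - server names and count here are made up:

```go
// Sketch: deterministically assign each device to one tier 2 server by
// hashing its Syncthing device ID. Names and counts are illustrative.
package main

import (
	"fmt"
	"hash/fnv"
)

var tier2Servers = []string{"t2-a", "t2-b", "t2-c", "t2-d"} // hypothetical tier 2 hosts

func tier2For(deviceID string) string {
	h := fnv.New32a()
	h.Write([]byte(deviceID))
	return tier2Servers[h.Sum32()%uint32(len(tier2Servers))]
}

func main() {
	// Each client then shares its folder only with the server chosen here.
	fmt.Println(tier2For("EXAMPLE-DEVICE-ID"))
}
```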


Late to the party, but I will throw in my two cents.

If I understand correctly, you want to keep files in sync between ~40K computers.

Having every computer connected to every other would probably be too much overhead. Having just a handful of “supervisors” would be almost as bad.

The first solution that came to my mind is to use tiers (as Felix suggested above). However, that would make a subset of computers critical for your file replication setup (e.g. if you lose A, B and C then a whole segment of your computers becomes cut off and out of sync…).

Option B: what about having each computer connect to 100 or so of the others? You could have something simple, such as each node connecting to its neighbours (e.g. n100 is in sync with n050, n051, … n150), or design a fancier topology in order to speed up file sync propagation.
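A rough sketch of that neighbour scheme, just to illustrate the idea - the ring size and window are placeholders:

```go
// Sketch of the "connect to your neighbours" idea: node i shares with the
// 2*k nodes around it in a ring of n nodes. Purely illustrative.
package main

import "fmt"

func neighbours(i, n, k int) []int {
	var out []int
	for d := -k; d <= k; d++ {
		if d == 0 {
			continue
		}
		out = append(out, ((i+d)%n+n)%n) // wrap around the ring
	}
	return out
}

func main() {
	fmt.Println(neighbours(100, 40000, 50)) // n100 syncs with n050 … n150
}
```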

Of course, using Syncthing assumes that you are OK with non-real-time file replication/availability; otherwise you need a shared fs, as mentioned by Audrius.

I have already been convinced that using Syncthing is not a solution - most probably.

Those 10,000 to 40,000 clients do not have to be connected to each other. They are completely separate, both in terms of logic and data. They are some sort of data-gathering devices with low resources, so they cannot act as sub-graph nodes.

The problem is the supervisor nodes, and apparently having 40,000 devices connected to one main Syncthing node is not a good idea.

Rsync is a one-way sync… When installed on both server and client, it is the equivalent of Syncthing doing a scan and updating the changes as it finds them. You can use the --partial flag to keep partially completed transfers (continuing from the 9th MB). It uses a variable block size, so in some cases it is more efficient at transferring files that have had data inserted.

The man page is amazing: https://linux.die.net/man/1/rsync
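As a sketch under those assumptions, the existing Go wrapper on a node could simply shell out to rsync with --partial instead of running Syncthing - the host, paths, and flags beyond --partial are illustrative:

```go
// Sketch: how the Go wrapper on a node could call rsync instead of Syncthing.
// --partial keeps interrupted transfers so they resume instead of restarting;
// the host and paths are made up for the example.
package main

import (
	"log"
	"os/exec"
)

func main() {
	cmd := exec.Command("rsync",
		"-az",                 // archive mode, compress over the slow GPRS link
		"--partial",           // keep partially transferred files for resuming
		"/var/lib/collector/", // local data directory (hypothetical)
		"sync@supervisor1.example:/srv/incoming/node-0042/", // hypothetical target
	)
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Fatalf("rsync failed: %v\n%s", err, out)
	}
}
```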

For the upgrades of the nodes… HTTP hosting would be a better option. Syncthing would be adding a lot of overhead for no gain.


One of the main reasons that Syncthing came to my mind (though, as described in a previous comment of mine, it might not solve this problem) is not having a host (a TCP, HTTP, IP, etc. host).

If there are no hosts, there is no place to be attacked.

Of course, the data is used on another website (for providing analytical reports, notifications, and the like).

But this way (assuming it is possible) there will be no host to attack on the data-gathering side.

Currently I am studying IPFS, which might help with this problem. This does not mean Syncthing has any shortcomings - it is a perfect, working, and hassle-free solution for keeping directories in sync on different machines.

I am not sure if there are other solutions worth studying besides IPFS (torrent? although torrent has some difficulties in my country).

This is genuinely false, unless you read mermaid magazines where network communication is over pixie dust.

I think you should understand what you are dealing with before making decisions and religious sacrifices.

In my case it is true (sorry that I have not provided more information).

First, most of the clients connect via GPRS to the servers on a non-public network. So (at least theoretically) all client nodes are detached from the rest of the world.

It is only servers that should be accessible to this network and to the internet (and our other networks).

IMO, with this architecture I am hiding the server nodes even from our own other servers and services, because they too have to get the data via Syncthing.

Again: I am not going to use Syncthing. Currently those clients send messages to some data-gathering servers via TCP, and that is most likely not going to change - the data-gathering servers and the app servers are placed on two different networks (the details are handled by the DevOps guys) and there are physical firewalls and such.

I confess I could use a more meaningful comment, but anyway, the title of this thread is “Evaluating” an idea.