Index message without BlockInfo

microwavesafe · November 27, 2019, 2:39pm

I have been playing with a proof of concept client written in TypeScript with node. My rough idea was to create a client that can download single files, rather than sync whole directories.

I have the client to the point where it is receiving Index messages from a syncthing server on port 22000. One thing that jumps out as not ideal is the inclusion of the Blocks (array of BlockInfo messages) field in the FileInfo message in the Index message. This list can be very large and cost a fair amount of bandwidth. It makes perfect sense if all files are always synchronized, which I accept is the use case for this protocol, but my client may not even want to download this file, so downloading the BlockInfo is a waste.

Is there a way we could separate the two messages? Maybe a field in the ClusterConfig I send that tells the server not to include them in the Index messages it sends, plus a field in the Request message that asks for the BlockInfo?

This may all be impractical and there’s a fair chance I’ve missed an important use case, but I thought it was worth floating the idea.

AudriusButkevicius · November 27, 2019, 2:48pm

Why don’t you just expose your files on the webserver and download them that way? What value does syncthing add in solving this problem?

You need blockinfo to validate the downloads (by checking the hashes and providing them during requests).

microwavesafe · November 27, 2019, 3:36pm

Syncthing provides an encrypted, authenticated channel to access the files, and I already have it running. Another method requires different servers to maintain and update, as well as different clients to access. It also allows the client to keep a file in sync once downloaded.

Yes you need the BlockInfo to validate the downloads, hence a request for the BlockInfo of a specific file message would be required. You can then get a list of folders and files, request the BlockInfo for a specific file, and use that to download / sync.

This is probably not in keeping with Syncthing’s goals, but I thought it was worth talking about.

AudriusButkevicius · November 27, 2019, 4:09pm

This would be a breaking protocol change, so I don’t think we have a lot of appetite for it. I guess it’s possible to bodge it so that it’s backwards compatible if you really wanted to.

If you know you will only ever want one file, you could just implement delta indexes that discard data that is not for that file, and then provide the sequence offset when reconnecting so you don’t have to redownload the whole index again.

The problem of doing this “in general” is that it adds an extra roundtrip for everything, needlessly.

microwavesafe · November 27, 2019, 4:43pm

It was just an idea, I think caching the block lists locally and using delta indexes is the way to go. It just seemed like a nice way of reducing initial sync data.

calmh · November 27, 2019, 7:36pm

For what it’s worth, files smaller than 32 GiB will always have fewer than 2000 blocks, which comes in at about 96 KiB. Sure, it’s not zero, and with many files it will be a few megabytes of index data, but it’s not huge in the grand scheme of things and it’s only sent once.
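The arithmetic above can be sketched roughly as follows. This is a back-of-the-envelope estimate, not Syncthing's code: the block size ladder (128 KiB doubling up to 16 MiB) and the 2000-block target are assumptions about the large-block scheme, and the 48 bytes per BlockInfo is simply back-derived from "2000 blocks is about 96 KiB". The names `pickBlockSize` and `estimateBlockListBytes` are made up for illustration.

```typescript
const KiB = 1024;
// Assumed block size ladder: 128 KiB, 256 KiB, ..., 16 MiB.
const BLOCK_SIZES: number[] = Array.from({ length: 8 }, (_, i) => (128 * KiB) << i);
// Assumed target: pick the smallest block size that keeps a file under 2000 blocks.
const DESIRED_BLOCKS_PER_FILE = 2000;
// Assumption: about 48 bytes per encoded BlockInfo (96,000 bytes / 2000 blocks).
const BYTES_PER_BLOCK_INFO = 48;

// Walks the ladder and stops at the first size that keeps the file under the
// target; falls through to the largest size for very big files.
function pickBlockSize(fileSize: number): number {
  let chosen = 128 * KiB;
  for (const size of BLOCK_SIZES) {
    chosen = size;
    if (fileSize < DESIRED_BLOCKS_PER_FILE * size) break;
  }
  return chosen;
}

// Estimates how many blocks a file has and how large its block list is.
function estimateBlockListBytes(fileSize: number): { blocks: number; bytes: number } {
  const blocks = Math.ceil(fileSize / pickBlockSize(fileSize));
  return { blocks, bytes: blocks * BYTES_PER_BLOCK_INFO };
}
```

Under these assumed constants the block count only stays below 2000 up to 2000 × 16 MiB (about 31.25 GiB), so the "32 GiB" figure reads as a round number; past that point the count grows linearly with file size.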

microwavesafe · November 28, 2019, 12:30pm

These seem like more reasonable numbers. This train of thought started when I was reading the BEP protocol pages (Block Exchange Protocol v1 — Syncthing documentation)

The Syncthing implementation imposes a hard limit of 500,000,000 bytes on all messages. Attempting to send or receive a larger message will result in a connection close. This size was chosen to accommodate Index messages containing a large block list.

Are there instances that really reach this size of Index message? I was initially planning to keep the block list in RAM and store it locally for reading on start up, but keeping 500 MB in RAM is a bit much.
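For reference, a client can enforce the quoted limit while framing incoming data, before buffering anything large. The sketch below assumes the post-hello BEP framing of a 2-byte big-endian header length, the header, a 4-byte big-endian message length, then the message. `tryReadFrame` and the `Frame` shape are hypothetical names, the lengths are read as unsigned for simplicity, and protobuf decoding and LZ4 decompression are left out.

```typescript
// Hard limit quoted from the BEP documentation.
const MAX_MESSAGE_BYTES = 500000000;

interface Frame {
  header: Uint8Array;    // still protobuf-encoded, not decoded here
  message: Uint8Array;   // still protobuf-encoded and possibly LZ4-compressed
  bytesConsumed: number; // how much of the input this frame used
}

// Returns null when the buffer does not yet contain a complete frame,
// and throws when the announced message length exceeds the limit.
function tryReadFrame(buf: Uint8Array): Frame | null {
  if (buf.byteLength < 2) return null;
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  const headerLen = view.getUint16(0); // DataView reads big-endian by default
  const msgLenOffset = 2 + headerLen;
  if (buf.byteLength < msgLenOffset + 4) return null;
  const messageLen = view.getUint32(msgLenOffset);
  if (messageLen > MAX_MESSAGE_BYTES) {
    throw new Error(`message of ${messageLen} bytes exceeds limit, closing connection`);
  }
  const end = msgLenOffset + 4 + messageLen;
  if (buf.byteLength < end) return null;
  return {
    header: buf.subarray(2, msgLenOffset),
    message: buf.subarray(msgLenOffset + 4, end),
    bytesConsumed: end,
  };
}
```

Checking the length before waiting for the body means an oversized announcement is rejected without ever allocating space for it.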

AudriusButkevicius · November 28, 2019, 2:09pm

This rule predates large blocks. Nowadays you’ll have around 2000 blocks per file. You could still have millions of files, which might result in a lot of RAM.

calmh · November 28, 2019, 6:54pm

Still, in practice, no because we chop that list into small batch updates. Messages aren’t generally larger than a megabyte or so.

AudriusButkevicius · November 28, 2019, 7:32pm

Sure, I guess I was referring to the index as a whole potentially still being big, if the OP is storing it.

calmh · November 28, 2019, 7:39pm

Yeah I was really replying to the

but didn’t point that out very clearly. Keeping all the index in RAM can indeed use a lot of memory. Individual messages should be bite sized, although the spec doesn’t specifically mandate it.

microwavesafe · November 29, 2019, 8:58am

Individual messages should be bite sized

Does this mean you will get multiple Index messages (including headers etc)? Or do you mean you will get multiple packets that include the rest of the LZ4 block?

Multiple Index messages would be good so I can easily buffer the compressed data in RAM, before decompressing and adding to my Index cache. I’m currently thinking I might put the Index cache in an SQLite database. It seems to fit the criteria nicely and is available on all platforms, including mobile.

I have done NativeScript apps before so I’m trying to not include anything that is obviously not available on mobile. Just in case I feel the need to do an app.

imsodin · November 29, 2019, 9:06am

Definitely multiple index messages: Both if there is a huge change, which is split up into multiple messages (if I remember correctly it’s 1000 items per message), and if changes happen with some time in between. The first index may be an actual index message, any following messages will be index updates. I think the example in the BEP illustrates this quite nicely: Block Exchange Protocol v1 — Syncthing documentation
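On the receiving side this means the cache can treat every Index or Index Update message as just another small batch to merge in. A minimal in-memory sketch, assuming already-decoded messages; `FileInfoLite`, `IndexBatch` and `IndexCache` are hypothetical shapes, and `sequence` is a plain number here even though the wire type is a 64-bit integer.

```typescript
// Trimmed-down stand-ins for decoded BEP messages (hypothetical shapes).
interface FileInfoLite {
  name: string;
  sequence: number; // int64 on the wire; a number is used here for brevity
  deleted: boolean;
}

interface IndexBatch {
  folder: string;
  files: FileInfoLite[];
}

class IndexCache {
  private folders = new Map<string, Map<string, FileInfoLite>>();
  private maxSeq = new Map<string, number>();

  // Merges one Index or Index Update batch: newer entries replace older
  // ones by name, and the highest sequence seen per folder is remembered.
  apply(batch: IndexBatch): void {
    let files = this.folders.get(batch.folder);
    if (files === undefined) {
      files = new Map<string, FileInfoLite>();
      this.folders.set(batch.folder, files);
    }
    let highest = this.maxSeq.get(batch.folder) ?? 0;
    for (const file of batch.files) {
      files.set(file.name, file);
      if (file.sequence > highest) highest = file.sequence;
    }
    this.maxSeq.set(batch.folder, highest);
  }

  fileCount(folder: string): number {
    return this.folders.get(folder)?.size ?? 0;
  }

  // The value a client could later report back to ask for only newer entries.
  maxSequence(folder: string): number {
    return this.maxSeq.get(folder) ?? 0;
  }
}
```

The same `apply` logic would carry over to an SQLite-backed cache, with an upsert per file in place of the `Map` writes.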

microwavesafe · November 29, 2019, 9:10am

I think the example in the BEP illustrates this quite nicely

I had read that, but to me it suggested the whole index is sent first, then updates. It’s good to have clarification. Now I don’t have to try and deal with massive initial Index messages and can rely on Index Update messages to fill in the index.

imsodin · November 29, 2019, 9:16am

The example is probably from before delta indexes existed. Back then, an initial full index was required (as Jakob wrote below, this is misleading: this initial full index is still split up into multiple, small messages). Now it isn’t required in general, but depends on the info included in the cluster config message. The explanation is in https://docs.syncthing.net/specs/bep-v1.html#delta-index-exchange

calmh · November 29, 2019, 9:24am

Yeah that part isn’t documented very well. But in no case do we send a very large index message, it’s always split up into a bunch of smaller messages.

microwavesafe · November 29, 2019, 10:36am

I’ve been trying to get my head around the delta indexing. The text says this.

Index ID: Each folder has an Index ID. This is a 64 bit random identifier set at index creation time.

But the message breakdown lists this field as a device property, NOT a folder property.

message Device {
    int64           max_sequence               = 6;
    uint64          index_id                   = 8;
}

The actual message data matches the message breakdown, not the text.

imsodin · November 29, 2019, 10:42am

That’s admittedly a bit confusing: As part of the folder, we send a list of devices we share that folder with, including ourselves. So if you’re interested in the index id of the device you are connected to, you find it in the respective folder under their device id.
For sending indexes, you also need to know what the connected device knows about you. For that you look up the device entry with your own device ID, and get the index ID and max sequence number from which you need to start sending indexes (for you that’s probably irrelevant, as you don’t send anything (?)).
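In code, that lookup could look something like the sketch below. The interfaces are hypothetical, simplified versions of the decoded Folder and Device messages: device IDs are bytes on the wire but plain strings here, and the uint64 index ID is kept as a string because it does not fit a JavaScript number.

```typescript
// Simplified, hypothetical shapes for a decoded Cluster Config folder entry.
interface DeviceEntry {
  id: string;          // device ID (bytes on the wire)
  indexId: string;     // uint64 on the wire, kept as a string
  maxSequence: number; // int64 on the wire
}

interface FolderEntry {
  id: string;
  devices: DeviceEntry[];
}

// From the peer's Cluster Config: `peer` describes the peer's own index for
// this folder, `me` describes what the peer has recorded about our index.
function findDeviceStates(
  folder: FolderEntry,
  myDeviceId: string,
  peerDeviceId: string,
): { peer?: DeviceEntry; me?: DeviceEntry } {
  return {
    peer: folder.devices.find((d) => d.id === peerDeviceId),
    me: folder.devices.find((d) => d.id === myDeviceId),
  };
}
```

Either entry can be missing, for example when the peer has never seen an index from us, so callers need to handle `undefined`.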

microwavesafe · November 29, 2019, 10:57am

Thanks for that. It is starting to take shape. At the moment I don’t send any indexes as my code doesn’t go that far.

So I get the recorded index ID and max sequence number of myself from the connected device. From this I can check that the index ID is consistent and how many updates the connected device is behind me. I can then send the file / folder updates with sequence numbers between the max sequence sent from the connected device and my own internal record of the max sequence number.

I then provide an index id and max sequence number in my sent cluster config message to reverse the procedure, so the connected device only sends me the delta index, not the whole lot.

Is this correct?

imsodin · November 29, 2019, 1:02pm

Yep.
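The procedure confirmed above can be sketched as a small decision function. This is an illustration rather than Syncthing's implementation: `planIndexSend` and its types are made-up names, index IDs are kept as strings because uint64 does not fit a JavaScript number, and falling back to a full index when the peer claims a higher sequence than we have is an assumption.

```typescript
interface IndexState {
  indexId: string;     // uint64 on the wire, kept as a string
  maxSequence: number; // int64 on the wire
}

type IndexPlan =
  | { kind: "full" }                           // send everything
  | { kind: "delta"; afterSequence: number };  // send only newer entries

// `local` is our own index state for the folder; `peerView` is what the
// connected device reported about us in its Cluster Config (if anything).
function planIndexSend(local: IndexState, peerView: IndexState | undefined): IndexPlan {
  // Unknown to the peer, or the peer remembers a different index: start over.
  if (peerView === undefined || peerView.indexId !== local.indexId) {
    return { kind: "full" };
  }
  // Assumption: a peer claiming more than we have is out of sync, so resend.
  if (peerView.maxSequence > local.maxSequence) {
    return { kind: "full" };
  }
  // Otherwise send only entries with a sequence above what the peer has.
  return { kind: "delta", afterSequence: peerView.maxSequence };
}
```

The reverse direction is symmetric: reporting our recorded index ID and max sequence for the peer in our own Cluster Config lets the peer make the same decision and send only the delta.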