Future stability of the protocol

I think it’s exactly what @calmh was talking about. You don’t need to specify it, and it still resolves to something that has all the attributes, just at their nil/empty Go values.

Why approximately?

One thing we are still missing is max: 256 etc.

I guess we could work around that by limiting the message size to 4/8/64 MB, whatever, period?

Approximately because I hadn’t taken a close look and was hand waving. :slight_smile: As long as it generates slices as I like them, I’m happy. The rest is probably fine. I would like to use proto3 instead of proto2, but I would guess that’s just a tweak. The rest should then be a matter of having a bep.proto in lib/protocol and generating the structs from that, removing the struct definitions we already have, and adding our extra methods as usual. Then changing every MarshalXDR to proto.Marshal etc. I think we should look into this fairly soon after v0.13.
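
At the call sites it would be roughly this kind of mechanical change (a sketch, assuming the gogo proto package and that the types generated from bep.proto satisfy proto.Message):

```go
package protocol

import "github.com/gogo/protobuf/proto"

// Sketch only: msg would be any struct generated from bep.proto by gogoproto,
// which satisfies the proto.Message interface.

func marshalMessage(msg proto.Message) ([]byte, error) {
	// Previously something like: return msg.MarshalXDR()
	return proto.Marshal(msg)
}

func unmarshalMessage(data []byte, msg proto.Message) error {
	// Previously something like: return msg.UnmarshalXDR(data)
	return proto.Unmarshal(data, msg)
}
```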

I think we should keep our current framing, i.e. the four byte message header with the message type and compression bits, possibly followed by the compression framing (uncompressed length). We could also define a protobuf top-level message type with these bits in it, but we need to frame it with at least a length anyway, so I’m not sure it’s worth it? We should perhaps research some prior art on this and see what other people do, and whether they have good reasoning behind it.
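
To make that concrete, something like this is what I have in mind - the exact header layout here is illustrative, not a spec:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// frameMessage sketches the framing idea: a small fixed header carrying the
// message type and a compression flag, then a length, then the opaque payload
// (which would be the protobuf-marshalled message). The header layout here is
// made up for illustration, not the actual BEP header.
func frameMessage(msgType byte, compressed bool, payload []byte) []byte {
	var flags byte
	if compressed {
		flags |= 0x01
	}
	header := []byte{0x00, 0x00, msgType, flags} // four byte message header

	var length [4]byte
	binary.BigEndian.PutUint32(length[:], uint32(len(payload)))

	frame := make([]byte, 0, len(header)+len(length)+len(payload))
	frame = append(frame, header...)
	frame = append(frame, length[:]...)
	return append(frame, payload...)
}

func main() {
	fmt.Printf("% x\n", frameMessage(0x03, false, []byte("marshalled message")))
}
```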

Yes. I’m still slightly worried about what the protobuf unmarshaller does when it gets a message saying “Here’s a string. It’s 2^60 bytes large.” If it actually tries to allocate 2^60 bytes of memory for that string, that would be a bad thing. I can live with that after authentication - it’s data from a client you’ve authenticated and trust, so if they do bad things you can complain to the owner and remove them from the config. However for the Hello message we need to handle this correctly - by not using protobuf, if there is no way to limit this appropriately.

But maybe it actually only allocates as it reads, in small chunks or something, and then we’re fine by just limiting the total message size. We’ll have to look into the details.

The hello message is already handled, by allocating a buffer of 1024 bytes, reading at most 1024 bytes, and then trying to unmarshal.
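
In sketch form, it amounts to this (illustrative names, not the exact code):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// readHelloBuffer illustrates the bounded approach: allocate a fixed 1024-byte
// buffer, read at most that much off the connection, and hand only those bytes
// to the unmarshaller afterwards. Whatever lengths the message itself claims,
// we never look past this buffer. (Sketch only, not the actual implementation.)
func readHelloBuffer(r io.Reader) ([]byte, error) {
	buf := make([]byte, 1024)
	n, err := r.Read(buf)
	if err != nil && err != io.EOF {
		return nil, err
	}
	return buf[:n], nil
	// The caller then does proto.Unmarshal on these at-most-1024 bytes.
}

func main() {
	conn := bytes.NewReader([]byte("pretend this is a marshalled Hello"))
	buf, err := readHelloBuffer(conn)
	fmt.Println(len(buf), err)
}
```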

proto3 is still in beta, and I am not sure what specific features you are after in proto3 that proto2 does not have?

That can still be a problem if the message is malformed and claims the client name string is huge, for example. Depending on what the unmarshaller does.

Hmm, primarily it seemed cleaner (no optional/required stuff, basic fields are non-pointers by default, but you already handled that). I think it packs things like []int64 more efficiently, although we don’t have many of those. I don’t know it well enough to know more. It probably doesn’t matter much.

Also, it’s a higher number. Higher numbers are always better. :wink:

It doesn’t matter; the buffer it unmarshals from will still never be bigger than 1024 bytes, as that’s what we have read off the socket.

I don’t think this is related to the spec, but rather the tool that generates the code, and gogoproto seems to have the options already.

Yeah, but there is no official compiler in most distributions, so everyone would have to compile it from source, but I can live with that.

Still, though, if it sees a string of length $huge and does a make([]byte, $huge), that will be a problem regardless of our buffer size. It’s just something we need to check, that’s all.

Re proto2/proto3, let’s go with whatever you prefer - I guess the most compatible, at the moment.

It doesn’t do that. As it has the buffer already available, it checks that the string length doesn’t exceed the buffer size. Limiting the message size will be sufficient for our needs. Gogofaster looks good.
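
I.e. roughly this kind of check - a sketch of the principle, not the actual gogo code:

```go
package main

import (
	"errors"
	"fmt"
)

var errTruncated = errors.New("claimed length exceeds buffer")

// readString sketches the kind of check a buffer-based decoder can make: the
// declared length is validated against the bytes actually read before anything
// is allocated, so a lying "this string is huge" field can't blow up memory.
func readString(buf []byte, offset, length int) (string, error) {
	if length < 0 || length > len(buf)-offset {
		return "", errTruncated
	}
	return string(buf[offset : offset+length]), nil
}

func main() {
	buf := []byte("hello, world")

	s, err := readString(buf, 0, 5)
	fmt.Println(s, err) // hello <nil>

	_, err = readString(buf, 0, 1<<30) // claims a huge length
	fmt.Println(err)                   // claimed length exceeds buffer
}
```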

One option is also CBOR (which is essentially binary JSON). You could trivially specify that e.g. the syncthing binary format consists of a list of (type-tag, content) pairs [length being implicit in the octet string]. The CBOR encoding of that would be relatively efficient.

(Not ‘optimized to the last bit’ efficient, but probably good enough and relatively simple to explain too.)

http://cbor.io
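
For example, a single (type-tag, content) pair hand-encoded in CBOR would look something like this (sketch only, just to show how compact the framing is):

```go
package main

import "fmt"

// encodePair hand-encodes a CBOR array of [type-tag, content] to show what the
// proposed framing would look like on the wire. Sketch only; a real
// implementation would use a CBOR library and handle larger values.
func encodePair(typeTag uint8, content []byte) []byte {
	if typeTag > 23 || len(content) > 255 {
		panic("sketch only handles small tags and short payloads")
	}
	out := []byte{
		0x82,    // major type 4 (array), length 2
		typeTag, // major type 0 (unsigned int), values 0..23 fit in one byte
	}
	if len(content) < 24 {
		out = append(out, 0x40|byte(len(content))) // major type 2 (byte string), short length
	} else {
		out = append(out, 0x58, byte(len(content))) // byte string, one-byte length follows
	}
	return append(out, content...)
}

func main() {
	fmt.Printf("% x\n", encodePair(3, []byte("payload")))
}
```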

But luckily this doesn’t really apply to us, as we don’t expose any of the internal structs.

So I am close to finishing the schema.

A few issues along the way:

  1. No “noencode” feature
  2. Removing a few dead pieces (Priority)
  3. Tempted to convert the current Flags (Introducer et al) to booleans, yet I don’t think these will be packed efficiently. For ClusterConfig maybe it’s not a problem, as it’s not sent often, but for a file, changing from a bit to 2 bytes to indicate deleted might be a bit too much?

I think this would encode structs as maps, meaning we repeat our field names everywhere, potentially thousands of times per message. If you’re saying that we could sidestep that by encoding tags by alternating integers and the actual values, that would work but seems contrary to the intention of the format. Someone trying to parse it in Python and expecting “standard” CBOR would wonder what the fuck we’re doing. :slight_smile:

I think we could handle this by not having the field in the schema, then embedding that struct in something else that does.
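
In sketch form, with made-up names:

```go
package protocol

// FileInfo stands in for a struct generated from the schema; it only has the
// fields that go on the wire. (Hypothetical, trimmed-down names.)
type FileInfo struct {
	Name string
	Size int64
}

// fileInfoWithLocal embeds the generated struct and adds the field we would
// previously have marked "noencode": it lives in memory only and is never
// marshalled, simply because it isn't part of the schema.
type fileInfoWithLocal struct {
	FileInfo
	localOnly uint32
}
```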

Let’s consider what makes the most sense case by case. Permission bits are permission bits, but other bools should probably be bools, I think. They’ll cost two to three bytes (if we run out of low tag numbers), but we’ll also save a few bytes compared to now in a lot of places due to the integer encoding. And our indexes are lz4 compressed. So I think going for the cleanest schema is better.
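
For the record, here’s where the two to three bytes come from (a hand-rolled illustration of the protobuf wire format, not our code):

```go
package main

import "fmt"

func main() {
	// Protobuf encodes a bool as a varint key followed by a varint value.
	// Key = (fieldNumber << 3) | wireType; for field numbers 1..15 the key
	// fits in one byte, so a "true" bool costs two bytes on the wire.
	const fieldNumber = 7
	const wireTypeVarint = 0

	key := byte(fieldNumber<<3 | wireTypeVarint)
	encoded := []byte{key, 0x01} // 0x01 == true
	fmt.Printf("% x (%d bytes)\n", encoded, len(encoded))

	// Field numbers 16 and up need a two-byte varint key, which is where the
	// third byte comes from if we run out of low tag numbers.
}
```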

I’m late here and probably a bit out of my depth, but what about a simple TLV protocol? It seems like it meets most of your criteria.

If someone has already suggested this, then just ignore me. :slight_smile:

TLV would be the header, but we need flexibility in the value bits.

Protobuf is essentially TLV per field, without inventing it ourselves, in a way people are used to.