Currently, sub-messages are always encoded with the length-delimited wire type, which means that in the scenario
```proto
message A {
  B b = 1;
  // other things
}
message B {
  // other things
}
```
we write this as:
```
[field 1, length-prefixed]
[length of b]
[payload of b]
```
This is effective, but it has the problem that before we can start writing the payload of `b`, we need to know the length of `b`; in particular, it requires us to either:

- serialize `b` depth-first into a temporary buffer, so that the length is known before the header is written (and then copy that buffer to the real output), or
- compute the length of `b` ahead of the actual serialization†.
This starts to get particularly gnarly for deep models, where A->B->C->D->E (or whatever; perhaps recursive, perhaps lists), forcing multiple buffer operations or length computations.
† In some API implementations, this may be simple because the lengths are computed while building the data model, but this is not universal between platforms and libraries.
When handling nested deep/recursive models, there are some internal optimizations that can be done here so that the length of E isn't computed multiple times, but fundamentally this is a design that is not hugely compatible with zero-copy, low-latency output streaming.
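To make the problem concrete, here is a minimal sketch (in Python, not any real protobuf library's API) of the buffered, depth-first approach that the length prefix forces; `serialize_payload` stands in for whatever writes the body of the sub-message:

```python
from io import BytesIO

def write_varint(out, value: int) -> None:
    # Standard protobuf base-128 varint encoding.
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.write(bytes([byte | 0x80]))
        else:
            out.write(bytes([byte]))
            return

def write_tag(out, field_number: int, wire_type: int) -> None:
    # A protobuf tag is (field_number << 3) | wire_type, varint-encoded.
    write_varint(out, (field_number << 3) | wire_type)

def write_length_prefixed(out, field_number: int, serialize_payload) -> None:
    # We cannot emit the header until we know the payload length, so we
    # serialize the child into a temporary buffer first...
    buffer = BytesIO()
    serialize_payload(buffer)            # may itself recurse and buffer again
    payload = buffer.getvalue()
    write_tag(out, field_number, 2)      # wire type 2: length-delimited
    write_varint(out, len(payload))
    # ...and then copy it into the real output.
    out.write(payload)
```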
## Proposal

Historically, there was another encoding variant available, which was used by the "groups" metaphor in the proto DSL. The "groups" metaphor has been removed from proto3 - which is fine, and I do not propose that we revisit it. However, the raw encoding used by groups may still have utility.
Suppose we had an opt-in mechanism in the .proto schema to enable this mode, meaning that the serializer would use the group encoding (while still treating it as a message in all other respects):
we write this as:

```
[field 1, start-group]
[payload of b]
[field 1, end-group]
```
Here, the "start group" and "end group" tokens are used as sentinels around the payload. This means we can now write in a purely forwards-way, with zero buffering and zero payload length pre-calculations, and true zero-copy writes in suitable scenarios. In terms of implementation impact, this comes down to:
- serialization: the code for all of this should already exist in all implementations, precisely because it existed for groups historically (see the sketch below).
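By contrast with the buffered writer above, the group-encoded writer is trivially forward-only; a sketch under the same assumptions, reusing `write_tag` from the earlier sketch:

```python
def write_group_encoded(out, field_number: int, serialize_payload) -> None:
    # Sentinel tags bracket the payload, so everything streams forwards:
    # no temporary buffer, no length computation, no copy.
    write_tag(out, field_number, 3)      # wire type 3: start group
    serialize_payload(out)               # writes directly to the real stream
    write_tag(out, field_number, 4)      # wire type 4: end group
```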
It is my view that this should significantly help serialization performance, especially for large/deep models. Even when an overall payload length is required (for example, as an http content-length header), this should still help, by simplifying the work required in that operation and limiting us to a single "write to a buffer, write the buffer length, copy the buffer" or "write to a null stream and count what we would have written, write the length, write to the real stream" step (depending on the implementation). We no longer have the problem of deep/recursive/repeated models forcing multiple length computations.
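As a sketch of the counting variant (hypothetical names; `serialize_root` stands in for whatever serializes the root message, and `write_group_encoded` is the sketch above):

```python
class CountingStream:
    # Pretends to be a writable stream, but only counts bytes.
    def __init__(self) -> None:
        self.count = 0
    def write(self, data: bytes) -> None:
        self.count += len(data)

def serialize_root(out) -> None:
    # Hypothetical root document: one group-encoded sub-message at field 1.
    write_group_encoded(out, 1, lambda o: o.write(b"example payload"))

counter = CountingStream()
serialize_root(counter)          # pass 1: cheap count, no buffers allocated
content_length = counter.count   # e.g. emit as the http content-length header

real = BytesIO()                 # stand-in for the actual network stream
serialize_root(real)             # pass 2: a single forward-only write
assert len(real.getvalue()) == content_length
```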
## Emphasis

I do not propose to resurrect the groups concept in terms of the object model; purely the encoding. This would also be strictly opt-in, presumably as a custom option or new DSL feature. Note that it is also perfectly possible for a deserializer to accept either encoding, if desired (similar to how a primitive list/array deserializer may allow both packed and non-packed encodings).
A minor payload-size difference exists: since the field number is now written twice, large-magnitude field numbers on sub-messages become marginally more expensive. As a counterpoint, large sub-message payloads previously paid a few extra bytes for the length prefix, which no longer applies. In the best case (small field number and small payload), the two encodings are identical, each costing 2 bytes of overhead in total.
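To make that arithmetic concrete, a small sketch comparing the per-sub-message overhead of the two encodings (the wire-type bits occupy the low 3 bits of the tag and never push it across a varint size boundary, so we can ignore them here):

```python
def varint_size(value: int) -> int:
    # Number of bytes in the varint encoding of a non-negative integer.
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size

def overhead(field_number: int, payload_length: int) -> tuple[int, int]:
    tag = field_number << 3
    length_prefixed = varint_size(tag) + varint_size(payload_length)
    grouped = 2 * varint_size(tag)   # start tag + end tag, no length prefix
    return length_prefixed, grouped

print(overhead(1, 10))      # (2, 2) - identical in the best case
print(overhead(1, 1024))    # (3, 2) - group encoding wins for large payloads
print(overhead(100, 10))    # (3, 4) - length prefix wins for big field numbers
```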
One complication is that it makes "skip unexpected field" a little more expensive than simply jumping a number of bytes (it requires parsing the sub-headers until the matching end-group token), but not hugely so; and this code should already exist in all implementations anyway, since groups undeniably exist. Perhaps with that in mind, this is effectively "optimize for read" (default) vs "optimize for write" (opt-in).
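A sketch of that skip logic over a raw byte buffer (hypothetical helper names; real implementations read from their own stream abstractions):

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    # Returns (value, new_position).
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def skip_field(buf: bytes, pos: int, tag: int) -> int:
    # Skips the field whose tag was just read; returns the new position.
    wire_type = tag & 7
    if wire_type == 0:                   # varint
        _, pos = read_varint(buf, pos)
    elif wire_type == 1:                 # fixed64
        pos += 8
    elif wire_type == 2:                 # length-delimited: one cheap jump
        length, pos = read_varint(buf, pos)
        pos += length
    elif wire_type == 3:                 # start group: walk the sub-headers
        while True:
            inner, pos = read_varint(buf, pos)
            if inner & 7 == 4:           # end group: field numbers must match
                if inner >> 3 != tag >> 3:
                    raise ValueError("mismatched end-group tag")
                break
            pos = skip_field(buf, pos, inner)
    elif wire_type == 5:                 # fixed32
        pos += 4
    else:
        raise ValueError(f"cannot skip wire type {wire_type}")
    return pos
```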
I welcome thoughts.