Currently, sub-messages are always encoded with the length-delimited wire type, which means that in the scenario
```proto
message A {
  B b = 1;
  // other things
}
message B {
  // other things
}
```
we write this as:
```
[field 1, length-prefixed]
[length of b]
[payload of b]
```
This is effective, but it has the problem that before we can start writing the payload of `b`, we need to know the length of `b`; in particular, it requires us to either:

- serialize `b` depth-first into a temporary buffer, so that the length is known before the header is written (and then copy that buffer to the real output), or
- compute the length of `b` ahead of the actual serialization†.
This starts to get particularly gnarly for deep models, where A->B->C->D->E (or whatever; perhaps recursive, perhaps lists), forcing multiple buffer operations or length computations.
† In some API implementations, this may be simple because the lengths are computed while building the data model, but this is not universal between platforms and libraries.
When handling nested deep/recursive models, there are some internal optimizations that can be done here so that the length of E isn't computed multiple times, but fundamentally this is a design that is not hugely compatible with zero-copy, low-latency output streaming.
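To make the problem concrete, here is a minimal sketch (in Python, not any real protobuf library's API) of the buffered, depth-first approach that the length prefix forces; `serialize_payload` stands in for whatever writes the body of the sub-message:

```python
from io import BytesIO

def write_varint(out, value: int) -> None:
    # Standard protobuf base-128 varint encoding.
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.write(bytes([byte | 0x80]))
        else:
            out.write(bytes([byte]))
            return

def write_tag(out, field_number: int, wire_type: int) -> None:
    # A protobuf tag is (field_number << 3) | wire_type, varint-encoded.
    write_varint(out, (field_number << 3) | wire_type)

def write_length_prefixed(out, field_number: int, serialize_payload) -> None:
    # We cannot emit the header until we know the payload length, so we
    # serialize the child into a temporary buffer first...
    buffer = BytesIO()
    serialize_payload(buffer)            # may itself recurse and buffer again
    payload = buffer.getvalue()
    write_tag(out, field_number, 2)      # wire type 2: length-delimited
    write_varint(out, len(payload))
    # ...and then copy it into the real output.
    out.write(payload)
```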
## Proposal

Historically, there was another encoding variant available, which was used by the "groups" metaphor in the proto DSL. The "groups" metaphor has been removed from proto3 - which is fine, and I do not propose that we revisit it. However, the raw encoding used by groups may still have utility.
Suppose we had an opt-in mechanism in the .proto schema to enable this mode, meaning that the serializer would use the group encoding (while still treating it as a message in all other respects):
we write this as:

```
[field 1, start-group]
[payload of b]
[field 1, end-group]
```
Here, the "start group" and "end group" tokens are used as sentinels around the payload. This means we can now write in a purely forwards-way, with zero buffering and zero payload length pre-calculations, and true zero-copy writes in suitable scenarios. In terms of implementation impact, this comes down to:
- serialization: the code for all of this should already exist in all implementations, precisely because it existed for groups historically (see the sketch below).
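By contrast with the buffered writer above, the group-encoded writer is trivially forward-only; a sketch under the same assumptions, reusing `write_tag` from the earlier sketch:

```python
def write_group_encoded(out, field_number: int, serialize_payload) -> None:
    # Sentinel tags bracket the payload, so everything streams forwards:
    # no temporary buffer, no length computation, no copy.
    write_tag(out, field_number, 3)      # wire type 3: start group
    serialize_payload(out)               # writes directly to the real stream
    write_tag(out, field_number, 4)      # wire type 4: end group
```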
It is my view that this should significantly help serialization performance, especially for large/deep models. Even when an overall payload length is required (for example, as an http content-length header), this should still help, by simplifying the work required in that operation and limiting us to a single "write to a buffer, write the buffer length, copy the buffer" or "write to a null stream and count what we would have written, write the length, write to the real stream" step (depending on the implementation). We no longer have the problem of deep/recursive/repeated models forcing multiple length computations.
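As a sketch of the counting variant (hypothetical names; `serialize_root` stands in for whatever serializes the root message, and `write_group_encoded` is the sketch above):

```python
class CountingStream:
    # Pretends to be a writable stream, but only counts bytes.
    def __init__(self) -> None:
        self.count = 0
    def write(self, data: bytes) -> None:
        self.count += len(data)

def serialize_root(out) -> None:
    # Hypothetical root document: one group-encoded sub-message at field 1.
    write_group_encoded(out, 1, lambda o: o.write(b"example payload"))

counter = CountingStream()
serialize_root(counter)          # pass 1: cheap count, no buffers allocated
content_length = counter.count   # e.g. emit as the http content-length header

real = BytesIO()                 # stand-in for the actual network stream
serialize_root(real)             # pass 2: a single forward-only write
assert len(real.getvalue()) == content_length
```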
## Emphasis

I do not propose to resurrect the groups concept in terms of the object model; purely the encoding. This would also be strictly opt-in, presumably as a custom option or new DSL feature. Note that it is also perfectly possible for a deserializer to accept either encoding, if desired (similar to how a primitive list/array deserializer may allow both packed and non-packed encodings).
A minor payload-size difference exists: since the field number is now written twice, large-magnitude field numbers on sub-messages become marginally more expensive. As a counterpoint, large sub-message payloads previously paid a few extra bytes for the length prefix, which no longer applies. In the best case (small field number and small payload), the two encodings are identical, each costing 2 bytes of overhead in total.
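To make that arithmetic concrete, a small sketch comparing the per-sub-message overhead of the two encodings (the wire-type bits occupy the low 3 bits of the tag and never push it across a varint size boundary, so we can ignore them here):

```python
def varint_size(value: int) -> int:
    # Number of bytes in the varint encoding of a non-negative integer.
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size

def overhead(field_number: int, payload_length: int) -> tuple[int, int]:
    tag = field_number << 3
    length_prefixed = varint_size(tag) + varint_size(payload_length)
    grouped = 2 * varint_size(tag)   # start tag + end tag, no length prefix
    return length_prefixed, grouped

print(overhead(1, 10))      # (2, 2) - identical in the best case
print(overhead(1, 1024))    # (3, 2) - group encoding wins for large payloads
print(overhead(100, 10))    # (3, 4) - length prefix wins for big field numbers
```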
One complication is that it makes "skip unexpected field" a little more expensive than simply jumping a number of bytes (it requires parsing the sub-headers until the matching end-group token), but not hugely so; and this code should already exist in all implementations anyway, since groups undeniably exist. Perhaps with that in mind, this is effectively "optimize for read" (default) vs "optimize for write" (opt-in).
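A sketch of that skip logic over a raw byte buffer (hypothetical helper names; real implementations read from their own stream abstractions):

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    # Returns (value, new_position).
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def skip_field(buf: bytes, pos: int, tag: int) -> int:
    # Skips the field whose tag was just read; returns the new position.
    wire_type = tag & 7
    if wire_type == 0:                   # varint
        _, pos = read_varint(buf, pos)
    elif wire_type == 1:                 # fixed64
        pos += 8
    elif wire_type == 2:                 # length-delimited: one cheap jump
        length, pos = read_varint(buf, pos)
        pos += length
    elif wire_type == 3:                 # start group: walk the sub-headers
        while True:
            inner, pos = read_varint(buf, pos)
            if inner & 7 == 4:           # end group: field numbers must match
                if inner >> 3 != tag >> 3:
                    raise ValueError("mismatched end-group tag")
                break
            pos = skip_field(buf, pos, inner)
    elif wire_type == 5:                 # fixed32
        pos += 4
    else:
        raise ValueError(f"cannot skip wire type {wire_type}")
    return pos
```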
I welcome thoughts.