[This is a follow-up to the 13-month-old Why we can’t lazy-decode JSON]
I've been hugely distracted this year with commercial and $other work, so for-fun projects have been taking a back seat. Still, that doesn't mean my interests have changed, it's just that my energy is a bit low, and stringing out cohesive blog posts is even harder than usual. Consequently I'm trying to keep this post concise, and somewhat lighter on narrative and heavier on useful information compared to my usual rambling.
It only took 8 months, but finally I've made some fresh commits to Acid, this time progressing its internal Protocol Buffers implementation toward usefulness. Per ticket #41, the selling point of this implementation will be its ability to operate without requiring any copies (at least in the CPython extension implementation), and its ability to avoid decoding values unless explicitly requested by the user.
The ultimate goal is to let Python scan collections at close to the speed of the storage engine (6m+ rows/sec) without having to write tricksy code, import insane 3rd party code (*cough* SQL) or specialized stores (Neo4j?).
I've only prototyped the pure-Python module that will eventually be used on PyPy, trying to get a feel for what the internals of the CPython extension will need to look like, and figuring out the kinds of primitive types (and their corresponding pitfalls) the module might want to provide/avoid.
The design is pretty straightforward, although special care must be paid, e.g. when handling repeating elements, which will be represented by lists, or list-like things that know how to share memory and lazily decode.
The road to a 55x speedup
In the course of the past few days' experimentation, quite a fun story has emerged around optimizing bit-twiddling code like this to run on PyPy.
My initial implementation, based on some old code from an abandoned project, was sufficient to implement immediate decoding of the Protocol Buffer into a dict, where the dict-like Struct class would then proxy __getitem__ calls and suchlike directly on to the dict.
The downside, though, per the design requirement, is that in order to implement a row scan where only one or two fields are selected from each row during the scan, a huge penalty is paid in decoding and then discarding every other unused field.
For testing decoding/encoding, I began with a "representative" Struct/StructType containing the fields:
Field #1: varint bool, 1 byte, True
Field #2: varint list, 5x 1 byte elements, [1, 2, 3, 4, 5]
Field #3: inet4, 1x fixed size 32bit element '255.0.255.0'
Field #4: string, 1979 byte /etc/passwd file
Field #5: bool, 1 byte, True
Field #6: string, 12 bytes, my full name
Tests are done using the timeit module to either measure encoding the entire struct, or instantiating the struct from its encoded form and accessing only field #6. We're only interested in field #6 because, in a lazy implementation, it requires the most effort to locate: the decoder must first skip over all the preceding elements.
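The harness is roughly of this shape (a hedged sketch: the Struct/StructType names in the comments are the post's, but the stand-in buffer and helper here are purely illustrative):

```python
import timeit

# Stand-in for the encoded record; in the real benchmark this is the
# Protocol Buffer encoding of the six fields listed above.
encoded = b'\x08\x01' * 3

def decode_and_access(buf=encoded):
    # Real test: something like Struct.from_raw(struct_type, buf) followed
    # by a single field access (hypothetical names) -- here we just touch
    # the buffer so the harness shape is runnable.
    return buf[-1]

elapsed = timeit.timeit(decode_and_access, number=10000)
usec_per_call = elapsed / 10000 * 1e6
```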
On PyPy, the initial implementation was sufficient to net around 9.1usec to decode and 5.1usec to encode, corresponding to a throughput of around 100k rows/sec. Not bad for a starting point, but we can definitely do much better than that.
StringIO/BytesIO is slowww
From previous experience I knew the first place to start looking was the use of file-like objects for buffer management. Both on CPython and PyPy, use of StringIO to coordinate read-only buffer access is horrendously slow. I'm not sure I know why exactly, but I do know how to avoid it.
So first up came replacing StringIO with direct access. Instead of passing a file-like object between all the parsing functions simply to track the current read offset, we pass (buf, pos) in the parameter list, and all parsing functions return (pos2, value). The caller resumes parsing at pos2. For free, we now get IndexError thrown any time a bounds check fails for a single-element access, where previously we had to check the length of the string returned by fp.read(1). The fastest code is nonexistent code.
I'm not even going to attempt guessing at why this is so effective, but clearly it is: parsing time on PyPy 2.4.0 amd64 already dropped from 9.1usec to 1.69usec, all for a very simple, systematic modification to each function signature. Now we're up from 100k rows/sec to almost 600k/sec.
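The convention looks roughly like this (a sketch assuming the standard varint wire format; read_varint here is illustrative, not necessarily the module's real signature):

```python
def read_varint(buf, pos):
    """Decode one varint starting at `pos`; return (pos2, value).

    Indexing past the end of `buf` raises IndexError for free,
    replacing the explicit length check on fp.read(1).
    """
    value = shift = 0
    while True:
        byte = ord(buf[pos])    # IndexError doubles as the bounds check
        pos += 1
        value |= (byte & 0x7f) << shift
        if not byte & 0x80:
            return pos, value
        shift += 7

# The caller simply resumes at the returned position:
buf = '\x96\x01\x05'            # varint 150 followed by varint 5
pos, first = read_varint(buf, 0)
pos, second = read_varint(buf, pos)
```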
Not only that, but now the parser can operate on any sequence-like object that has single-character string elements and supports slicing, including e.g. mmap.mmap, which you could call the ultimate form of lazy decoding ;)
Lazy decoding take #1
Next up is implementing lazy decoding. This modifies the Struct type to simply stash the encoded buffer passed to it during initialization, and ask StructType during each __getitem__ to find and decode only the single requested element. Once the element is fetched, Struct stores it in its local dict to avoid decoding it again.
With lazy decoding, work has shifted from heap allocating lists of integers and duplicating large 2kb strings, to simply scanning for field #6, never even having to touch a byte of that 2kb /etc/passwd file embedded in the record. Our parsing time drops from 1.69usec to 0.494usec. Now we're getting warmer -- 2m (struct instantiation + field access)/sec.
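The shape of the change is roughly this (a simplified sketch that handles varint fields only; the real StructType dispatches on each field's declared type):

```python
def read_varint(buf, pos):
    value = shift = 0
    while True:
        byte = ord(buf[pos])
        pos += 1
        value |= (byte & 0x7f) << shift
        if not byte & 0x80:
            return pos, value
        shift += 7

class LazyStruct(object):
    def __init__(self, buf):
        self.buf = buf          # no up-front decoding at all
        self._cache = {}

    def __getitem__(self, field_id):
        if field_id in self._cache:
            return self._cache[field_id]
        pos = 0
        while pos < len(self.buf):
            pos, key = read_varint(self.buf, pos)
            fid = key >> 3      # the wire type lives in the low 3 bits
            pos, value = read_varint(self.buf, pos)
            if fid == field_id:
                self._cache[field_id] = value
                return value
        raise KeyError(field_id)

# Two varint fields: field 1 = 1, field 2 = 300.
s = LazyStruct('\x08\x01\x10\xac\x02')
```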
inline read_key() call for read_value()
At this point I thought it was already time to break out the micro-optimizations, and so I tried inlining the read_key() function, responsible for splitting the protocol buffer's field tag varint into its component parts, moving its implementation to its only call site.
Not much juice here, but a small win -- 0.44usec.
precalculate varint to avoid some shift/masking
Now we're really in micro-optimizations territory. For a savings of 5nsec, precalculate some trivial math. Barely worth the effort.
specialize encode for PyPy
At this point the code is still using StringIO for the encode path, since the convenience is too hard to give up. PyPy provides a magical StringBuilder type, which knows how to incrementally build a string while avoiding (at least) the final copy when it is finalized.
As you can see from the commit message, this switch to more efficient buffering brought encode time on PyPy down by nearly a third.
You'll probably notice by now that the benchmark script used in the commit messages was getting edited as I went along. The numbers in the messages are a fair indication of the level of speedup occurring, but due to horrendous bugs in the initial unrolled write_varint(), I can't easily reproduce the reference runtime from this post using the current version of that script.
Usually loop unrolling is a technique reserved for extremely tricky C code, but that doesn't mean it doesn't have a place in Python land. This commit takes a while loop that can execute at most 10 iterations and manually unrolls it, replacing all the state variables with immediate constants and very direct code.
Due to the ubiquitous use of variable-length integers in the protocol buffers scheme, in return we see a huge speed increase: encoding is now 2.5x faster than the clean "Pythonic" implementation. In some ways, this code is vastly easier to follow than the old loop, although I bet if I tried to run flake8 on the resulting file, a black hole would spontaneously form and swallow the entire universe.
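To illustrate the idea (a hedged sketch: the real commit unrolls all ten possible bytes of a 64-bit varint, whereas here only the first two cases are spelled out before falling back to the loop):

```python
def write_varint_loop(w, i):
    # The clean "Pythonic" version: general but branch-heavy.
    while i > 0x7f:
        w(chr(0x80 | (i & 0x7f)))
        i >>= 7
    w(chr(i))

def write_varint_unrolled(w, i):
    if i < (1 << 7):            # 1 byte: the overwhelmingly common case
        w(chr(i))
    elif i < (1 << 14):         # 2 bytes, fully spelled out
        w(chr(0x80 | (i & 0x7f)))
        w(chr(i >> 7))
    else:                       # larger values: fall back to the loop
        write_varint_loop(w, i)

out_a, out_b = [], []
write_varint_loop(out_a.append, 300)
write_varint_unrolled(out_b.append, 300)
```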
use bytearray() instead of StringIO.
While we're targeting PyPy, that's not to say we can't also look after old CPython. Even though a C extension will exist for CPython, having the fallback implementation work well there is also beneficial.
Here we exploit the fact that bytearray.extend is generally faster on CPython than the equivalent StringIO dance, and so in this commit we bid farewell to our final use of StringIO.
Notice how the use of "if StringBuilder:" effectively avoids performing runtime checks in the hot path: instead of wiring the test into the _to_raw() function, we simply substitute the entire function with one written specifically for whatever string builder implementation is available.
For a small amount of effort, encoding on CPython is now almost 25% faster.
partially unroll read_varint()
We're not really interested in encoding -- writes are generally going to be a slow path in the average web app. We care mainly about decoding, so back to looking for quick wins in decoder land.
Anyone familiar with what this code does may be noticing some rather extraordinarily obvious bugs in these commits. My only excuse is that this is experimental code, and I've already done a full working day before sitting down to it. ;)
Loop unrolling need not go the whole hog. By noticing that most variable-length integers are actually less than 7 bits long, here we avoid the slow ugly loop by testing explicitly for a 7-bit varint and exiting quickly in that case.
In return, decoding time on PyPy drops by another 33%.
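Something like this (a sketch of the fast-path idea, not the exact committed code):

```python
def read_varint(buf, pos):
    byte = ord(buf[pos])
    pos += 1
    if not byte & 0x80:          # fast path: most varints fit in 7 bits
        return pos, byte
    value = byte & 0x7f          # slow path: general accumulate loop
    shift = 7
    while True:
        byte = ord(buf[pos])
        pos += 1
        value |= (byte & 0x7f) << shift
        if not byte & 0x80:
            return pos, value
        shift += 7
```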
fully unroll read_varint and fix 64bit bugs.
Aware of all the bit-twiddling terrors of the past few days, tonight I rewrote write_varint/read_varint to ensure these functions do what they claim.
In the process, I unrolled the read_varint loop. I don't have benchmarks here, since by this point my benchmarking script would crash due to the aforementioned bugs.
The remainder of the loop unrolling probably isn't so helpful, since most varints are quite small, but at least it is very easy to understand the varint format from reading the code.
only cache mutable values in Struct dict.
We're already down to 0.209usec/field access on PyPy, corresponding to somewhere in the region of 4.9m scanned rows/sec.
At this point I re-read ticket #41 and realized that introducing the "avoid work" cache of decoded elements was not part of the original design, and probably wasn't helping performance either.
For mutable elements we must always track the value returned to the user: if they modify it and later re-serialize the Struct, they expect their edits to be reflected. So I removed the cache for everything except mutable elements.
In return, PyPy awards us with a delicious special case, relating to its collection strategies feature. On PyPy, a dict is not really ever a dict. It is one of about 5 different implementations, depending on the contents of the dict.
In the case of a dict that has never been populated with a single element, only 3 machine words are allocated for it (type, strategy, storage), and it's configured to use the initial EmptyDictStrategy.
By removing a performance misfeature, the runtime has a better chance to do its job, and single field decoding becomes 25% faster, yielding our final instantiate + field access time of 0.168usec.
constant time skip function selection [late addition]
This change splits the _skip() function, responsible for skipping over unknown or unwanted fields, into a set of functions stored in a map keyed by wire type.
This has no discernible effect on my PyPy microbenchmark, but it wins nearly 11% on CPython.
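A sketch of the dispatch-table shape (wire-type numbers follow the Protocol Buffers encoding: 0 = varint, 1 = 64-bit, 2 = length-delimited, 5 = 32-bit; the function names here are illustrative):

```python
def _skip_varint(buf, pos):
    # Advance past continuation bytes, then the final byte.
    while ord(buf[pos]) & 0x80:
        pos += 1
    return pos + 1

def _skip_64(buf, pos):
    return pos + 8

def _skip_delimited(buf, pos):
    # Read the varint length prefix, then jump over the payload.
    length = shift = 0
    while True:
        byte = ord(buf[pos])
        pos += 1
        length |= (byte & 0x7f) << shift
        if not byte & 0x80:
            return pos + length
        shift += 7

def _skip_32(buf, pos):
    return pos + 4

SKIP_MAP = {0: _skip_varint, 1: _skip_64, 2: _skip_delimited, 5: _skip_32}

def skip(wire_type, buf, pos):
    # One dict lookup replaces a chain of comparisons.
    return SKIP_MAP[wire_type](buf, pos)
```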
remove method call & reorder branch to avoid jump in the usual case [late addition]
Another micro-optimization with no benefit on PyPy. Swaps the order of the branches in an if: statement to avoid a jump on CPython. Additionally replace a method call with a primitive operation (thus avoiding e.g. building the argument tuple).
Yields another 3% on CPython.
use array for sequence decode where possible [late addition]
Both PyPy and CPython have native code implementations (C and RPython) of array.array, which, with a little care, we can reuse to handle decoding sequences of the various integer and floating point types.
This yielded a 7x improvement on PyPy for a sequence of 4000 32-bit integers, I guess because array.array() is internally just doing a string copy.
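For packed fixed-width data, the trick looks roughly like this (a sketch; note that array's 'i' typecode is formally platform-dependent, though 4 bytes on all common platforms):

```python
import array
import struct

raw = struct.pack('<4i', 10, 20, 30, 40)   # a packed chunk of 4 int32s

def decode_fixed32_loop(raw):
    # Per-element decode: one unpack call and one Python int per element.
    return [struct.unpack_from('<i', raw, off)[0]
            for off in range(0, len(raw), 4)]

def decode_fixed32_array(raw):
    # Bulk decode: frombytes() is effectively a single memcpy.
    a = array.array('i')
    a.frombytes(raw)
    return a

decoded = decode_fixed32_array(raw)
```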
implement DelimitedSequence [late addition]
Protocol Buffers support several kinds of sequence, but for Acid we're only interested in three:
Packed, variable-length sequences. This gathers all the values for a sequence into a single block of data, then stores the data tagged as a size-delimited field, avoiding the cost of individually tagging each sequence member. Decoding requires bytewise walk over the entire chunk. Encoding requires writing elements to a temporary buffer until the entire chunk size is known, before copying it into the message buffer, since the chunk's length is varint-encoded (and thus we don't know how much space to reserve for it).
Packed, fixed-width sequences. Like the previous type, except the size of the encoded representation of the sequence elements is known, so we can randomly access any sequence element without decoding all its neighbours (since we know its starting offset in the chunk). Encoding requires no temporary buffer, since we know before encoding the elements what size their final representation will be.
Delimited sequences. This is the original Protocol Buffer array format. Each sequence element is individually tagged and stored directly in the message. In Acid, it is used for otherwise quite large items, like arrays of strings, bytes, or embedded structs. Encoding is efficient, however decoding requires scanning every field in the message to ensure all elements have been found. Acid always stores fields in ID order, so we can terminate the search a little early.
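The temporary-buffer requirement for the first kind can be sketched like this (helper names are illustrative):

```python
def write_varint(out, i):
    while i > 0x7f:
        out.append(0x80 | (i & 0x7f))
        i >>= 7
    out.append(i)

def encode_packed_varints(field_id, values):
    # Elements must go into a temporary buffer first: the chunk's own
    # length is varint-encoded, so we can't reserve space for it until
    # all the elements have been written.
    tmp = bytearray()
    for v in values:
        write_varint(tmp, v)
    msg = bytearray()
    write_varint(msg, (field_id << 3) | 2)  # wire type 2: length-delimited
    write_varint(msg, len(tmp))             # now the chunk size is known
    msg.extend(tmp)
    return bytes(msg)
```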
On accessing a delimited sequence field, the previous implementation would immediately decode each sequence element. In the case of strings and bytes, that meant immediately doing lots of copies, even if the user was only calling len() on the sequence. For structs, it meant initializing a bunch of Struct instances, which is quite fast, but would involve lots of allocations.
The new implementation instead returns a list-like object that only scans enough of the sequence to discover starting offsets of each element. Then, its __getitem__ uses this array of offsets to lazily decode elements as they are requested, assuming they're ever requested. Since the objects are always likely expensive to construct, for now the decoded representation is also cached.
Per the commit message, for pathological cases, decoding is now around 30 times faster. This one needs revisiting later: there are a variety of options for implementing it, and I'm not happy with the fixed cost (3 allocations) even when the number of sequence elements may be tiny.
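A minimal sketch of the idea (using a toy single-byte length prefix rather than the real wire format):

```python
class DelimitedSequence(object):
    def __init__(self, buf):
        self.buf = buf
        self.offsets = []       # (start, length) of each element's payload
        self._cache = {}
        pos = 0
        while pos < len(buf):   # scan only far enough to find boundaries
            length = ord(buf[pos])   # toy single-byte length prefix
            self.offsets.append((pos + 1, length))
            pos += 1 + length

    def __len__(self):
        return len(self.offsets)    # no element is decoded for len()

    def __getitem__(self, i):
        if i not in self._cache:
            start, length = self.offsets[i]
            # The copy happens only on first access, then is cached.
            self._cache[i] = self.buf[start:start + length]
        return self._cache[i]

seq = DelimitedSequence('\x03foo\x03bar')
```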
I notice that 15% of time is spent in Struct.__getitem__, mostly trying to figure out whether the key is a valid field or not, and whether the result value is mutable.
We can get some of this dispatch 'for free' by folding the lookup into the type system by producing proper Struct subclasses for each StructType, and introducing a BoundField type that implements the descriptor protocol.
But that means abandoning the dict interface, and significantly complicating the CPython extension implementation when it comes time to write that, and I'm not sure just how much of the 15% can really be recovered by taking this approach.
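A speculative sketch of what that might look like (all names here are hypothetical, and the "buffer" is a plain dict rather than an encoded message):

```python
class BoundField(object):
    """Descriptor: field lookup rides on ordinary attribute dispatch."""
    def __init__(self, field_id, decoder):
        self.field_id = field_id
        self.decoder = decoder

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # Decode lazily from the stashed buffer on each access.
        return self.decoder(obj.buf, self.field_id)

def make_struct_class(name, fields, decoder):
    # Generate one Struct subclass per StructType, one descriptor per field.
    ns = {fname: BoundField(fid, decoder) for fname, fid in fields.items()}
    ns['__init__'] = lambda self, buf: setattr(self, 'buf', buf)
    return type(name, (object,), ns)

# Toy decoder: the "buffer" is just a dict keyed by field id.
Person = make_struct_class('Person', {'name': 1}, lambda buf, fid: buf[fid])
p = Person({1: 'David'})
```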
So there we have it: in a few hours we've gone from 100k rows/sec to upwards of 6 million/sec, all through just a little mechanical sympathy. Now imagine this code scaled up, on the average software project, and the hardware cost savings involved should the code see any heavy traffic. But of course, there is never any real business benefit to wasting time on optimization like this! And of course, let's not forget how ugly and un-pythonic the resulting mess is.
Ok, I'm all done for now. If you know any more tricks relating to this code, please drop me a line. Email address is available in git logs, or press the Ask Me Anything link to the right of this text. Remember to include an email address!