So finally digging into the code after 3 months, and I must say the previous post was a little optimistic. There is still quite a lot of fundamental drudge-work needing resolved in Acid before I'm going to waste time bikeshedding a new API.
Playing with the code has reminded me why I'm working on this library at all. On a 780MiB LMDB database containing 6.1 million keys, while the currently Python-only BatchIterator yields a paltry 108k records/sec in an index scan, where each index entry results in a random IO to retrieve the primary record, with raw=True (where record decoding is disabled), BasicIterator which is already written in C yields a rather solid 382k/sec. Doing full decoding of the record so that Model instances are produced that are backed by a JSON-decoded dict, this drops the BasicIterator down to 163k/sec and BatchIterator to 42k/sec.
Still I'm not too disheartened by the fully-decoded numbers, since I have yet to begin implementing lazy decoding as described here. For key-only index scans (e.g. as used in joins) and collection scans the numbers look far better: around 2.8 million keys/sec for the index and 1.6m/sec for the collection. For a collection scan and full decoding this drops to around 233k/sec, again before lazy decoding is implemented, which promises performance on a par with no decoding at all.
Back of the envelope math suggests that coupled with a suitable HTTP/network framework, this would allow the quad core i7 these tests were run on to serve somewhere north of 10k HTTP requests/sec if each request were doing a 2x500 index join yielding 100 results.
One of the tickets discovered this morning is the need for an improved batch record format, one that was pretty much already designed before I temporarily downed tools. In Acid, batch records are a way to represent any uninterrupted, ordered range of records as a single compressed record in the storage engine. Since the records are fed to the compressor as a single unit, any data redundancy occurring across record boundaries will be compressed away, much more efficiently than compressing the records individually.
Thanks to the nature of ordered engines, it is possible to locate a batch record for free by performing a lookup on any of its member records' keys. On attempting to access any single, partially or fully overlapping subset, the library transparently decompresses the batch as if it never existed, with the only visible cost mostly being due to decompression time, which behaves very predictably and is easily budgeted for, especially considering the huge performance savings from avoiding the use of an external DBMS server.
Batches are useful in at least two circumstances: when selecting from large ranges of "low information content" read-only data like time series that compress well, and for random lookups on "low request rate" data, such as 10 year old forum threads, that only web crawlers might bother to access periodically. By keeping low traffic or "bulk selected" records compressed, the goal is to keep a much larger subset (say 3x-5x) of an application's data in RAM or other fast storage (e.g. SSD), thus reducing cost, without forcing the end user (i.e. developer) to design specifically for compression.
The old batch format sucks in many ways. Batches are saved with the record's key encoding a list of member record key tuples, from highest to lowest, as a single bytestring key. This ensues a >= search on an ordered engine for any member key will result in the engine's iterator being placed on the batch record.
Within the compressed record value, a variable-length encoded list of accumulative integers describes the starting offsets of each member record's data, before the concatenated member record values themselves.
The first problem with the old format is that since its key must encode all member keys, there is a natural limit on the size of a batch since many engines restrict the maximum key size. For example in LMDB, this is 511 bytes by default.
Secondly, since Acid's key encoding is focused on space efficiency, membership tests essentially require either a linear scan, or fully parsing the complete list of member keys before performing e.g. a bisection search.
Thirdly, when the member key tuple contains some length-varying string, it is possible that a single record with a particularly large string is enough to break a batch into a smaller-than-useful size. Similarly for collections with very long keys, such as URLs, batches would be naturally restricted to perhaps 10-15 records at best due to the many long string keys.
Related to 3, if the key itself is strongly redundant, then an opportunity is lost to compress the key data itself. This again is true in the case of data like an ordered list of web site URLs.
Finally, even if membership tests were cheap, actually locating the start of a member record's data requires either a linear scan or fully decoding the array of record starting offsets. While this isn't so interesting with expensive compressors like zlib, it might become more relevant with a compressor like snappy or lzham.
Ideally we could eliminate all these problems with a little work, and so that's the goal of ticket 30.
The new format will instead:
* Have a physical key that encodes a list of only [highest key, lowest key]. This allows a single batch to contain an arbitrarily large number of members, so long as any key is less than (in the case of LMDB) ~250 bytes.
I was tempted, but am ignoring the possibility for now, to instead have the physical key be [highest key, common prefix length, lowest key suffix] to save even more bytes, since I don't have a good use case.
Have a record value with:
* 16 bit int record count, placing upper limit of 65k members in a single batch. The sentinel 0x00 is reserved to allow the potential future addition of a bloom filter without needing to rewrite existing databases.
* 16 bit int maximum width of any member record's unique suffix. Given a batch containing members [aaab, aaac, aaaca, aaad] whose physical key is [aaad, aaab], the library will detect a "aaa" as the common prefix, and "2" as the maximum length of any unique suffix (because of "aaaca").
The following are concatenated together and compressed:
* Array of member key suffixes, in natural order, right-padded with NUL bytes to the width of the largest suffix. The key decoder will be extended to stop parsing when a tuple element of kind 0x00 is encountered, allowing concatenation of the common prefix and any member suffix+padding, without having to store the suffix length.
* Array of 24 bit int record starting offsets, in the natural key order, with each array element being an absolute byte offset, placing an upper limit of around 16MiB uncompressed data in a single batch, which is really quite huge.
* Concatenation of record data, in the natural key order.
This solves a bunch of problems (and introduces a few of its own!):
By leaving room for later extension with a bloom filter, it might be possible to extend the scheme to be suitable for compressing randomly accessed collections, where the vast majority of lookups produce a negative (i.e. not found) result. As the filter would be stored uncompressed, the expensive decompression step could be avoided in the case that the filter indicates some member key definitely does not exist.
By storing a constant-width array of member suffixes, not only can an individual member key be located using array bisection (logN), but in the case of a single member record with a particularly large suffix causing a large amount of NUL padding on all remaining array elements, the NULs will most likely compress away very efficiently. Any remaining redundant suffix text should also compress well.
Even without the bloom filter, with trickery it would be possible to decompress only the start of the record value, to get at the suffix array and perform membership tests.
Once a record's membership has been confirmed (or it is selected as the start or end bound of some query), its data offset can immediately be found by reading the 24-bit integer (index * 3) bytes into the offset array, again avoiding an expensive list decode/build step.
Well, I haven't tested any of this yet, but the approach seems quite reasonable. Next I'll write some throwaway scripts to model storage overhead of this approach, but I don't expect to be surprised much.
Until next time..
Addendum: turns out the above approach is slightly broken. Extending the key decoder to ignore 0x00 suffixes is only half the fight, since key compares are implemented with
memcmp() - which absolutely does not ignore the 0x00 suffix. So we must either fully decode the key to find its end, and trim the padding, or use another approach. The alternative approach is that the unique member suffix is left-padded with 0x00 during batch record creation, and effectively passed through "s.lstrip('\x00')" during key object construction. This works since no tuple element kind byte can start with 0x00 - so the first valid byte is guaranteed to be non-0x00. Voila, problem solved