After reading that PostgreSQL 9.5 will support BRIN indexes, it took me a couple of re-reads of the attached documentation to understand exactly what this index technique is about. Actually, it's really simple, but for people like me who prefer things to be spelled out, here are some hopefully useful (and at least somewhat accurate) notes.
As a quick recap, table rows in PostgreSQL are stored in an on-disk structure known as the heap. The heap is an array that is logically partitioned into 8KiB "pages", with each page containing one or more "tuples" (rows). To ease management, as the heap grows it is additionally split into 1GiB segment files on disk; however, the overall structure is still essentially just one big logical array.
When you ask PostgreSQL to insert a row into a table, it uses an auxiliary structure known as the free space map to locate the first available heap page for your relation ("table") that has sufficient space to store the data for your row. If your row is larger than a preset limit (around 2KiB), large columns are split out of the row's data and stored as a series of rows in an internal table (the so-called TOAST tables).
The net result is that each data row exists entirely within one page, and that row lives at a particular logical index (the "item ID") within its page. If PostgreSQL must refer to a row, it can uniquely identify it using just its page number and its index within the page. This pair of numbers is known as the row's ctid, or its tuple ID. Tuple IDs can thus be used as a small, efficient, unique locator for every row in a database, and they exist regardless of your schema design.
[Side note: that's not entirely true! If a row has been updated since the database was last VACUUMed, multiple versions will exist, chained together using some special fields in each version's on-disk data. For simplicity, let's just assume only one version exists.]
In the current PostgreSQL implementation, 32 bits are used for the page number and 16 bits for the item number (placing an absolute upper bound on a single database table at around 32 TiB), allowing the ctid to fit comfortably in 64 bits.
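That upper bound is easy to sanity-check with some back-of-the-envelope arithmetic (this is just the sizing calculation, not PostgreSQL code):

```python
PAGE_BITS = 32            # bits for the page number in a ctid
ITEM_BITS = 16            # bits for the item number
PAGE_SIZE = 8 * 1024      # 8 KiB heap pages

max_pages = 2 ** PAGE_BITS
max_table_bytes = max_pages * PAGE_SIZE

# 2^32 pages of 8 KiB each is 2^45 bytes, i.e. 32 TiB.
print(max_table_bytes // 2 ** 40, "TiB")
```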
Using just the name of a relation and a ctid, PG can first split the page number out of the ctid and use that to efficiently locate the physical database file and the offset where the page lives:
page_size = 8 * 1024                          # 8 KiB
pages_per_segment = (1024 ** 3) // page_size  # 1 GiB per segment file
segment, index = divmod(page_number, pages_per_segment)
page_offset = page_size * index
Finally, to locate the tuple within the page, a small lookup table (the line pointer array) exists at the start of each page, mapping item IDs to byte offsets within the page:
item_offset = page.lookup_table[item_id]
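Putting both steps together, a toy locator might look like this. It is only a sketch: real pages store a line pointer array with flags and lengths rather than a plain mapping, and the `locate` helper is my own name, not a PostgreSQL function:

```python
PAGE_SIZE = 8 * 1024                  # 8 KiB heap pages
SEGMENT_SIZE = 1024 ** 3              # 1 GiB segment files on disk
PAGES_PER_SEGMENT = SEGMENT_SIZE // PAGE_SIZE

def locate(page_number, item_id, lookup_table):
    """Return (segment file index, page offset within file, tuple offset)."""
    segment, index = divmod(page_number, PAGES_PER_SEGMENT)
    page_offset = PAGE_SIZE * index
    # The per-page lookup table maps item IDs to byte offsets in the page.
    item_offset = lookup_table[item_id]
    return segment, page_offset, page_offset + item_offset

# Page 140000 falls past the first 131072 pages, i.e. into the
# second 1 GiB segment file:
print(locate(140000, 3, {3: 512}))
```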
Without further help, answering a query such as SELECT * FROM person WHERE age BETWEEN 18 AND 23 would require PG to visit every page in the heap, decoding each row in turn and comparing its age column to the WHERE predicate. Naturally, for larger tables we prefer to avoid that, and an index is necessary to allow PostgreSQL to avoid scanning the full table.
The most common index type in PG is the btree, which maintains an efficient map from column value to ctid. Given the imaginary table:
Person table heap layout

  Page Number | Item ID | ctid   | Name    | Age | Creation Date
  ------------+---------+--------+---------+-----+--------------
  1           | 1       | (1, 1) | John    | 10  | 1998-01
  1           | 2       | (1, 2) | Jack    | 99  | 1998-02
  1           | 3       | (1, 3) | Jill    | 70  | 1998-03
  1           | 4       | (1, 4) | Jemma   | 19  | 1998-04
  2           | 1       | (2, 1) | George  | 60  | 1998-05
  2           | 2       | (2, 2) | James   | 44  | 1998-05
  2           | 3       | (2, 3) | Jocelyn | 55  | 1998-06
  2           | 4       | (2, 4) | Jemima  | 22  | 1998-07
  3           | 1       | (3, 1) | Jerry   | 60  | 1999-01
  3           | 2       | (3, 2) | Jarvis  | 44  | 1999-02
  3           | 3       | (3, 3) | Jasper  | 55  | 1999-03
  3           | 4       | (3, 4) | Josh    | 24  | 1999-04
  4           | 1       | (4, 1) | Jacob   | 60  | 2000-01
  4           | 2       | (4, 2) | Jesse   | 44  | 2000-02
  4           | 3       | (4, 3) | Janet   | 55  | 2000-03
  4           | 4       | (4, 4) | Justine | 24  | 2000-04
A btree index created using CREATE INDEX person_age ON person(age) might resemble:
person(age) btree index layout

  Age | ctid
  ----+-------
  10  | (1, 1)
  19  | (1, 4)
  22  | (2, 4)
  24  | (3, 4)
  24  | (4, 4)
  44  | (3, 2)
  44  | (4, 2)
  44  | (2, 2)
  55  | (3, 3)
  55  | (4, 3)
  55  | (2, 3)
  60  | (3, 1)
  60  | (4, 1)
  60  | (2, 1)
  70  | (1, 3)
  99  | (1, 2)
This is getting too long already, so cutting to the chase: we can see that PG can now efficiently locate an exact row given its associated indexed column value, and that value in turn is stored in a data structure that permits fast lookup. For the SELECT query from above, PG can jump to btree key 18 and scan out ctids until it reaches a key larger than 23. In the demo table, this means PG must visit only 2 rows from our set of 16, and prior to accessing the row data, it already knows each row definitely matches the predicate.
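The access pattern of that range scan can be mimicked with a sorted list and the bisect module. This is only an illustration of "jump to the first key, scan until past the last", not how PostgreSQL's btree is actually laid out:

```python
import bisect

# (age, ctid) pairs from the btree above, kept in key order.
index = [(10, (1, 1)), (19, (1, 4)), (22, (2, 4)), (24, (3, 4)),
         (24, (4, 4)), (44, (3, 2)), (44, (4, 2)), (44, (2, 2)),
         (55, (3, 3)), (55, (4, 3)), (55, (2, 3)), (60, (3, 1)),
         (60, (4, 1)), (60, (2, 1)), (70, (1, 3)), (99, (1, 2))]

def range_scan(lo, hi):
    """Jump to the first key >= lo, then scan ctids until a key exceeds hi."""
    matches = []
    i = bisect.bisect_left(index, (lo,))
    while i < len(index) and index[i][0] <= hi:
        matches.append(index[i][1])
        i += 1
    return matches

# Jemma (1, 4) and Jemima (2, 4) are the only people aged 18..23:
print(range_scan(18, 23))
```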
For some other queries, such as SELECT COUNT(*) FROM person WHERE age = 22, PG may not even need to visit the row data itself, since it can infer from index entries how many matching data rows exist. [Another MVCC caveat! This is not entirely true, since index entries may exist that point to deleted rows, or to rows created in later transactions.]
The crucial point to note, though, is that one exact index entry is produced for every row. That usually doesn't amount to much, perhaps 5-15% overhead relative to the source table; but for a large table, that overhead may be the difference between a dataset that fits in RAM and one where common queries end up hitting disk, or where IO is doubled by index access because the dataset was already vastly larger than available RAM. It's easy to imagine indexes quickly adding up, such that perhaps half of an application's storage is spent on them.
Finally, enough verbiage has been spilled that we can reach the point: BRIN indexes introduce a cool tradeoff where, instead of covering individual rows, index entries cover one or more heap pages:
person(age) BRIN index with group size 1

  Page Number | Has NULL values? | Lowest Age | Highest Age
  ------------+------------------+------------+------------
  1           | No               | 10         | 99
  2           | No               | 22         | 60
  3           | No               | 24         | 60
  4           | No               | 24         | 60
The structure is used like so: given a query such as SELECT * FROM person WHERE age BETWEEN 10 AND 15, PG will visit every index entry in turn, comparing its minimum/maximum values against the query predicate. If an index entry indicates that a range of pages might contain records matching the query, those pages will be scanned for matching rows. For this query, only one page contains rows whose age fields overlap the desired range, and so PG can avoid visiting 75% of the table.
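That scan strategy fits in a few lines of Python. The data structures and the `pages_to_scan` name are mine, purely for illustration:

```python
# Per-page summaries from the BRIN index above: page -> (min_age, max_age).
brin = {1: (10, 99), 2: (22, 60), 3: (24, 60), 4: (24, 60)}

def pages_to_scan(lo, hi):
    """Visit every summary; keep only pages whose range overlaps [lo, hi]."""
    return [page for page, (lowest, highest) in sorted(brin.items())
            if lowest <= hi and highest >= lo]

# Only page 1's range (10..99) overlaps 10..15, so 3 of 4 pages are skipped:
print(pages_to_scan(10, 15))
```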
Notice that in order to find just one row, PG must now scan a full page and compare each of its 4 rows against the query predicates. While index size is reduced, query time has increased! There is also little pattern in our age column: in fact, it is quite lucky that our index described only a single page covering the range 10..15. Had users signed up in a slightly different order, the distribution of ages across physical storage pages might have forced PG to scan many more pages.
[Another side note: unlike our dummy table above, a typical PG heap page may contain over 200 rows, depending on how many columns are present and how many of those are strings. Our dummy BRIN index above looks as if it contains just as much information as the original btree index, but that's only because my example has 16 rows instead of 800.]
BRIN also permits configuring how many heap pages contribute to an index entry. For example, we can halve the size of our first index while also halving its precision:
person(age) BRIN index with group size 2

  Page Numbers | Has NULL values? | Lowest Age | Highest Age
  -------------+------------------+------------+------------
  1-2          | No               | 10         | 99
  3-4          | No               | 24         | 60
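Building these summaries from the heap is mechanical. Here is a sketch with a configurable group size (the parameter name `pages_per_range` mirrors the real BRIN index option of that name, but the code itself is only illustrative):

```python
# Heap pages -> ages stored on each page, taken from the person table above.
heap = {1: [10, 99, 70, 19], 2: [60, 44, 55, 22],
        3: [60, 44, 55, 24], 4: [60, 44, 55, 24]}

def build_brin(pages_per_range):
    """Summarise each group of pages as (first_page, last_page, min, max)."""
    entries = []
    pages = sorted(heap)
    for i in range(0, len(pages), pages_per_range):
        group = pages[i:i + pages_per_range]
        ages = [age for page in group for age in heap[page]]
        entries.append((group[0], group[-1], min(ages), max(ages)))
    return entries

# Doubling the group size halves the index, but widens each range:
print(build_brin(2))  # [(1, 2, 10, 99), (3, 4, 24, 60)]
```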
Due to this work-increasing factor, and since every index entry must be visited (resulting in a potentially high fixed cost for any query), BRIN is probably never useful for "hot" queries against a table, or even much use at all in a typical "hot" database. However, for auxiliary queries, such as producing once-per-month reports or bulk queries against archival data, where reduced runtime or IO is desired without the storage cost of an exact index, BRIN may be just the tool.
Finally, notice in the original table how as new records were inserted, their creation date roughly tracked which database page they ended up on. This is quite a natural outcome since as the table grows, newer items will occupy later pages in the array, and so there is quite a reliable correlation between page number and the creation date column value. A BRIN index over this column would work very well.
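The effect of that correlation is easy to demonstrate with toy numbers: summarising a column whose values increase with page number yields tight, non-overlapping ranges, so a range query touches very few pages (assuming 4 rows per page, as in the example table):

```python
# Values inserted in order: each heap page receives the next 4 values.
rows = list(range(64))                        # 16 pages of 4 rows each
pages = [rows[i:i + 4] for i in range(0, 64, 4)]
brin = [(min(p), max(p)) for p in pages]      # one tight summary per page

# A query for values 20..23 overlaps exactly one page's range:
hits = [i for i, (lo, hi) in enumerate(brin) if lo <= 23 and hi >= 20]
print(hits)  # [5]
```

An uncorrelated column scattered over the same pages would instead produce wide, overlapping ranges, forcing most pages to be scanned.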
This was supposed to be a 5 minute "hey, that's cool!" post, but somehow I suck at keeping these things short. I found the source code documentation the best explanation for how this stuff works; the public wiki is pretty vague. If you have corrections, pump the Ask Me Anything link to the right of this page.