cdb: Format

The text below is copied from the original 19960914 document "A structure for constant databases" by D. J. Bernstein. Other explanations of the format are online from Yusuke Shinyama, Richard James Howe, and Wikipedia. The cdb64 format uses 64-bit quantities stored in 8 bytes; conceptually this can handle 16 exabytes, but the current code limits it to 1 exabyte.


A cdb is an associative array: it maps strings (``keys'') to strings
(``data'').

A cdb contains 256 pointers to linearly probed open hash tables. The
hash tables contain pointers to (key,data) pairs. A cdb is stored in
a single file on disk:

    +----------------+---------+-------+-------+-----+---------+
    | p0 p1 ... p255 | records | hash0 | hash1 | ... | hash255 |
    +----------------+---------+-------+-------+-----+---------+

Each of the 256 initial pointers states a position and a length. The
position is the starting byte position of the hash table. The length
is the number of slots in the hash table.

Records are stored sequentially, without special alignment. A record
states a key length, a data length, the key, and the data.

Each hash table slot states a hash value and a byte position. If the
byte position is 0, the slot is empty. Otherwise, the slot points to
a record whose key has that hash value.

Positions, lengths, and hash values are 32-bit quantities, stored in
little-endian form in 4 bytes. Thus a cdb must fit into 4 gigabytes.

A record is located as follows. Compute the hash value of the key in
the record. The hash value modulo 256 is the number of a hash table.
The hash value divided by 256, modulo the length of that table, is a
slot number. Probe that slot, the next higher slot, and so on, until
you find the record or run into an empty slot.

The cdb hash function is ``h = ((h << 5) + h) ^ c'', with a starting
hash of 5381.

Version: This is version 2025.10.21 of the "Format" web page.