Huffman decoder for HPACK
2022-03-27

The road to HTTP/2 (and consequently, HPACK)

One of the many benefits of HTTP/2 is that it doesn't waste bandwidth on headers the way HTTP/1.1 does. This is achieved by doing three things:

  1. Providing small symbols for often-used tokens/header names
  2. Allowing the reuse of headers from previous requests in the same session
  3. Using Huffman encoding to compress unique values, or things that might be reused in (2) in a subsequent request in the same connection.

These three things are the core of what is called HPACK. The first item can be implemented using a simple array lookup, so its implementation isn't particularly interesting. The second item can be implemented with something like a hash table; this could become interesting, but Lwan already has a reasonably good hash table that I can use. What remains is (3), and that's what we're going to talk about today, and how it's going to be implemented in Lwan.

This is an introductory post about compression, because, well, I'm a newcomer in this space and I'm writing what I know so I can refer to it later. I'm publishing this in the hope that it'll be useful for somebody. If you know how to write compressors and decompressors, merely looking at the table in the HPACK RFC will be enough for you to implement this thing and this blog post will probably be of little value for you.

Huffman... what?

If you're like me, and never studied how compression works before, you might have heard the terms "Huffman Tree" or "Huffman Encoding", but they could very well be called ✨magic✨ – because that's what every technology is, until you understand it. So I spent some time researching compression to eliminate the magic.

It turns out data compression is a fascinating field and I was really missing out! There's so much cleverness that went over my head when I was a little fly on the wall watching some of my friends discuss these things over 20 years ago, that I'm just beginning to grasp now. Progress, I think?

Count of how many times each symbol appears in the string we want to compress

    Letter     Count
    H          1
    e          1
    l          3
    o          2
    ,          1
    <space>    1
    W          1
    r          1
    d          1
    !          1

Like many fundamental things in computing, this idea comes from the 1950s, which is unsurprising, given how efficiently you can compress and decompress something, and how expensive storage used to be back then. It was considered optimal until recently, when things like arithmetic coding became a suitable replacement; about 60 years of being the go-to encoding for pretty much every kind of compression algorithm is, however, very surprising. (HTTP/3 was released 6 years after HTTP/2, in comparison.)

The main idea behind this encoding is to give each symbol a code whose length is inversely proportional to its probability of appearing in the input. In other words, short sequences of bits for popular symbols, longer sequences of bits for the rest. This is not unlike, for instance, Morse Code, which I'm sure you have heard of. Both encodings take the probabilities into consideration, and both have a very important property: a short code can never be the prefix of a longer one.

To better illustrate this, let's build a table for something that can compress the string "Hello, World!". The first thing we do is count how many times each letter appears in our input; this is shown in the table above.
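This counting step is trivial in the language we'll use later for the table generator:

```python
from collections import Counter

# Count how many times each symbol appears in the input string.
counts = Counter("Hello, World!")

# The two most frequent symbols will end up with the shortest codes.
assert counts["l"] == 3
assert counts["o"] == 2
# Every remaining symbol appears exactly once.
assert all(n == 1 for sym, n in counts.items() if sym not in "lo")
```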

From the above table, we can infer that the symbol l will have the shortest code, the symbol o will have the second shortest code, and every other symbol should have codes that are the same size. Samuel Morse, when he designed the per-letter version of Morse Code, did this process manually, probably using brute force over a lot of late nights analyzing texts written in English. It wouldn't be practical to come up with an encoding like that for everything we want to compress, so it took over 100 years for Huffman to come along and propose an algorithm to do this efficiently.

One way to think about building this code is to hang all symbols on an imaginary clothesline, sorted from the more probable to the less probable of appearing in the input:

l  o  H  e  ,  <space>  W  r  d  !

And then work from right to left, grouping symbols two by two. Each step will look like you're placing a hanger in place of the two symbols you just removed from the clothesline, and hanging both symbols in the same hanger:

l  o  H  e  ,  <space>  W  r  (d !)

Repeat with the next two:

l  o  H  e  ,  <space>  (W r)  (d !)

Now group them together:

l  o  H  e  ,  <space>  ((W r) (d !))

Keep repeating this until you're left with no more symbols to group together:

(l (o (((H e) (, <space>)) ((W r) (d !)))))

We can now traverse this tree and build a table that correlates each letter with its respective compressed version. You start at the root, with 0 total bits for the symbol, and each time you go down a level, you append 0 if you went left, or 1 if you went right. Once you find a leaf node, you have found the Huffman code for that symbol. Repeat until all leaves have been visited. After you're done, your table might look like this:

Huffman codes for each letter in "Hello, World!"

    Letter     Huffman Code
    l          0
    o          10
    H          11000
    e          11001
    ,          11010
    <space>    11011
    W          11100
    r          11101
    d          11110
    !          11111
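To make the traversal concrete, here's a Python sketch: the nested tuples hard-code the tree from the clothesline exercise (index 0 is the left child, index 1 the right child), and walking it reproduces the table above.

```python
# The tree from the clothesline exercise, as nested tuples.
tree = ("l", ("o", ((("H", "e"), (",", " ")), (("W", "r"), ("d", "!")))))

def build_codes(node, prefix=""):
    """Walk the tree, appending 0 when going left and 1 when going right."""
    if isinstance(node, str):  # leaf node: we found the code for this symbol
        return {node: prefix}
    left, right = node
    return {**build_codes(left, prefix + "0"),
            **build_codes(right, prefix + "1")}

codes = build_codes(tree)
assert codes["l"] == "0"
assert codes["o"] == "10"
assert codes["H"] == "11000" and codes["!"] == "11111"
```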

From this table, you can confirm some things: l, the most frequent symbol, got the shortest code; o, the second most frequent, got the second shortest; every other symbol got a 5-bit code; and no code is a prefix of another.

If we use this table to compress the string we set out with in the beginning, we could write it as:

The string "Hello, World!", encoded

    110001100100101101011011    Hello,<space>
    11100101110101111011111     World!

Which has 47 bits in total, not counting the table itself. If you encode the same text with 8 bits per character, you would need 104 bits, which is over twice as much. Not bad! However, you might wonder: how do I decode this, given this variable-length encoding?
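A quick sanity check of those numbers in Python, using the code table from above:

```python
codes = {"H": "11000", "e": "11001", "l": "0", "o": "10", ",": "11010",
         " ": "11011", "W": "11100", "r": "11101", "d": "11110", "!": "11111"}

# Encoding is just a per-character lookup and concatenation.
encoded = "".join(codes[c] for c in "Hello, World!")

assert len(encoded) == 47                 # versus 13 * 8 = 104 bits uncompressed
assert encoded.startswith("110001100100101101011011")  # "Hello,<space>"
```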

Note

If the only thing we needed to compress were the string "Hello, World!", we didn't really need to go through all this trouble. We wouldn't need a single bit of information to store, encode, or decode it, as we would know that the string was exactly that.

The series Compressor Head on YouTube explains this (and many other compression-related things) pretty well.

Decoding

To decode that stream of bits we just produced, we can use the tree we came up with, following a process similar to the one we used to build the table: read a bit at a time from the input. Start at the root; if the bit is zero, go left on the tree; if it's one, go right; if you reach a leaf, you found your symbol. Repeat until the input bit stream is empty. This works, but it's inefficient: there's a lot of pointer chasing, branches, and other no-nos that we don't like to entertain.
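For the toy example, that bit-at-a-time tree walk could look like this in Python (the nested tuples encode the tree we built earlier):

```python
tree = ("l", ("o", ((("H", "e"), (",", " ")), (("W", "r"), ("d", "!")))))

def decode_walking(bits):
    """Decode by chasing tree nodes, one input bit at a time."""
    output, node = "", tree
    for bit in bits:
        node = node[int(bit)]        # 0 goes left, 1 goes right
        if isinstance(node, str):    # leaf: emit a symbol, restart at the root
            output += node
            node = tree
    return output

assert decode_walking("110001100100101101011011"
                      "11100101110101111011111") == "Hello, World!"
```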

We need a way to avoid all that stuff if we want decoding to be fast. We already spent some time producing that table, so why not use it for decoding as well? Can we gain something with it?

Let's write some pseudo-code to explore a bit:

struct huffman_symbol {
     int bit_length; /* Bits from the input stream needed to decode this symbol */
     char symbol;
};

const struct huffman_symbol huffman_decode_table[32] = {
    [0b00000 ... 0b01111] = {1, 'l'},
    [0b10000 ... 0b10111] = {2, 'o'},
    [0b11000]             = {5, 'H'},
    [0b11001]             = {5, 'e'},
    [0b11010]             = {5, ','},
    [0b11011]             = {5, ' '},
    [0b11100]             = {5, 'W'},
    [0b11101]             = {5, 'r'},
    [0b11110]             = {5, 'd'},
    [0b11111]             = {5, '!'},
};

string decode(bit_stream input)
{
     output = "";

     while (true) {
         bits_to_peek = min(bit_length(input), 5);

         /* If there are no more bits to peek, we're done! */
         if (!bits_to_peek)
             break;

         bit_sequence = peek_bits(input, bits_to_peek);

         /* If a bit sequence isn't defined in the table, its bit_length member
          * will be set to 0, which will cause us to throw an error after this
          * condition. */
         if (!huffman_decode_table[bit_sequence].bit_length)
             throw InvalidInput;

         /* If the bit sequence is defined, however, we can go ahead and
          * produce an output symbol.  Since a symbol's code can be shorter
          * than the 5 bits we peeked, we only consume as many bits as
          * that particular symbol required. */
         consume_bits(input, huffman_decode_table[bit_sequence].bit_length);
         output += huffman_decode_table[bit_sequence].symbol;
     }

     /* Do a happy dance! */
     return output;
}

By inverting the table, where the bits peeked from the input are used as an index, we avoid a lot of the work we would have if we were traversing a traditional tree. (It's still the binary tree we built above: it has just been flattened in such a way that the path to a leaf node is encoded in the bit pattern that forms each item's index.) With a single array dereference, we can get to our symbol and know how many bits we actually need from the input; that's why there are multiple elements in that array for symbols that require fewer than 5 bits.
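Translating the pseudo-code above into runnable Python for the toy table, padding each code out to 5 bits to build the 32-entry array:

```python
codes = {"l": "0", "o": "10", "H": "11000", "e": "11001", ",": "11010",
         " ": "11011", "W": "11100", "r": "11101", "d": "11110", "!": "11111"}

# Build the flattened table: every 5-bit pattern that starts with a
# symbol's code maps to (bits to consume, symbol).
table = [None] * 32
for symbol, code in codes.items():
    pad = 5 - len(code)
    for suffix in range(1 << pad):
        table[(int(code, 2) << pad) | suffix] = (len(code), symbol)

def decode(bits):
    output, pos = "", 0
    while pos < len(bits):
        # Peek up to 5 bits, padding with zeroes near the end of the input.
        peeked = int(bits[pos:pos + 5].ljust(5, "0"), 2)
        length, symbol = table[peeked]
        if length > len(bits) - pos:
            raise ValueError("truncated input")
        output += symbol             # produce one output symbol...
        pos += length                # ...and consume only its bits
    return output

assert decode("110001100100101101011011"
              "11100101110101111011111") == "Hello, World!"
```

In this toy table every 5-bit pattern is covered, so there is no "invalid input" entry; a table with gaps would keep `None` (the pseudo-code's bit_length of 0) in the unused slots.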

That's a clever approach! So clever that variations of this technique have been the bread and butter of many decompressors for decades. This is the one I decided to use in the HPACK Huffman decoder I'm writing for Lwan.

If you want to go deeper, including various ways peek_bits() and consume_bits() could be implemented, the series about compression by Fabian "ryg" Giesen is worth a read.

Note

Variations might include producing a table that contains more than one symbol in a single entry.

For example: let's say you profile this algorithm and learn that what dominates the run time is the array indexing to find the bit_length and the output symbol. With this knowledge, you can reduce the amount of indexing by grouping more output symbols in a single entry.

In our toy example, where the longest code has 5 bits, we could expand the decoding table to contain 1024 (or 2^10) elements; each element would have at least two of the longest symbols, and up to 10 of the shortest. The pseudo-code below gives some idea of how this could be implemented. The actual decoding algorithm wouldn't be that much different from the code already shown: the main difference is that, instead of appending a single character at a time, we can concatenate strings.

struct huffman_symbol_combined {
     int bit_length;
     string symbols;
};

/* Underscores separate the input bit sequences that generate each output symbol. */
const struct huffman_symbol_combined huffman_decode_table_combined[1024] = {
  [0b0_0_0_0_0_0_0_0_0_0] = {10, "llllllllll"},
  /* ... */
  [0b0_0_10_10_10_10] = {10, "lloooo"},
  /* ... */
  [0b10_10_10_10_10] = {10, "ooooo"},
  /* ... */
  [0b11000_11001] = {10, "He"},
  /* ... */
  [0b11010_11011] = {10, ", "},
  /* ... */
  [0b11110_11111] = {10, "d!"},
  [0b11111_11111] = {10, "!!"},
};
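A sketch of how such a combined table could be generated (not necessarily how anyone does it in production): for each 10-bit pattern, greedily peel off complete symbols and record how many bits they consumed.

```python
codes = {"l": "0", "o": "10", "H": "11000", "e": "11001", ",": "11010",
         " ": "11011", "W": "11100", "r": "11101", "d": "11110", "!": "11111"}
by_code = {code: sym for sym, code in codes.items()}

def combined_entry(pattern):
    """Decode as many whole symbols as fit in a 10-bit pattern."""
    bits = format(pattern, "010b")
    consumed, symbols = 0, ""
    while consumed < 10:
        # Thanks to the prefix property, at most one code can match here.
        match = next((c for c in by_code if bits[consumed:].startswith(c)), None)
        if match is None:  # a partial symbol straddles the 10-bit boundary
            break
        symbols += by_code[match]
        consumed += len(match)
    return consumed, symbols

table = [combined_entry(i) for i in range(1024)]

assert table[0b0000000000] == (10, "llllllllll")
assert table[0b0010101010] == (10, "lloooo")
assert table[0b1111011111] == (10, "d!")
```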

A similar approach is used by the LiteSpeed Web Server, and their write-up is worth reading.

This approach was not used for the HPACK decoder I plan to use in Lwan.

Alright. We now have a way to encode our data, and a decoder that can chew through it with reasonable efficiency. We can only do that, though, because our decoding tables are hardcoded in our decoders; general-purpose compressors, such as gzip, also need to store that table somewhere. That's one of the reasons that it's usually not worth the trouble to compress smaller files, because storing the compressed stream and the table in the same file might yield a compressed file that's larger than the original.

As for the table used by HTTP/2, it's specified in the RFC for HPACK. Not only does this reduce the size of the encoded data even further, since the Huffman table doesn't have to be carried alongside the data, it also makes it harder to exploit issues related to the encryption of compressed data.

Building Lwan's HPACK Huffman decoder

If you were curious enough to click the link to RFC 7541 above, you might have landed at the nicely formatted Huffman code table. With 257 possible symbols, and bit_length ranging from 5 to 30 bits, we can't simply implement our original approach: we would need at least 2^30 elements, and that's about a billion elements too many. We need a better approach. Let's look at an excerpt of the table we have to work with.

<...snip...>
    ( 30)  |11111111|11111111|11111111|1010         ffffffa  [28]
    ( 31)  |11111111|11111111|11111111|1011         ffffffb  [28]
' ' ( 32)  |010100                                       14  [ 6]
'!' ( 33)  |11111110|00                                 3f8  [10]
'"' ( 34)  |11111110|01                                 3f9  [10]
'#' ( 35)  |11111111|1010                               ffa  [12]
'$' ( 36)  |11111111|11001                             1ff9  [13]
'%' ( 37)  |010101                                       15  [ 6]
<...snip...>

The first two columns denote the output symbol, encoded in ASCII. If it's printable, there's a column with the character between single quotes. The second column has the ASCII code for that symbol, in decimal.

The next column shows, grouped by 8 bits at most, the encoded bit sequence that will produce the corresponding symbol. It's followed by its hexadecimal representation, and the actual bit_length for that symbol. The seemingly redundant information makes it possible to infer the order in which the bits are processed.

I just had to figure out how to process this data.

The first thing that came to my mind was to build multiple tables, containing 256 elements each. Decoding a symbol that requires more than 8 bits would mean that we would need to index multiple tables – which would still be reasonable, considering that most symbols, ranging from 5 to 8 bits, would be in the first table, which is more likely to be kept hot in the cache.

In order to know if we need to skip to the next table, all 256 elements are filled out, and the impossible bit_length of 0 is used to denote entries that require another table. Not all symbols use the entire 8 bits, though, but we can just pad the table with every other possible bit combination. Our table would, then, look like this:

<...snip...>
' ' ( 32)  |01010000                                     14  [ 6]
' ' ( 32)  |01010001                                     14  [ 6]
' ' ( 32)  |01010010                                     14  [ 6]
' ' ( 32)  |01010011                                     14  [ 6]
'!' ( 33)  |11111110|00000000                           3f8  [10]
'!' ( 33)  |11111110|00000001                           3f8  [10]
<...snip...>
'!' ( 33)  |11111110|00000110                           3f8  [10]
'!' ( 33)  |11111110|00000111                           3f8  [10]
<...snip...>

Of course, writing the whole table like this would take a long time and be very error prone. So, using a bit of Python, we can write some code to read this table, copied directly from the RFC, into a dict that maps each individual symbol to its encoded bit sequence, and then proceed to process that to generate all the tables we're going to need:

symbols = {}
for symbol, line in enumerate(open("huffman-table.txt")):
    _, code = line.strip().split(")  |", 1)
    code, _ = code.split(" ", 1)
    symbols[symbol] = code.split("|")
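To give an idea of the padding step, here's a Python sketch using just the two 6-bit codes from the excerpt above; a real generator would, of course, handle all 257 symbols and chain the overflow tables:

```python
# Two 6-bit codes from the RFC 7541 excerpt above.
codes = {" ": "010100", "%": "010101"}

# Level-0 table: 256 entries, empty (bit_length 0) by default.
table = [(0, None)] * 256
for symbol, code in codes.items():
    pad = 8 - len(code)
    # Fill every 8-bit pattern that begins with this code.
    for suffix in range(1 << pad):
        table[(int(code, 2) << pad) | suffix] = (len(code), symbol)

# ' ' occupies indices 0b01010000 through 0b01010011, exactly as in
# the padded excerpt above.
assert table[0b01010000] == (6, " ")
assert table[0b01010011] == (6, " ")
# An index no code covers keeps bit_length 0, signalling "use the
# next table".
assert table[0b11111111] == (0, None)
```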

By drawing the rest of the owl🦉, we generate all tables, and some C code to use them. An excerpt is below.

/* I have not included the whole code, as it is essentially the same
 * structure repeated over and over again.  I'm including this here merely
 * to illustrate one way of chaining tables to decode symbols that might
 * be too long for a single table.  */

struct h2_huffman_code {
  uint8_t symbol;
  int8_t num_bits;
};

/* This table contains all high-probability symbols, with code lengths
 * varying from 5 to 8 bits.  Codes that aren't in the table will be
 * static-initialized, and `num_bits` = 0 will be used to signal that we
 * must use another table.  */
static const struct h2_huffman_code level0[256] = {
    [0 ... 7] = {48, 5},      [8 ... 15] = {49, 5},
    [16 ... 23] = {50, 5},    [24 ... 31] = {97, 5},
    /* ... snip ... */
    [240 ... 241] = {119, 7}, [242 ... 243] = {120, 7},
    [244 ... 245] = {121, 7}, [246 ... 247] = {122, 7},
    [248] = {38, 8},          [249] = {42, 8},
    [250] = {44, 8},          [251] = {59, 8},
    [252] = {88, 8},          [253] = {90, 8},
};

/* This table contains codes that begin with 0b11111111 and are at
 * least 8 bits long.  As with the previous table, unpopulated items
 * signal that the next table has to be used. */
static const struct h2_huffman_code level0_11111111[256] = {
    [0 ... 63] = {63, 2},     [64 ... 95] = {39, 3},
    [96 ... 127] = {43, 3},   [128 ... 159] = {124, 3},
    /* ... snip ... */
    [240 ... 243] = {94, 6},  [244 ... 247] = {125, 6},
    [248 ... 249] = {60, 7},  [250 ... 251] = {96, 7},
    [252 ... 253] = {123, 7},
};

/* ... snip ... */

static inline const struct h2_huffman_code *next_level0(uint8_t peeked_byte) {
  /* The decoder will call this function to determine which table to use
   * next. */
  switch (peeked_byte) {
  case 0b11111111:
    return level0_11111111;
  case 0b11111110:
    return level0_11111110;
  default:
    /* Since level0 has only two unpopulated items that differ by a single
     * bit, this switch/case statement could be written as an if statement
     * that checks if the least significant bit of peeked_byte is set or
     * not.  We can determine if that's possible in Python by using something
     * like:
     *
     *       if len(code) == 2 and (code[0] ^ code[1]).bit_length() == 1:
     *           # Use mask = (code[0] & ~code[1]) to check which bit differs
     *       else:
     *           # Use a switch/case statement.
     *
     * But the code generator has not been taught this optimization yet.
     * In the meantime, inform the compiler that no other input is
     * possible, as it helps it generate slightly better code. */
    __builtin_unreachable();
  }
}

/* ... snip ... */

uint8_t *h2_huffman_decode(const uint8_t *input, size_t input_len) {
  const uint8_t *input_end = input + input_len; /* Get a pointer to the
                                                 * end of the input stream */
  uint8_t *output = malloc(output_size(input_len)); /* Over-allocate
                                                     * memory for the output */
  uint8_t *ret = output;
  struct bit_reader bit_reader = {.bitptr = input};

  while (input < input_end) {
    /* Peek the next 8 bits from the input.  This will not advance the
     * bit_reader.  */
    uint8_t peeked_byte = peek_byte(&bit_reader);
    if (LIKELY(level0[peeked_byte].num_bits)) {
      /* If we got a hit, generate one output byte */
      *output++ = level0[peeked_byte].symbol;
      /* Consume as many bits as we actually need */
      consume(&bit_reader, level0[peeked_byte].num_bits);
      /* And go on to the next symbol */
      continue;
    }

    /* Anything we read from this point will be at least 8 bits long, so
     * we need to discard in the bit_reader the 8 bits we already peeked
     * into peeked_byte.  */
    consume(&bit_reader, 8);

    /* Find the next table given the bits we already peeked */
    const struct h2_huffman_code *level1 = next_level0(peeked_byte);
    /* Peek 8 more bits */
    peeked_byte = peek_byte(&bit_reader);
    if (level1[peeked_byte].num_bits > 0) {
      /* Do the same thing: but notice we use the sign bit instead of just
       * the value here.  The reason is that we also need to check if the
       * bit sequence we read at the beginning of the loop is valid. */
      *output++ = level1[peeked_byte].symbol;
      consume(&bit_reader, level1[peeked_byte].num_bits);
      continue;
    }

    /* ... snip ... */

    /* Go on with the next level here.  It's exactly the same idea as in
     * the previous levels.  (Omitted for brevity.) */
  }

  /* ... (handle remaining bits) snip ...*/

  /* Do a happy dance! */
  return ret;

fail:
  free(ret);
  return NULL;
}

I've been avoiding working on Lwan to conserve my mental bandwidth, so this is the current state of the decoder. It still needs testing, profiling, and some tweaking before I can consider it to be used in a future HPACK implementation for Lwan, but I'm happy I got this far. I didn't know anything about compression before starting this project! Hopefully, if you were in the same boat, this blog post helped you learn something today.

If you liked this post, consider getting me a coffee!