Integer to string conversion

There are various ways to convert integers to their string representation. These conversions are rarely a bottleneck, but they often show up while profiling certain applications. For instance, they’re very common in Lwan while building the response headers.

To use Lwan as an example: initially, snprintf() was used to convert numbers. Although this works, it is quite boring, performance-wise.
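Something along these lines (a sketch of the idea, not Lwan's actual code at the time):

#include <inttypes.h>
#include <stdio.h>

size_t snprintf_uint32_to_str(uint32_t value, char *str, size_t size)
{
    int len = snprintf(str, size, "%" PRIu32, value);

    return len > 0 ? (size_t)len : 0;
}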

The second approach was the naïve algorithm, which repeatedly divides the number by 10, writes the result of the modulus by 10 backwards into a string, and then reverses the string once the number reaches 0.

// Code based on https://code.google.com/p/stringencoders/
static const char decimal_digits[] = "0123456789";

// Helper to reverse the digits in place once conversion is done.
static void strreverse(char *begin, char *end) {
    char aux;
    while (end > begin)
        aux = *end, *end-- = *begin, *begin++ = aux;
}

size_t naive_uint32_to_str(uint32_t value, char *str) {
    char *wstr = str;
    // Conversion. Digits are written in reverse order.
    do
        *wstr++ = decimal_digits[value % 10];
    while (value /= 10);
    *wstr = '\0';
    // Reverse string
    strreverse(str, wstr - 1);
    return (size_t)(wstr - str);
}

This was fine for a while, but that string reversal step always bothered me. Why not just write the string backwards in the first place?

I’ve then changed the code in Lwan to the following snippet. Note the nice trick of multiplying the size of an integer in bytes by 3 to obtain an approximation of the number of digits for MAX_INT, including the zero terminator, regardless of what sizeof(int) is.

#define INT_TO_STR_BUFFER_SIZE (3 * sizeof(int))

char *lwan_uint32_to_str(uint32_t value,
            char buffer[static INT_TO_STR_BUFFER_SIZE],
            size_t *len) {
    char *p = buffer + INT_TO_STR_BUFFER_SIZE - 1;

    *p = '\0';
    do {
        *--p = "0123456789"[value % 10];
    } while (value /= 10);

    size_t difference = (size_t)(p - buffer);
    *len = (size_t)(INT_TO_STR_BUFFER_SIZE - difference - 1);

    return p;
}

Reducing writes to the array made this algorithm significantly faster. However, I eventually did what one should always avoid when tinkering with this kind of thing: I changed the array lookup to an addition, without measuring whether it would perform better, and committed the code anyway. The lookup table is ~9% faster. Ouch!
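For the record, the two variants differ by a single line; the first is what the snippet above uses, the second is what I had committed without benchmarking:

*--p = "0123456789"[value % 10];   /* lookup table: ~9% faster */
*--p = (char)('0' + value % 10);   /* addition */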

Last year, the Facebook engineering team posted a function to convert integers to strings that manages to be even faster. They use the same idea of avoiding a string-reversal step after converting the digits, and they use a lookup table as well.

But the nice trick is that, instead of having a lookup table for 10 digits, there's a table for every pair of digits, from 00 to 99. This cuts the number of divisions in half, yielding a significantly faster algorithm: around 31% faster than the snippet above:

size_t facebook_uint32_to_str(uint32_t value, char *dst)
{
    static const char digits[201] =
        "0001020304050607080910111213141516171819"
        "2021222324252627282930313233343536373839"
        "4041424344454647484950515253545556575859"
        "6061626364656667686970717273747576777879"
        "8081828384858687888990919293949596979899";
    size_t const length = digits10(value);
    size_t next = length - 1;
    while (value >= 100) {
        auto const i = (value % 100) * 2;
        value /= 100;
        dst[next] = digits[i + 1];
        dst[next - 1] = digits[i];
        next -= 2;
    }
    // Handle last 1-2 digits
    if (value < 10) {
        dst[next] = '0' + uint32_t(value);
    } else {
        auto i = uint32_t(value) * 2;
        dst[next] = digits[i + 1];
        dst[next - 1] = digits[i];
    }
    return length;
}

The digits10() function is another function that calculates the number of digits of a number very efficiently. Fast as it is, though, the call can be eliminated altogether: using a constant like numeric_limits<uint32_t>::digits10 keeps the same interface. This is possible because the dst buffer should be large enough to hold all the digits of the largest 32-bit unsigned integer anyway.
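For reference, here's an iterative sketch of the digits10() idea – their actual implementation differs in its details, so take this as an illustration only:

static size_t digits10(uint32_t value)
{
    size_t result = 1;

    for (;;) {
        if (value < 10) return result;
        if (value < 100) return result + 1;
        if (value < 1000) return result + 2;
        if (value < 10000) return result + 3;
        /* At least five digits: strip four and go around again. */
        value /= 10000;
        result += 4;
    }
}

The constant variant just replaces the digits10(value) call with a fixed length – numeric_limits<uint32_t>::digits10 + 1, i.e. 10 digits for 32-bit values.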

Because of implementation details – the function basically compares the number against powers of 10, recursing when the number of digits surpasses the largest power it compares against – the speedup from using a constant length won't be significant for small numbers (one and two digits); but if you're optimizing at this level, using a constant won't hurt. So much so that it is consistently faster on my machine (a Core i7 2640M laptop, with an up-to-date 64-bit Arch Linux):

[Chart: relative speedup of facebook_uint32_to_str() using digits10() versus a constant value]

That chart was obtained with a benchmark program I wrote that exercises all these ways of converting an integer to its string representation. To compare with the other methods, here's the full chart:

[Chart: benchmark results for all conversion methods. Results for snprintf() omitted so as not to skew the chart. Spoiler: it's slow.]

Unfortunately, there’s a licencing issue that won’t let me use this code in Lwan. The blog post doesn’t mention the license. I’ve found this two-digit lookup table in places unrelated to Facebook as well, so I’m not sure who had this idea first. My go-to source of this kind of thing is usually Hacker’s Delight, but even then it’s not there.

Reducing Lwan memory usage by 94%

One of the things that bothers me when I'm writing software is that I never get things right the first time. It takes me quite a few iterations to achieve a good result – be it performance, memory usage, or a good architecture. Settling for "good enough" is also very frequent, as projects need to move forward; however, written code often ends up in a sort of low-priority refactoring thread inside my head. Sometimes this thread is able to produce a thing or two, and I'm able to revisit these things.

[Photo: "Project moving forward" by Todd Smith. Sometimes you're so focused on the goal that you end up not appreciating the journey.]

Background toys

One of the things in that refactoring thread was my toy web server's memory usage. It would consume a whopping 855MB of memory while idling; recent commits dropped this amount to a mere 32MB (with maybe some more room to spare). It used to use almost 27 times as much memory.

This was only possible because I know the code inside out and was able to refactor it a few times.

[Screenshot: Massif-visualizer windows, shown at different scales.]

Structure diet

Lwan allocates almost all memory it is going to need even before creating the main socket. This means it has to keep around some structures with information about connections, requests, and their responses.

The first drop in memory usage was the largest one. It was possible because the structure that keeps state for these things also kept state that was only useful during the request-parsing stage. By segregating this temporary state into another structure, allocated on the request-parsing routine's stack, memory usage fell dramatically.

Lots of flags were stored as bitfields in different substructures. Most of these were booleans, and having fewer than 32 of them meant I could coalesce all of them into a single unsigned integer. Memory usage dropped again.
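The change looked roughly like this (flag names here are illustrative, not Lwan's actual ones):

/* Before: booleans as bitfields, scattered across substructures. */
struct response {
    unsigned int chunked : 1;
    unsigned int has_body : 1;
    /* ... */
};

/* After: every boolean coalesced into a single unsigned integer. */
enum conn_flags {
    CONN_CHUNKED  = 1 << 0,
    CONN_HAS_BODY = 1 << 1,
    CONN_ALIVE    = 1 << 2,
    /* ...room for up to 32 flags... */
};

struct connection {
    unsigned int flags;   /* bitwise OR of enum conn_flags values */
    /* ... */
};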

Architecture smell

Then a few months passed, and out of the blue I realized that there was something wrong with the architecture: I was using the same structure to track both request state and connection state.

So I moved everything that only matters to a connection into its own structure – the one that's preallocated on startup – and made the request structure be allocated on the request-processing routine's stack. That stack lives in a coroutine, so it won't use more memory than was already allocated for the coroutine stack. Another worthwhile reduction in memory usage.
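Schematically, the split looks like this (again with illustrative field names):

/* Preallocated for every possible connection before the main socket
 * is even created; holds only state that outlives a single request. */
struct connection {
    unsigned int flags;
    /* ... */
};

/* Allocated on the request-processing coroutine's stack: costs no
 * extra memory and vanishes when the request is done. */
struct request {
    struct connection *conn;
    /* ...parsing state, response buffers... */
};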

This also made keep-alive connections a tiny bit faster, as there's no need to memset() the request structure to reset its state between requests anymore.

[Screenshot: Massif-visualizer windows, same scale this time. That drop.]

Reducing it further

There’s another possibility for memory reduction, but I’m not sure if it is worthy implementing.

Lwan uses epoll – and when a file descriptor is added to a poller, one can pass arbitrary data inside epoll_data_t, up to 64 bits in size. Both the file descriptor and the remote IP address could then be passed as this data, removing both fields from the connection structure.

This is possible because these values are constant while the connection is active; everything else either is useless for identifying the connection (the file descriptor is used as an index into an array of connections) or changes all the time, such as the flags (which would incur the penalty of calling epoll_ctl() every time they changed).
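A sketch of how that packing could work, assuming an IPv4 peer (the address fits in 32 bits – which is precisely why IPv6 would break this trick):

#include <stdint.h>
#include <sys/epoll.h>

/* Pack the file descriptor and the peer's IPv4 address into the
 * 64 bits of epoll_data when registering the socket. */
static int conn_register(int epoll_fd, int fd, uint32_t ipv4)
{
    struct epoll_event event = {
        .events = EPOLLIN,
        .data.u64 = ((uint64_t)ipv4 << 32) | (uint32_t)fd
    };

    return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event);
}

/* Unpack both values again when an event fires. */
static void conn_unpack(const struct epoll_event *event,
                        int *fd, uint32_t *ipv4)
{
    *fd = (int)(uint32_t)event->data.u64;
    *ipv4 = (uint32_t)(event->data.u64 >> 32);
}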

This would shave a few megabytes off the structures, which isn't really worth the effort considering that IPv6 support will need to be implemented someday and this trick would then be rendered useless. Maybe my refactoring thread will be able to answer that in a few months.

I’m still considering if it is worthy the trouble of leaking the request/connection abstraction and removing an integer from the request structure so all request-related flags would be set in the connection structure.

Update (11 Dec): I've found another way to remove these two structure members; I've committed this code on a separate branch, as further tests must be performed. In the same circumstances as the other tests, the server now uses 2MiB less memory. Basically:

  1. The remote IP address can be obtained through the getpeername() function; since it's not usually required, there's little need to keep it around.
  2. The socket file descriptor can be recovered through pointer arithmetic: each connection keeps a reference to the huge connection array it is part of, and since that array is indexed by file descriptor, subtracting the array base from the connection pointer yields the file descriptor (see the sketch after this list).
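A minimal sketch of the second point (type and field names are illustrative):

struct connection {
    struct connection *conns;   /* base of the preallocated array */
    /* ... */
};

static int connection_to_fd(const struct connection *conn)
{
    /* The connection array is indexed by file descriptor, so an
     * element's position in the array *is* its file descriptor. */
    return (int)(conn - conn->conns);
}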

Implementing sequences in lwan template engine

When I wrote about lwan's template engine in a blog post last year, I purposely omitted the fact that it didn't support sequences. It took me almost a year, but I've finally implemented them this week. (Lwan is usually a low-priority weekend project. Maybe that can serve as an excuse for my laziness.)

It took me three tries to get this right. Rube Goldberg machine kind of right, but there’s always some elegance in ingenuity.

The first try required one to create the whole list beforehand and then pass it to the template engine to render. Not only was this cumbersome, it would also require creating (potentially) large amounts of temporary objects.

The latter reason led me to think of a way to implement iterators in C. This is usually done using callbacks; although performant, it gets pretty verbose and tedious, as there is usually a need for structures to keep state, plus different callbacks to initialize, destroy, and advance the iterator.

Lots of things happened since then, and this feature sort of crept under the low-priority rug for the better part of a year.

While writing a completely different program in Python, however, it struck me: I could use the coroutine machinery I was already using in lwan to implement generator functions. A few minutes later, I had a working prototype, which can probably be best explained with the help of a diagram:

[Diagram: the template engine driving a generator coroutine]

In short, the engine creates a coroutine whenever it finds a {{#sequence}} template tag. This coroutine starts and executes as usual until it yields.

If the coroutine yields false, the engine assumes the iteration has ended, and proceeds to find the next matching {{/sequence}} tag to continue from there.

On a true yield, however, the engine recurses to apply everything between the iteration tags, repeating the process when the iteration-end tag is found and the coroutine yields true again.

The coroutine is supposed to clean up after itself before returning a false value.
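To make this concrete, a template using a sequence might look something like the following – the exact variable naming depends on the descriptors registered with the engine, so treat this as an illustration:

{{#file_list}}
  {{file_list.name}}
{{/file_list}}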

[Image: a Rube Goldberg machine. Professor Butts would be proud. Maybe.]

A sample generator function is shown below. It iterates over a directory with readdir(), returning 1 when a new item is available and 0 when there's nothing else to do. Notice that initialization, iteration, and cleanup are all contained within a single function.

Also, notice that there's no need to copy any values to and from a state structure – calling coro_yield() keeps the stack alive, so local variables can be used outside this function as long as a reference to them can be obtained.

int dir_list_generator(coro_t *coro)
{
    DIR *dir;
    struct dirent *ent;
    struct file_list *fl = coro_get_data(coro);

    dir = opendir(fl->path);
    if (!dir)
        return 0;

    while ((ent = readdir(dir))) {
        fl->list.name = ent->d_name;
        coro_yield(coro, 1);   /* !0 means "iter not done yet" */
    }

    closedir(dir);
    return 0;
}

The details of how the variable descriptors are set up are explained in the commit message that introduced this change. (The commit itself is quite buggy, but whatever I could find has been fixed in HEAD already.)

In an ideal world, one would use something akin to Go's channels, but if I were to implement them in lwan it would take perhaps another year. Plus, they wouldn't be as efficient as setting a few pointers. But they might be useful in the future, so I'm not completely discarding the idea. Although I've never written a single line of Go code, I've been reading a lot about it recently, and it is sort of positively impacting the way I think about programming. But I digress.

Partially Applied Functions in C

There are some functions in the standard C library that take a function pointer to be used as a callback later on. Examples include atexit() and signal(). However, these functions can't receive an arbitrary pointer (which could hold some important program state) in addition to the function pointer, so you're left with pesky global variables:

/* You have: */
atexit(foo); /* foo() will have to fetch program state from globals */

/* Instead of: */
static struct program_state state;
atexit(foo, &state); /* foo() now has a pointer to program state */

Turns out that there’s a workaround, but it involves some black magic.

I believe the overall mechanism is quite interesting; however, I do not recommend using it. Not only because the implementation wastes a whole memory page per callback, but also because I don't want to encourage people to perpetuate this kind of take-a-function-pointer-without-an-argument nonsense.

I’ll try to explain how this contraption works by showing the smaller parts first. I’ll begin with the template function. The idea is to have a function whose code can be patched up later – however that code turns out to be generated by the compiler:

#define PARAMETER_CONSTANT 0xFEEDBEEF
#define FUNCTION_CONSTANT 0xABAD1DEA

static void
partial_template_function(void)
{
    ((void (*)(void *))FUNCTION_CONSTANT)((void *)PARAMETER_CONSTANT);
}

The funky-looking cast basically says "call a function pointer at FUNCTION_CONSTANT with a pointer to PARAMETER_CONSTANT". Of course, if you call this code as is, the program will most likely crash. The idea is that the compiler generates code like this (IA-32 assembly):

0f00deba <partial_template_function>:
   0:    55                       push   %ebp
   1:    89 e5                    mov    %esp,%ebp
   3:    83 ec 18                 sub    $0x18,%esp
   6:    c7 04 24 ef be ed fe     movl   $0xfeedbeef,(%esp)
   d:    b8 ea 1d ad ab           mov    $0xabad1dea,%eax
  12:    ff d0                    call   *%eax
  14:    c9                       leave
  15:    c3                       ret

Even if you don’t know assembly, if you squint a little bit, you can clearly see the magic constants defined in the C code above. By writing a trivial function to patch these magic values to something useful (such as a real function or some real pointer argument):

static bool
patch_pointer(void *code_addr, size_t code_len, void *look_for,
              void *patch_with)
{
    unsigned char *code = code_addr;
    intptr_t look = (intptr_t)look_for;

    /* Scan the copied code for the magic constant and overwrite it,
     * byte by byte, with the real pointer value. */
    while (code_len >= sizeof(intptr_t)) {
        if (*(intptr_t *)code == look) {
            union {
              unsigned char octet[sizeof(void *)];
              void *ptr;
            } patch;

            patch.ptr = patch_with;
            for (size_t i = 0; i < sizeof(void *); i++)
                code[i] = patch.octet[i];

            return true;
        }

        code++;
        code_len--;
    }

    return false;
}

And here it is used to patch the pointers in a page allocated with mmap() (comments and error recovery have been omitted for brevity; the full source code is linked below):

struct Partial *
partial_new(void (*func)(void *data), void *data)
{
    struct Partial *t;

    if (!func) return NULL;

    t = calloc(1, sizeof(*t));
    /* partial_template_function must be declared just before partial_new
     * so that caller_len is calculated correctly */
    t->caller_len = (size_t)((intptr_t)partial_new -
          (intptr_t)partial_template_function);

    t->caller = mmap(0, t->caller_len, PROT_WRITE | PROT_READ,
          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    memcpy(t->caller, partial_template_function, t->caller_len);

    patch_pointer(t->caller, t->caller_len, (void *)FUNCTION_CONSTANT, func);
    patch_pointer(t->caller, t->caller_len, (void *)PARAMETER_CONSTANT, data);

    mprotect(t->caller, t->caller_len, PROT_EXEC | PROT_READ);

    return t;
}

The end result is a function that can be called without arguments – and which will magically call another function with a given parameter:

static void
test(void *data)
{
    printf("Test called with data=%p\n", data);
}

int main(void)
{
    struct Partial *p;

    p = partial_new(test, (void *)0x12341337);
    atexit(partial_to_function(p));

    return 0;
}

Which, when executed, will print:

[leandro@navi /tmp]$ ./a.out
Test called with data=0x12341337

So there you have it, partially applied functions in C. Useful? Hardly. Interesting? I think so. Fun? Yup.

If you’d like to try, the full source code, with comments and error recovery is available in this gist.

Mustache templates in C

Generating textual output is a lot easier with templates than with handcrafted functions. And it is a lot easier in languages such as Python, where things like introspection are easy and cheap. But that doesn't necessarily mean we can't do it in C if we know where to look.

I’ve implemented a subset of Mustache templates in C that leverages some tricks that makes template rendering both convenient and efficient. For instance, if you have a template such as this:

Hello, {{name}}!

It can easily be rendered with the following code:

hello_t data = {
  .name = "World"
};
lwan_tpl_render(hello, &data);

Where hello is the template, previously compiled into a series of simple instructions (such as "append text" or "append the value of a variable"), and the second parameter is a structure containing the data needed by the renderer.
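To give an idea of what that series of instructions looks like, here's a conceptual sketch (not lwan's actual types) of the compiled form of the template above:

enum chunk_action { APPEND_TEXT, APPEND_VAR, FINALIZE };

struct chunk {
    enum chunk_action action;
    const char *data;   /* text to append, or variable to look up */
};

/* "Hello, {{name}}!" compiles to roughly: */
static const struct chunk hello_chunks[] = {
    { APPEND_TEXT, "Hello, " },
    { APPEND_VAR, "name" },
    { APPEND_TEXT, "!" },
    { FINALIZE, NULL },
};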

My first thought for rendering these templates involved a hash table. While reasonably efficient (even considering the overhead of creating and destroying the table every time the template had to be rendered), hash tables are not first-class citizens in C, and the usage would be pretty clumsy, to say the least:

hash_table *ht = hash_table_new();
hash_table_add(ht, "name", "World");

lwan_tpl_render(hello, ht);

hash_table_free(ht);

Instead, I’ve decided to go on another road: use standard C structures to store the values in their native form, and then find a way to lookup these values whenever necessary to render the template.

The first trick, then, was to use a C99 feature called compound literals, which GCC supports even in C89 mode. Compound literals allow the use of anonymous arrays, among other things, and provide enough syntactic sugar to conveniently group the template variables:

lwan_tpl_render(hello, (hello_t[]) {{
  .name = "World"
}});

Without a way to know which value to fetch from the structure, however, this would not help much. Enter the second trick: the offsetof(3) macro, which computes the offset of a field within a given structure. By storing this offset alongside data type information, value lookup becomes not only possible but can also work with types other than strings:

typedef struct {
  char *name;
  int age;
} hello_t;
/*
 * The TPL_VAR_??? macros provides some convenience to declare each
 * descriptor. These expand to a declaration containing the name of
 * the variable as a string (used to validate the template during
 * compile time), the field offset, and pointers to functions that
 * convert the values to string and check if they're empty.
 *
 * The SENTINEL type is there so the template compiler knows when to
 * stop looking for descriptors, since of course you can have as
 many
 * fields as necessary.
 */
lwan_var_descriptor_t hello_descriptor[] = {
  TPL_VAR_STR(hello_t, name),
  TPL_VAR_INT(hello_t, age),
  TPL_VAR_SENTINEL
};
lwan_tpl_t *hello;
strbuf_t *rendered;

/*
 * ``hello'' would usually be compiled once and kept around for
 * the whole duration of the program.
 */
hello = lwan_tpl_compile("hello.tpl", hello_descriptor);

/*
 * Rendering the template then would be just the matter of calling
 * this function, which will output a ``strbuf_t''. The template
 * compiler estimates the starting size of this string buffer, so
 * rendering incurs very few expensive reallocations, if there
 * are any reallocations at all.
 */
rendered = lwan_tpl_render(hello, (hello_t[]) {{
  .name = "World",
  .age = 42
}});

printf("%s\n", strbuf_get_buffer(rendered));
strbuf_free(rendered);
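Under the hood, the renderer's lookup boils down to pointer arithmetic with the stored offset – something along these lines (a sketch, assuming the descriptor keeps the offsetof() result in an offset field):

static const char *get_string_value(const void *data,
                                    const lwan_var_descriptor_t *descriptor)
{
    /* Skip descriptor->offset bytes into the data structure and
     * read the char pointer stored there. */
    return *(char *const *)((const char *)data + descriptor->offset);
}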

Code for this engine is available in the wip branch of my toy web server, lwan. It is not currently used there, but it is built alongside the main program and can be tested by invoking the generated template executable.

Before using this in lwan, though, I'll try to employ this nifty trick to JIT-compile the templates and avoid some of the overhead where it really matters. That may open a whole can of worms from a security standpoint – but it wouldn't be fun without some risk, would it? :)