Mustache templates in C

Generating textual output is a lot easier with templates than it is with handcrafted functions. And it is a lot easier in languages such as Python, where things like introspection are easy and cheap. But that doesn’t necessarily mean we can’t do that in C if we know where to look.

I’ve implemented a subset of Mustache templates in C that leverages some tricks that makes template rendering both convenient and efficient. For instance, if you have a template such as this:

Hello, {{name}}!

It can easily be rendered with the following code:

hello_t data = {
  .name = "World"
};
lwan_tpl_render(hello, &data);

Where hello is the template that was previously compiled into a series of simple instructions (such as append text or append the value of a variable), and the second parameter is a structure containing the data needed by the renderer.

My first thought to render these templates would involve the use of a hash table. While reasonably efficient (even considering the overhead to create and destroy the table every time the template had to be rendered), they’re not first class citizens in C, and the usage would be pretty clumsy, to say the least:

hash_table *ht = hash_table_new();
hash_table_add(ht, "name", "World");

lwan_tpl_render(hello, ht);

hash_table_free(ht);

Instead, I’ve decided to go on another road: use standard C structures to store the values in their native form, and then find a way to lookup these values whenever necessary to render the template.

The first trick, then, was to use a C99 feature called compound literals, which is supported by GCC even in C89 mode. This trick allows the use of anonymous arrays, among other things, and provides enough syntactic sugar to conveniently group the template variables:

lwan_tpl_render(hello, (hello_t[]) {{
  .name = "World"
}});

Without a way to lookup which value to obtain from the structure, however, this would not help much. Enter the second trick: the offsetof(3) macro, which computes the offset of a field in a given structure. By storing this offset alongside data type information, the value lookup is not only possible but can also work with types different than strings:

typedef struct hello_t {
  char *name;
  int age;
};
/*
 * The TPL_VAR_??? macros provides some convenience to declare each
 * descriptor. These expand to a declaration containing the name of
 * the variable as a string (used to validate the template during
 * compile time), the field offset, and pointers to functions that
 * convert the values to string and check if they're empty.
 *
 * The SENTINEL type is there so the template compiler knows when to
 * stop looking for descriptors, since of course you can have as many
 * fields as necessary.
 */
lwan_var_descriptor_t hello_descriptor[] = {
  TPL_VAR_STR(hello_t, name),
  TPL_VAR_INT(hello_t, age),
  TPL_VAR_SENTINEL
};
lwan_tpl_t *hello;
strbuf_t *rendered;

/*
 * ``hello'' would usually be compiled once and kept around for
 * the whole duration of the program.
 */
hello = lwan_tpl_compile("hello.tpl", hello_descriptor);

/*
 * Rendering the template then would be just the matter of calling
 * this function, which will output a ``strbuf_t''. The template
 * compiler estimates the starting size of this string buffer, so
 * rendering will incur in very few expensive reallocations, if
 * there are reallocations at all.
 */
rendered = lwan_tpl_render(hello, (hello_t[]) {{
  .name = "World",
  .age = 42
}});

printf("%s\n", strbuf_get_buffer(rendered));
strbuf_free(rendered);

Code for this engine is available in the wip branch of my toy web server, lwan. It is not currently used there, but it is built alongside the main program and can be tested by invoking the generated template executable.

Before using that in lwan, though, I’ll try to employ this nifty trick to JIT-compile the template and avoid some of the overhead where it really matters. While at the same time possibly opening a whole can of worms from the security standpoint, though – but it wouldn’t be fun without some risk, would it? :)

Posted on 11-11-2012 in

Programming on an Arduino without a PC

I’ve attended this year’s FISL, both as a booth attendee (at ProFUSION’s booth, demonstrating a few of our end-user-visible projects), and as a speaker for my old FINF project.

FINF is a Forth-like programming environment that I’ve written in my first year at the college. It’s not the first compiler I wrote, but it was the first that was actually fun to write. Some years later, I’ve decided to rewrite it so that it would work on the Arduino – and that’s what I went to FISL to talk about.

audience

Arduinos are traditionally programmed by using its IDE, in a language that resembles C++. In fact, it is C++, but some of the (boring) details are hidden. But, being C++, it’s bound to the slow write-compile-upload-test procedures; there’s no interactive prompt, such as you have with Python or the venerable 8-bit Microsoft Basic. And since Arduino is all about experimentation, an interactive prompt is a must.

FINF is there to fill this gap. It is not a full FORTH implementation; only a small subset of it is there, but it’s enough to blink some LEDs, make some noise, and – if a video output shield is used – use the Arduino as an 8-bit computer! But, since user code actually runs on top of a very simple virtual machine due to the Harvard architecture used by the AVR microcontroller, it’s not possible to expand the interpreter without getting dirt in your hands. Add that to the quite messy code, mix it with myself not being a good marketer, and you have yet another failed open source project of mine! :)

In any case, the slides (in Portuguese) are available online. Unfortunately, the presentation was not recorded, so if you were not there, you’ve missed the great opportunity of seeing myself making a LED blink in front of an audience.

(By the way, I’ll be talking during EFL Developer Day in Barcelona early next month. If you’re there for LinuxCon/Embedded Linux Conference and would like to join me for some beers, don’t hesitate to contact me!)

Posted on 27-10-2012 in

Vectored I/O with mmap() to serve files

Previously, I’ve improved file serving performance in lwan by dramatially cutting down on the number of system calls performed to serve a file. However, for small files (< 16KiB), the throughput drop from the hello handler (which merely responds “Hello, World!”) was significant, since lwan was still performing system calls to open, obtain the size, and close the file.

I’ve experimented with userland caching before, but it never occurred to me to use mmap(). For the unitiated, this system call offers a way to map a file into memory, by giving a pointer to the process virtual memory space, that, when dereferenced, will perform the necessary disk I/O if the pages were not already present in the kernel buffers. Wikipedia has more details about it. Using mmap() greatly simplifies caching code by relaying it to the kernel, closer to where the low level buffers are.

By using a memory-mapped buffer and writev() (which the hello handler uses through lwan’s abstractions), the file serving performance improved about 60%! Before the optimization, weighttp would be able to make ~170000 requests/s. Now, ~286000 requests/s can be made. (That’s on my laptop, a Core i7 2640m, with 8GiB of RAM and without spinning platters.)

Of course, introducing caching also introduces a lot of complexity. Not only the file serving handler almost doubled its size (from 350 lines to 610 lines), but I’ve had to add a hash table implementation (with around 430 lines) and a directory watcher that uses inotify at around 150 lines of C code. In the order of 840 lines of code to improve performance by about 60%. About 30% more lines of code to improve performance in 60% – not bad, methinks.

On the other hand, the cache mechanism brings shared mutable state. This is protected by mutexes, of course, but I’m not sure if I got it right. One more reason to not use lwan in production.

As a bonus to these things, lwan now offers deflated content for the files in the cache when asked.

Posted on 14-10-2012 in

Asynchronous I/O in C with Coroutines

Writing asynchronous I/O code in C is kind of tedious, and often leads to a callback hell. But it doesn’t have to be this way; if you have a main loop, it’s quite simple to use coroutines and write code in a soothing, old school, synchronous way.

yodawg

Under POSIX, it’s also quite easy to implement coroutines, via the use of the stuff contained in the ucontext.h header. Unfortunately deprecated in favor of threads, the functions and structures found in this header are one of the unpopular gems in the POSIX C library.

Coroutines in lwan, my toy web server

In lwan, there are n + 1 worker threads, where n is the number of logical CPUs. One thread is the acceptor thread, which accepts connections, and gives control to the other n threads, which in turn do all the work of receiving a request, parsing, and delivering the response.

Each of these worker threads can multiplex thousands of connections by polling on events: so, for each worker thread, only one request can be handled at a time. And, if one blocking operation (say, write to a socket) would block, all other requests would wait for a response.

By yielding the coroutine at the right moments, lwan blocks only on calls to epoll(4). Whenever the socket can be written again, that coroutine is resumed. The request handler function does not know what happened.

Implementing coroutines using ucontext

As said earlier, POSIX offers some infrastructure that can be used to implement coroutines or more powerful concepts, like call/cc. However, they’re quite tricky to use, so it’s often a good idea to offer a thin wrapper on top of them. The API used in lwan is the following (implementation details omitted for brevity):

coro_t *coro_new(coro_function_t function, void *data);
void coro_free(coro_t *coro);
int coro_resume(coro_t *coro);
void coro_yield(coro_t *coro, int value);

Coroutine allocation

A coroutine is pretty much a simple data structure; the real implementation has more fields, but they’re implementation details:

struct coro_t_ {
	coro_function_t function;
	void *data;

	ucontext_t context;
	int yield_value;

	char stack[1];
};

To allocate one, just allocate space for sizeof(coro_t_) + stack_size, initialize the variables, and call getcontext(&coro->context). (These context-swapping functions are weirdly-named in my opinion: getcontext actually saves the current context into the variable pointed to by its sole parameter.)

After that, one just need to set up the context so that it points to the newly-allocated stack:

coro->context.uc_stack.ss_sp = coro->stack;
coro->context.uc_stack.ss_size = stack_size;
coro->context.uc_stack.ss_flags = 0;
coro->context.uc_link = NULL;

And then call makecontext() so that the coroutine entry point can be called when resuming the coroutine. (Another weirdly-named function. No wonder why this thing is quite unpopular.) This function takes a variable number of parameters, each one being a 32-bit integer value. I don’t know why a void * wasn’t used instead – so, on 64-bit, I use an union to break a pointer into two 32-bit components:

union ptr_splitter {
	void *ptr;
	uint32_t part[2];
};

That’s it to allocate a coroutine.

Freeing a coroutine

Just free the coroutine’s coro_t structure. Lwan’s implementation does run the deferred statements at this moment.

Resuming a coroutine

Resuming a coroutine pretty simple: one has to save the current context, swap the current context with the coroutine context, and when the coroutine yields, return the contexts where they were.

Yielding from a coroutine

To yield from a coroutine, just save value into coro_t’s yield_value field, and make a call to swapcontext(), swapping the current coroutine stack with the context that was active when the coroutine was resumed (which happens to be the routine that resumes a coroutine – which now cleans up and return to whoever called it, most probably the main loop). value is now available to whoever called coro_resume() and is used in lwan to determine if a coroutine should be resumed.

Using the coroutines

From the user perspective, it’s just like calling some blocking function:

lwan_sendfile(request->socket, file_fd, 0, total_bytes);

Behind the scenes, lwan_sendfile is actually doing this (error handling omitted for brevity):

while (sent_bytes < total_bytes) {
	read_bytes = read(file_fd, buffer, buffer_size);
	sent_bytes += read_bytes;
	write(socket_fd, buffer, read_bytes);

	coro_yield(coro, 1);
}

(Of course, if available, sendfile(2) is used instead in a similar fashion, but this better illustrates the point.)

Whenever the coroutine yields, it goes back to the main loop, which is now free to resume another coroutine. Ideally, one could yield to be resumed on a certain condition (instead of assuming that the condition is just “the socket is ready to be written to”), but this isn’t possible in the current implementation.

For implementation simplicity, the same timer code that is used for keep-alive connections is used for coroutines, so that they don’t linger indefinitely.

Implementation details

  • On 64-bit, hand-tuned assembly versions of ucontext routines are used. These routines avoid saving and restoring the signal mask (avoiding two roundtrips to the kernel), and does not save the floating point registers.
  • Also, on 64-bit, resuming a coroutine is orders of magnitude faster, since not everything is copied when switching contexts.
  • Swapping stacks makes tools like Valgrind get pretty crazy. Lwan’s implementation uses Valgrind-provided macros that marks the newly-allocated blocks (from the heap) as stacks.
  • The real implementation has a coro_switcher_t data structure. This structure is used to both avoid race conditions when swapping coroutines in different threads, but also to maintain coroutine state from different threads.

There are other details that were ommitted from this post. Lwan’s source code is small enough it can digested easily, and if you’re not sleeping already, check it out.

Closing notes

Although not as performant as the traditional way of using callbacks (resuming and yielding from coroutines are a little bit more expensive than calling a function), coroutines brings a lot of simplicity when writing asynchronous code.

The example shown here might not be expressive, but imagine an application fetching data from a key-value store from another machine: there might be dozens of calls to the database to build a web page, which would be pretty difficult to handle if there were dozen callbacks. With a synchronous style, that would be a lot easier to write and maintain.

One could argue that the same thing could be done in threads. But creating more threads than there are processors will often hurt performance (for various reasons) in noticeable ways. Also, coroutines are cheaper on the memory requirements: in Lwan, they sport 16KiB of stack space and it takes a little bit more than a malloc() to set them up.

I believe we should stop using callbacks for asynchronous I/O and use things like this. Even if ucontext.h is deprecated from POSIX, the functions a fairly trivial to write (even in assembly language) – actually, encouraged, given that swapcontext() and getcontext() makes often unnecessary system calls.

Posted on 29-09-2012 in

Presenting EasyUI

Introduction

I’ve been working at ProFUSION on a project called EasyUI for the past few months. This library is based on Google’s V8 JavaScript engine and the Enlightenment Foundation Libraries and aims to diminish the hurdle in writing native applications for the forthcoming Tizen platform.

EFL itself – specially its UI toolkit, Elementary – follows a pretty traditional approach to creating applications: the library user must know how to join all bits and pieces, which often leads to common code that is written and rewritten in each new application.

By observing common patterns and providing an uniform MVC interface, EasyUI parts from this traditional approach and offers a new way to create applications using EFL.

Before talking about code, let me show a video (best viewed in HD) of some sample EasyUI applications being executed:

Brief overview of an EasyUI app

An app begins with a simple call to EUI.app(), passing at least one parameter: the main view-controller. The other parameter is an optional object that contains application settings, such as the theme or window titles. EasyUI applications are contained inside one window only, since the main focus are apps for mobile devices with stacked lists and the eventual popup that appears on top of the content.

The code below shows a typical controller-view; explanation will follow.

MainController = EUI.ListController({
	title: 'My Favorite Fruits',
	model: new ArrayModel(['Pear', 'Banana', 'Uvaia']),
	itemAtIndex: function(index) {
		return {
			text: this.model.itemAtIndex(index)
		};
	},
	selectedItemAtIndex: function(index) {
		var fruit = this.model.itemAtIndex(index);
		this.pushController(new FruitController(fruit));
	},
	navigationBarItems: { right: 'Add' },
	selectedNavigationBarItem: function(item) {
		if (item === 'Add')
			this.pushController(new AddFruitController);
	}
});

Lots of things happens with the declaration of MainController. Of note:

  • As opposed to the traditional way of laying out components on screen with EFL, controllers implements the basic user interface and the application developer focuses only on defining behavior.
  • Attributes can be functions, which will be called whenever EasyUI needs them. For instance, one could change the title based on how many fruits were in the model, by writing a title function like so:
function() {
	if (this.model.length() == 1)
		return "My Favorite Fruit";
	return "My Favourite Fruits";
}
  • MainController is derived from ListController. This means that it follows a collection contract, and must implement methods like itemAtIndex and selectedItemAtIndex, as well as providing a model attribute.
  • One could simply swap ListController for GridController if a grid layout were to be more appropriate for this particular application.
  • In addition to the basic contract, controllers might sign for more; for instance by declaring navigationBarItems, one is required to implement selectedNavigationBarItem.

Behind the scenes, the framework will initialize the EFL, create the window and required widgets, listen to callbacks – and call the application code in appropriate moments.

The next post in this series will show the anatomy of a Reddit client.

Show me the code

We’re just working on some licensing issues right now. This should be released as an open source project. As soon as this is cleared up, EasyUI should hit Enlightenment’s SVN repository.

Posted on 21-09-2012 in

File serving with few system calls

When I first wrote lwan, file serving was not a primary goal. I’ve added this capability later, never giving much thought to the number of system calls required to serve one file. As a result, static file serving was quite slow compared to “hello world” requests. Bored one day, I’ve decided to speed this as much as I could.

Before optimizing, serving a file would look like this:

<... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 16383, 4294967295) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(8, "GET / HTTP/1.0\r\n", 4096) = 16
getcwd("/home/leandro/git/lwan/build", 4096) = 29
lstat("/home/leandro/git/lwan/build/files_root", {st_mode=S_IFLNK|0777, st_size=33, ...}) = 0
readlink("/home/leandro/git/lwan/build/files_root", "/home/leandro/git/blotest/output/", 4095) = 33
lstat("/home", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/home/leandro", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/home/leandro/git", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
lstat("/home/leandro/git/blotest", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/home/leandro/git/blotest/output", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
getcwd("/home/leandro/git/lwan/build", 4096) = 29
lstat("/home/leandro/git/lwan/build/files_root", {st_mode=S_IFLNK|0777, st_size=33, ...}) = 0
readlink("/home/leandro/git/lwan/build/files_root", "/home/leandro/git/blotest/output/", 4095) = 33
lstat("/home", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/home/leandro", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/home/leandro/git", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
lstat("/home/leandro/git/blotest", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/home/leandro/git/blotest/output", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
open("/home/leandro/git/blotest/output", O_RDONLY|O_NOATIME) = 9
fstat(9, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
close(9)                    = 0
open("/home/leandro/git/blotest/output/index.html", O_RDONLY|O_NOATIME) = 9
fstat(9, {st_mode=S_IFREG|0664, st_size=13200, ...}) = 0
setsockopt(8, SOL_TCP, TCP_CORK, [1], 4) = 0
write(8, "HTTP/1.0 200 OK\r\nContent-Length:"..., 100) = 100
fadvise64(9, 0, 13200, POSIX_FADV_SEQUENTIAL) = 0
sendfile(8, 9, NULL, 1400)  = 1400
epoll_ctl(6, EPOLL_CTL_MOD, 8, {EPOLLOUT|EPOLLERR|0x2000, {u32=8, u64=8}}) = 0
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400)  = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400)  = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400)  = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400)  = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400)  = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400)  = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400)  = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400)  = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400)  = 600
close(9)                    = 0
epoll_ctl(6, EPOLL_CTL_MOD, 8, {EPOLLIN|EPOLLERR|EPOLLET|0x2000, {u32=8, u64=8}}) = 0
close(8)                    = 0

Yes. That many system calls – I was not kidding when I said that file serving was added as an afterthought. After some experiments, I’ve managed to turn that mess into this:

<... epoll_wait resumed> {{EPOLLIN, {u32=9, u64=9}}}, 16383, 4294967295) = 1
read(9, "GET / HTTP/1.0\r\n", 4096) = 16
newfstatat(7, "index.html", {st_mode=S_IFREG|0664, st_size=13200, ...}, 0) = 0
openat(7, "index.html", O_RDONLY|O_NOATIME) = 10
sendto(9, "HTTP/1.0 200 OK\r\nContent-Length:"..., 223, MSG_MORE, NULL, 0) = 223
fadvise64(10, 0, 13200, POSIX_FADV_SEQUENTIAL) = 0
sendfile(9, 10, [0], 13200) = 13200
close(10)                   = 0

Ah, much better! This was a result of these steps:

  1. Caching the root directory information. This is mainly its realpath() and an open file descriptor to the directory.
    • The realpath() result is used with a strncmp() after so that requests to file outside the root directory are not served successfully. The string length for the root path is also calculated only once.
    • The astute reader will notice the usage of the Linux-only openat() and newfstatat() system calls. These -at() variants perform much like their standard ones, but they work relative to the directory pointed to the file descriptor passed as the first parameter, avoiding some expensive path-to-inode conversions.
  2. Using sendto() with MSG_MORE flag instead of using TCP_CORK flag. This makes for two less roundtrips to the kernel to set a socket option.
  3. Using sendfile() with the whole file instead of sending it in chunks, to avoid coroutine context switches. sendfile() might still block, but in this case, the coroutine will yield and the next time, try to send a smaller chunk.
  4. The -at() version of system calls are also being used by a replacement realpath() routine that I’ve adapted from glibc. This improved the performance as well by not only reducing the number of system calls (the standard realpath() will perform a lstat() for every path component, whereas this version will only perform newfstatat() for relative components), but also using the lighter -at() variants.

These improvements resulted in very low overhead while serving files. In fact, compared to a simple hello world handler and file serving – without keep-alive – the performance drop even comparing the I/O involved is about 5%.

Posted on 12-08-2012 in