Mustache templates in C
Generating textual output is a lot easier with templates than it is with handcrafted functions. And it is a lot easier in languages such as Python, where things like introspection are easy and cheap. But that doesn’t necessarily mean we can’t do that in C if we know where to look.
I’ve implemented a subset of Mustache templates in C that leverages some tricks that makes template rendering both convenient and efficient. For instance, if you have a template such as this:
Hello, {{name}}!
It can easily be rendered with the following code:
hello_t data = {
.name = "World"
};
lwan_tpl_render(hello, &data);
Where hello is the template that was previously compiled into a series of
simple instructions (such as append text or append the value of a
variable), and the second parameter is a structure containing the data
needed by the renderer.
My first thought to render these templates would involve the use of a hash table. While reasonably efficient (even considering the overhead to create and destroy the table every time the template had to be rendered), they’re not first class citizens in C, and the usage would be pretty clumsy, to say the least:
hash_table *ht = hash_table_new();
hash_table_add(ht, "name", "World");
lwan_tpl_render(hello, ht);
hash_table_free(ht);
Instead, I’ve decided to go on another road: use standard C structures to store the values in their native form, and then find a way to lookup these values whenever necessary to render the template.
The first trick, then, was to use a C99 feature called compound literals, which is supported by GCC even in C89 mode. This trick allows the use of anonymous arrays, among other things, and provides enough syntactic sugar to conveniently group the template variables:
lwan_tpl_render(hello, (hello_t[]) {{
.name = "World"
}});
Without a way to lookup which value to obtain from the structure, however,
this would not help much. Enter the second trick: the offsetof(3) macro, which
computes the offset of a field in a given structure. By storing this offset
alongside data type information, the value lookup is not only possible but
can also work with types different than strings:
typedef struct hello_t {
char *name;
int age;
};
/*
* The TPL_VAR_??? macros provides some convenience to declare each
* descriptor. These expand to a declaration containing the name of
* the variable as a string (used to validate the template during
* compile time), the field offset, and pointers to functions that
* convert the values to string and check if they're empty.
*
* The SENTINEL type is there so the template compiler knows when to
* stop looking for descriptors, since of course you can have as many
* fields as necessary.
*/
lwan_var_descriptor_t hello_descriptor[] = {
TPL_VAR_STR(hello_t, name),
TPL_VAR_INT(hello_t, age),
TPL_VAR_SENTINEL
};
lwan_tpl_t *hello;
strbuf_t *rendered;
/*
* ``hello'' would usually be compiled once and kept around for
* the whole duration of the program.
*/
hello = lwan_tpl_compile("hello.tpl", hello_descriptor);
/*
* Rendering the template then would be just the matter of calling
* this function, which will output a ``strbuf_t''. The template
* compiler estimates the starting size of this string buffer, so
* rendering will incur in very few expensive reallocations, if
* there are reallocations at all.
*/
rendered = lwan_tpl_render(hello, (hello_t[]) {{
.name = "World",
.age = 42
}});
printf("%s\n", strbuf_get_buffer(rendered));
strbuf_free(rendered);
Code for this engine is available in the wip branch of my toy web server,
lwan. It is not currently used there, but it is built alongside the main
program and can be tested by invoking the generated template executable.
Before using that in lwan, though, I’ll try to employ this nifty trick to JIT-compile the template and avoid some of the overhead where it really matters. While at the same time possibly opening a whole can of worms from the security standpoint, though – but it wouldn’t be fun without some risk, would it? :)
Posted on 11-11-2012 in C tricks template lwan❖
Programming on an Arduino without a PC
I’ve attended this year’s FISL, both as a booth attendee (at ProFUSION’s booth, demonstrating a few of our end-user-visible projects), and as a speaker for my old FINF project.
FINF is a Forth-like programming environment that I’ve written in my first year at the college. It’s not the first compiler I wrote, but it was the first that was actually fun to write. Some years later, I’ve decided to rewrite it so that it would work on the Arduino – and that’s what I went to FISL to talk about.
Arduinos are traditionally programmed by using its IDE, in a language that resembles C++. In fact, it is C++, but some of the (boring) details are hidden. But, being C++, it’s bound to the slow write-compile-upload-test procedures; there’s no interactive prompt, such as you have with Python or the venerable 8-bit Microsoft Basic. And since Arduino is all about experimentation, an interactive prompt is a must.
FINF is there to fill this gap. It is not a full FORTH implementation; only a small subset of it is there, but it’s enough to blink some LEDs, make some noise, and – if a video output shield is used – use the Arduino as an 8-bit computer! But, since user code actually runs on top of a very simple virtual machine due to the Harvard architecture used by the AVR microcontroller, it’s not possible to expand the interpreter without getting dirt in your hands. Add that to the quite messy code, mix it with myself not being a good marketer, and you have yet another failed open source project of mine! :)
In any case, the slides (in Portuguese) are available online. Unfortunately, the presentation was not recorded, so if you were not there, you’ve missed the great opportunity of seeing myself making a LED blink in front of an audience.
(By the way, I’ll be talking during EFL Developer Day in Barcelona early next month. If you’re there for LinuxCon/Embedded Linux Conference and would like to join me for some beers, don’t hesitate to contact me!)
Posted on 27-10-2012 in conferences arduino finf profusion❖
Vectored I/O with mmap() to serve files
Previously, I’ve improved file serving performance in lwan by
dramatially cutting down on the number of system calls performed to serve a
file. However, for small files (< 16KiB), the throughput drop from the hello
handler (which merely responds “Hello, World!”) was significant, since lwan
was still performing system calls to open, obtain the size, and close the
file.
I’ve experimented with userland caching before, but it never occurred to me
to use mmap(). For the unitiated, this system call offers a way to map a
file into memory, by giving a pointer to the process virtual memory space,
that, when dereferenced, will perform the necessary disk I/O if the pages
were not already present in the kernel buffers. Wikipedia has more
details about it. Using mmap() greatly simplifies caching code by
relaying it to the kernel, closer to where the low level buffers are.
By using a memory-mapped buffer and writev() (which the hello handler
uses through lwan’s abstractions), the file serving performance improved
about 60%! Before the optimization, weighttp would be able to make
~170000 requests/s. Now, ~286000 requests/s can be made. (That’s on my
laptop, a Core i7 2640m, with 8GiB of RAM and without spinning platters.)
Of course, introducing caching also introduces a lot of complexity. Not only the file serving handler almost doubled its size (from 350 lines to 610 lines), but I’ve had to add a hash table implementation (with around 430 lines) and a directory watcher that uses inotify at around 150 lines of C code. In the order of 840 lines of code to improve performance by about 60%. About 30% more lines of code to improve performance in 60% – not bad, methinks.
On the other hand, the cache mechanism brings shared mutable state. This is protected by mutexes, of course, but I’m not sure if I got it right. One more reason to not use lwan in production.
As a bonus to these things, lwan now offers deflated content for the files in the cache when asked.
Posted on 14-10-2012 in programming lwan C linux❖
Asynchronous I/O in C with Coroutines
Writing asynchronous I/O code in C is kind of tedious, and often leads to a callback hell. But it doesn’t have to be this way; if you have a main loop, it’s quite simple to use coroutines and write code in a soothing, old school, synchronous way.
Under POSIX, it’s also quite easy to implement coroutines, via the use of the stuff contained in the ucontext.h header. Unfortunately deprecated in favor of threads, the functions and structures found in this header are one of the unpopular gems in the POSIX C library.
Coroutines in lwan, my toy web server
In lwan, there are n + 1 worker threads, where n is the number of
logical CPUs. One thread is the acceptor thread, which accepts
connections, and gives control to the other n threads, which in turn do
all the work of receiving a request, parsing, and delivering the response.
Each of these worker threads can multiplex thousands of connections by polling on events: so, for each worker thread, only one request can be handled at a time. And, if one blocking operation (say, write to a socket) would block, all other requests would wait for a response.
By yielding the coroutine at the right moments, lwan blocks only on calls to epoll(4). Whenever the socket can be written again, that coroutine is resumed. The request handler function does not know what happened.
Implementing coroutines using ucontext
As said earlier, POSIX offers some infrastructure that can be used to implement coroutines or more powerful concepts, like call/cc. However, they’re quite tricky to use, so it’s often a good idea to offer a thin wrapper on top of them. The API used in lwan is the following (implementation details omitted for brevity):
coro_t *coro_new(coro_function_t function, void *data);
void coro_free(coro_t *coro);
int coro_resume(coro_t *coro);
void coro_yield(coro_t *coro, int value);
Coroutine allocation
A coroutine is pretty much a simple data structure; the real implementation has more fields, but they’re implementation details:
struct coro_t_ {
coro_function_t function;
void *data;
ucontext_t context;
int yield_value;
char stack[1];
};
To allocate one, just allocate space for sizeof(coro_t_) + stack_size,
initialize the variables, and call getcontext(&coro->context).
(These context-swapping functions are weirdly-named in my opinion: getcontext
actually saves the current context into the variable pointed to by its
sole parameter.)
After that, one just need to set up the context so that it points to the newly-allocated stack:
coro->context.uc_stack.ss_sp = coro->stack;
coro->context.uc_stack.ss_size = stack_size;
coro->context.uc_stack.ss_flags = 0;
coro->context.uc_link = NULL;
And then call makecontext() so that the coroutine entry point can be
called when resuming the coroutine. (Another weirdly-named function. No
wonder why this thing is quite unpopular.) This function takes a variable
number of parameters, each one being a 32-bit integer value. I don’t know
why a void * wasn’t used instead – so, on 64-bit, I use an union to break
a pointer into two 32-bit components:
union ptr_splitter {
void *ptr;
uint32_t part[2];
};
That’s it to allocate a coroutine.
Freeing a coroutine
Just free the coroutine’s coro_t structure. Lwan’s implementation does run
the deferred statements at this moment.
Resuming a coroutine
Resuming a coroutine pretty simple: one has to save the current context, swap the current context with the coroutine context, and when the coroutine yields, return the contexts where they were.
Yielding from a coroutine
To yield from a coroutine, just save value into coro_t’s yield_value
field, and make a call to swapcontext(), swapping the current coroutine
stack with the context that was active when the coroutine was resumed (which
happens to be the routine that resumes a coroutine – which now cleans up
and return to whoever called it, most probably the main loop). value is
now available to whoever called coro_resume() and is used in lwan to
determine if a coroutine should be resumed.
Using the coroutines
From the user perspective, it’s just like calling some blocking function:
lwan_sendfile(request->socket, file_fd, 0, total_bytes);
Behind the scenes, lwan_sendfile is actually doing this (error handling
omitted for brevity):
while (sent_bytes < total_bytes) {
read_bytes = read(file_fd, buffer, buffer_size);
sent_bytes += read_bytes;
write(socket_fd, buffer, read_bytes);
coro_yield(coro, 1);
}
(Of course, if available, sendfile(2) is used instead in a similar fashion, but this better illustrates the point.)
Whenever the coroutine yields, it goes back to the main loop, which is now free to resume another coroutine. Ideally, one could yield to be resumed on a certain condition (instead of assuming that the condition is just “the socket is ready to be written to”), but this isn’t possible in the current implementation.
For implementation simplicity, the same timer code that is used for keep-alive connections is used for coroutines, so that they don’t linger indefinitely.
Implementation details
- On 64-bit, hand-tuned assembly versions of
ucontextroutines are used. These routines avoid saving and restoring the signal mask (avoiding two roundtrips to the kernel), and does not save the floating point registers. - Also, on 64-bit, resuming a coroutine is orders of magnitude faster, since not everything is copied when switching contexts.
- Swapping stacks makes tools like Valgrind get pretty crazy. Lwan’s implementation uses Valgrind-provided macros that marks the newly-allocated blocks (from the heap) as stacks.
- The real implementation has a
coro_switcher_tdata structure. This structure is used to both avoid race conditions when swapping coroutines in different threads, but also to maintain coroutine state from different threads.
There are other details that were ommitted from this post. Lwan’s source code is small enough it can digested easily, and if you’re not sleeping already, check it out.
Closing notes
Although not as performant as the traditional way of using callbacks (resuming and yielding from coroutines are a little bit more expensive than calling a function), coroutines brings a lot of simplicity when writing asynchronous code.
The example shown here might not be expressive, but imagine an application fetching data from a key-value store from another machine: there might be dozens of calls to the database to build a web page, which would be pretty difficult to handle if there were dozen callbacks. With a synchronous style, that would be a lot easier to write and maintain.
One could argue that the same thing could be done in threads. But creating
more threads than there are processors will often hurt performance (for
various reasons) in noticeable ways. Also, coroutines are cheaper on the
memory requirements: in Lwan, they sport 16KiB of stack space and it takes a
little bit more than a malloc() to set them up.
I believe we should stop using callbacks for asynchronous I/O and use things
like this. Even if ucontext.h is deprecated from POSIX, the functions a
fairly trivial to write (even in assembly language) – actually, encouraged,
given that swapcontext() and getcontext() makes often unnecessary system
calls.
❖
Presenting EasyUI
Introduction
I’ve been working at ProFUSION on a project called EasyUI for the past few months. This library is based on Google’s V8 JavaScript engine and the Enlightenment Foundation Libraries and aims to diminish the hurdle in writing native applications for the forthcoming Tizen platform.
EFL itself – specially its UI toolkit, Elementary – follows a pretty traditional approach to creating applications: the library user must know how to join all bits and pieces, which often leads to common code that is written and rewritten in each new application.
By observing common patterns and providing an uniform MVC interface, EasyUI parts from this traditional approach and offers a new way to create applications using EFL.
Before talking about code, let me show a video (best viewed in HD) of some sample EasyUI applications being executed:
Brief overview of an EasyUI app
An app begins with a simple call to EUI.app(), passing at least one parameter: the
main view-controller. The other parameter is an optional object that
contains application settings, such as the theme or window titles. EasyUI
applications are contained inside one window only, since the main focus are
apps for mobile devices with stacked lists and the eventual popup that
appears on top of the content.
The code below shows a typical controller-view; explanation will follow.
MainController = EUI.ListController({
title: 'My Favorite Fruits',
model: new ArrayModel(['Pear', 'Banana', 'Uvaia']),
itemAtIndex: function(index) {
return {
text: this.model.itemAtIndex(index)
};
},
selectedItemAtIndex: function(index) {
var fruit = this.model.itemAtIndex(index);
this.pushController(new FruitController(fruit));
},
navigationBarItems: { right: 'Add' },
selectedNavigationBarItem: function(item) {
if (item === 'Add')
this.pushController(new AddFruitController);
}
});
Lots of things happens with the declaration of MainController. Of note:
- As opposed to the traditional way of laying out components on screen with EFL, controllers implements the basic user interface and the application developer focuses only on defining behavior.
- Attributes can be functions, which will be called whenever EasyUI needs
them. For instance, one could change the title based on how many fruits were
in the model, by writing a
titlefunction like so:
function() {
if (this.model.length() == 1)
return "My Favorite Fruit";
return "My Favourite Fruits";
}
MainControlleris derived fromListController. This means that it follows a collection contract, and must implement methods likeitemAtIndexandselectedItemAtIndex, as well as providing amodelattribute.- One could simply swap
ListControllerforGridControllerif a grid layout were to be more appropriate for this particular application. - In addition to the basic contract, controllers might sign for more; for
instance by declaring
navigationBarItems, one is required to implementselectedNavigationBarItem.
Behind the scenes, the framework will initialize the EFL, create the window and required widgets, listen to callbacks – and call the application code in appropriate moments.
The next post in this series will show the anatomy of a Reddit client.
Show me the code
We’re just working on some licensing issues right now. This should be released as an open source project. As soon as this is cleared up, EasyUI should hit Enlightenment’s SVN repository.
Posted on 21-09-2012 in programming profusion javascript tizen efl❖
File serving with few system calls
When I first wrote lwan, file serving was not a primary goal. I’ve added this capability later, never giving much thought to the number of system calls required to serve one file. As a result, static file serving was quite slow compared to “hello world” requests. Bored one day, I’ve decided to speed this as much as I could.
Before optimizing, serving a file would look like this:
<... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 16383, 4294967295) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(8, "GET / HTTP/1.0\r\n", 4096) = 16
getcwd("/home/leandro/git/lwan/build", 4096) = 29
lstat("/home/leandro/git/lwan/build/files_root", {st_mode=S_IFLNK|0777, st_size=33, ...}) = 0
readlink("/home/leandro/git/lwan/build/files_root", "/home/leandro/git/blotest/output/", 4095) = 33
lstat("/home", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/home/leandro", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/home/leandro/git", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
lstat("/home/leandro/git/blotest", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/home/leandro/git/blotest/output", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
getcwd("/home/leandro/git/lwan/build", 4096) = 29
lstat("/home/leandro/git/lwan/build/files_root", {st_mode=S_IFLNK|0777, st_size=33, ...}) = 0
readlink("/home/leandro/git/lwan/build/files_root", "/home/leandro/git/blotest/output/", 4095) = 33
lstat("/home", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/home/leandro", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/home/leandro/git", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
lstat("/home/leandro/git/blotest", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/home/leandro/git/blotest/output", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
open("/home/leandro/git/blotest/output", O_RDONLY|O_NOATIME) = 9
fstat(9, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
close(9) = 0
open("/home/leandro/git/blotest/output/index.html", O_RDONLY|O_NOATIME) = 9
fstat(9, {st_mode=S_IFREG|0664, st_size=13200, ...}) = 0
setsockopt(8, SOL_TCP, TCP_CORK, [1], 4) = 0
write(8, "HTTP/1.0 200 OK\r\nContent-Length:"..., 100) = 100
fadvise64(9, 0, 13200, POSIX_FADV_SEQUENTIAL) = 0
sendfile(8, 9, NULL, 1400) = 1400
epoll_ctl(6, EPOLL_CTL_MOD, 8, {EPOLLOUT|EPOLLERR|0x2000, {u32=8, u64=8}}) = 0
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400) = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400) = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400) = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400) = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400) = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400) = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400) = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400) = 1400
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}}, 16383, 1000) = 1
sendfile(8, 9, NULL, 1400) = 600
close(9) = 0
epoll_ctl(6, EPOLL_CTL_MOD, 8, {EPOLLIN|EPOLLERR|EPOLLET|0x2000, {u32=8, u64=8}}) = 0
close(8) = 0
Yes. That many system calls – I was not kidding when I said that file serving was added as an afterthought. After some experiments, I’ve managed to turn that mess into this:
<... epoll_wait resumed> {{EPOLLIN, {u32=9, u64=9}}}, 16383, 4294967295) = 1
read(9, "GET / HTTP/1.0\r\n", 4096) = 16
newfstatat(7, "index.html", {st_mode=S_IFREG|0664, st_size=13200, ...}, 0) = 0
openat(7, "index.html", O_RDONLY|O_NOATIME) = 10
sendto(9, "HTTP/1.0 200 OK\r\nContent-Length:"..., 223, MSG_MORE, NULL, 0) = 223
fadvise64(10, 0, 13200, POSIX_FADV_SEQUENTIAL) = 0
sendfile(9, 10, [0], 13200) = 13200
close(10) = 0
Ah, much better! This was a result of these steps:
- Caching the root directory information. This is mainly its
realpath()and an open file descriptor to the directory.- The
realpath()result is used with astrncmp()after so that requests to file outside the root directory are not served successfully. The string length for the root path is also calculated only once. - The astute reader will notice the usage of the Linux-only
openat()andnewfstatat()system calls. These-at()variants perform much like their standard ones, but they work relative to the directory pointed to the file descriptor passed as the first parameter, avoiding some expensive path-to-inode conversions.
- The
- Using
sendto()withMSG_MOREflag instead of usingTCP_CORKflag. This makes for two less roundtrips to the kernel to set a socket option. - Using
sendfile()with the whole file instead of sending it in chunks, to avoid coroutine context switches.sendfile()might still block, but in this case, the coroutine will yield and the next time, try to send a smaller chunk. - The
-at()version of system calls are also being used by a replacementrealpath()routine that I’ve adapted from glibc. This improved the performance as well by not only reducing the number of system calls (the standardrealpath()will perform alstat()for every path component, whereas this version will only performnewfstatat()for relative components), but also using the lighter-at()variants.
These improvements resulted in very low overhead while serving files. In fact, compared to a simple hello world handler and file serving – without keep-alive – the performance drop even comparing the I/O involved is about 5%.
Posted on 12-08-2012 in programming lwan C linux strace❖

