Lwan: 5 years in snippets
Around five years ago, I wrote a blog post that went through the life cycle of an HTTP request, as seen by my toy web server, Lwan. It was a surprisingly popular article, not only raising visibility for my toy project, but also generating some discussion on link-aggregator web sites and in personal emails. (You can read the blog post if you haven't already or need a refresher.)
While I haven't been working on it with the vigor of its first few years, a few things have changed nonetheless; this article is a follow-up to that one. (Which I recommend reading, as out of date as it is: most things in the server didn't change.)
Items in this article aren't in any particular order; it's just a collection of changes that happened since the original article was written. Not everything is mentioned here, of course, but it should give an idea of the kind of work that kept me busy during some lazy Sundays.
Main Loop Changes
The main loop was designed to either wait indefinitely or time out every second, depending on how many file descriptors were being watched by a worker thread. This made it quite hard to implement something required by some use cases I wanted Lwan for: the ability for a request handler to pause execution for a specified amount of time.
Request handlers, being executed in coroutines, are subject to a cooperative scheduler, so they can't just use system calls like usleep(); instead, many changes were made to the main loop to support timers, paving the way for other potential enhancements (e.g. watching non-client file descriptors).
Each thread has a timer wheel data structure, which can be thought of as an efficient priority queue whose sole purpose is maintaining a set of timers. Usually, I'd just implement one from scratch; however, while looking at implementations for inspiration, I ended up stumbling upon timeout.c, which has a decent API and has seen some abuse in programs like Tor. I performed a few minor tweaks when importing the library into the tree, and am now using it to control how long epoll_wait() will block waiting for file descriptor events; if none were returned, timers are processed, and coroutines are marked as ready to be resumed.
static void *thread_io_loop(void *data)
{
/* This is the entry point for the worker threads. The infinite loop
* below is the meat of event handling in Lwan. Everything passes
* through this loop. */
/* (Initialization omitted for brevity.) */
for (;;) {
/* Turning the timer wheel will also process pending timers,
* including the one that moves the timeout queue and updates the
* date/time cache. It then returns how much time epoll_wait()
* has to wait in ms. */
int timeout = turn_timer_wheel(&dq, t, epoll_fd);
int n_fds = epoll_wait(epoll_fd, events, max_events, timeout);
/* To quit Lwan, all epoll file descriptors are closed, and the
* threads are nudged. This wakes up epoll_wait(), and the next
* time it's called, it's going to probably return EBADF. */
if (UNLIKELY(n_fds < 0)) {
if (errno == EBADF || errno == EINVAL)
break;
continue;
}
for (struct epoll_event *event = events; n_fds--; event++) {
if (UNLIKELY(!event->data.ptr)) {
accept_nudge(read_pipe_fd, t, lwan->conns, &dq, &switcher,
epoll_fd);
continue;
}
struct lwan_connection *conn = event->data.ptr;
if (UNLIKELY(event->events & (EPOLLRDHUP | EPOLLHUP))) {
death_queue_kill(&dq, conn);
continue;
}
resume_coro(&dq, conn, epoll_fd);
death_queue_move_to_last(&dq, conn);
}
}
/* (Cleanup omitted for brevity.) */
return NULL;
}
While coroutines are sleeping, their file descriptors are still tracked by epoll, in case the connection is dropped by the peer. Waking a coroutine up is then essentially a matter of listening again for the events it was listening for before it went to sleep. (At the moment, this means both read and write events, as that was easier to implement; it's different on kqueue systems, where reading and writing are not events but filters, which can't be combined in a single call.)
A coroutine expresses its desire to sleep to the main loop by yielding with a special value. (Previously, it could only say "yield, but resume me later" or "yield, but destroy me whenever you can"; this changed in recent months.)
With all the infrastructure work done, the API entry point to suspend a coroutine for a specified amount of time becomes trivial:
static void remove_sleep(void *data1, void *data2)
{
struct timeouts *wheel = data1;
struct timeout *timeout = data2;
struct lwan_request *request =
container_of(timeout, struct lwan_request, timeout);
if (request->conn->flags & CONN_SUSPENDED_TIMER)
timeouts_del(wheel, timeout);
request->conn->flags &= ~CONN_HAS_REMOVE_SLEEP_DEFER;
}
void lwan_request_sleep(struct lwan_request *request, uint64_t ms)
{
struct lwan_connection *conn = request->conn;
struct timeouts *wheel = conn->thread->wheel;
request->timeout = (struct timeout) {};
timeouts_add(wheel, &request->timeout, ms);
if (!(conn->flags & CONN_HAS_REMOVE_SLEEP_DEFER)) {
coro_defer2(conn->coro, remove_sleep, wheel, &request->timeout);
conn->flags |= CONN_HAS_REMOVE_SLEEP_DEFER;
}
/* The meaning of CONN_CORO_SUSPEND_TIMER will become clear in the next
* section. */
coro_yield(conn->coro, CONN_CORO_SUSPEND_TIMER);
}
(The next section explains a little bit more of how the infrastructure works.)
Due to the nature of the hashed timer wheel, when epoll_wait() wakes up every second to update the time cache and process the timeout queue, it may wake up a few more times with intervals smaller than 1 second. In practice, this ends up being slightly more accurate, as the system's monotonic clock source is used to offset any time spent between epoll_wait() invocations, avoiding timer drift. (It's only slightly more accurate because the coarse monotonic clock is used if available, while the timeout in epoll_wait() uses the fine-grained variant instead. Maybe an EPOLL_COARSE_CLOCKSOURCE flag to epoll_create1() would be worth investigating?)
Changing Connection Coroutine Yield Values
One of the trickiest bits in Lwan was its main loop and how it determined which epoll event mask to choose depending on the connection state. It was often the case that a connection would stall indefinitely for no apparent reason; this usually happened when it had declared that it wanted to be resumed only when the socket was ready for reading, when it actually wanted to write. The code responsible for this was very brittle and didn't make much sense, although it worked most of the time. It really needed a big overhaul (which turned out not to be big, line-count wise).
The main idea behind the changes here was to add new values that coroutines could use to inform the scheduler of what they were interested in: reading from the socket, writing to the socket, sleeping, that sort of thing. This is mostly hidden from the rest of the code, though, and only the I/O wrappers need to be aware of it (a coroutine now yields automatically, for instance, if lwan_read() detects that read() failed with EAGAIN, changing the coroutine's intent so it is resumed whenever the socket can be read again).
The nice thing about this change is that it replaced a bunch of branches and operations with straightforward table lookups. In the past few years, I've been moving towards using lookup tables a whole lot more; it's sometimes difficult to express a rat's nest of branches in a way that can be looked up efficiently in a table, but it's a satisfying feeling when all that crud is gone and all you're left with are a few array accesses.
Contrast the new, shiny, table-based approach:
static ALWAYS_INLINE uint32_t
conn_flags_to_epoll_events(enum lwan_connection_flags flags)
{
static const uint32_t map[CONN_EVENTS_MASK + 1] = {
[0 /* Suspended by timer */] = EPOLLRDHUP,
[CONN_EVENTS_WRITE] = EPOLLOUT | EPOLLRDHUP,
[CONN_EVENTS_READ] = EPOLLIN | EPOLLRDHUP,
[CONN_EVENTS_READ_WRITE] = EPOLLIN | EPOLLOUT | EPOLLRDHUP,
};
return map[flags & CONN_EVENTS_MASK];
}
#if defined(__linux__)
# define CONN_EVENTS_RESUME_TIMER CONN_EVENTS_READ_WRITE
#else
/* Kqueue doesn't like when you filter on both read and write, so
* wait only on write when resuming a coro suspended by a timer.
* The I/O wrappers should yield if trying to read without anything
* in the buffer, changing the filter to only read, so this is OK. */
# define CONN_EVENTS_RESUME_TIMER CONN_EVENTS_WRITE
#endif
static void update_epoll_flags(int fd,
struct lwan_connection *conn,
int epoll_fd,
enum lwan_connection_coro_yield yield_result)
{
static const enum lwan_connection_flags or_mask[CONN_CORO_MAX] = {
[CONN_CORO_YIELD] = 0,
[CONN_CORO_WANT_READ_WRITE] = CONN_EVENTS_READ_WRITE,
[CONN_CORO_WANT_READ] = CONN_EVENTS_READ,
[CONN_CORO_WANT_WRITE] = CONN_EVENTS_WRITE,
/* While the coro is suspended, we're not interested in either EPOLLIN
* or EPOLLOUT events. We still want to track this fd in epoll, though,
* so unset both so that only EPOLLRDHUP (plus the implicitly-set ones)
* are set. */
[CONN_CORO_SUSPEND_TIMER] = CONN_SUSPENDED_TIMER,
/* Either EPOLLIN or EPOLLOUT have to be set here. There's no need to
* know which event, because they were both cleared when the coro was
* suspended. So set both flags here. This works because EPOLLET isn't
* used. */
[CONN_CORO_RESUME_TIMER] = CONN_EVENTS_RESUME_TIMER,
};
static const enum lwan_connection_flags and_mask[CONN_CORO_MAX] = {
[CONN_CORO_YIELD] = ~0,
[CONN_CORO_WANT_READ_WRITE] = ~0,
[CONN_CORO_WANT_READ] = ~CONN_EVENTS_WRITE,
[CONN_CORO_WANT_WRITE] = ~CONN_EVENTS_READ,
[CONN_CORO_SUSPEND_TIMER] = ~CONN_EVENTS_READ_WRITE,
[CONN_CORO_RESUME_TIMER] = ~CONN_SUSPENDED_TIMER,
};
enum lwan_connection_flags prev_flags = conn->flags;
conn->flags |= or_mask[yield_result];
conn->flags &= and_mask[yield_result];
if (conn->flags == prev_flags)
return;
struct epoll_event event = {
.events = conn_flags_to_epoll_events(conn->flags),
.data.ptr = conn,
};
if (UNLIKELY(epoll_ctl(epoll_fd, EPOLL_CTL_MOD, fd, &event) < 0))
lwan_status_perror("epoll_ctl");
}
With the crusty, buggy, old approach that only worked by chance (and was the source of a lot of headache):
static void update_epoll_flags(struct death_queue *dq,
struct lwan_connection *conn,
int epoll_fd,
enum lwan_connection_coro_yield yield_result)
{
uint32_t events = 0;
bool write_events;
if (UNLIKELY(conn->flags & CONN_RESUMED_FROM_TIMER)) {
conn->flags &= ~(CONN_RESUMED_FROM_TIMER | CONN_WRITE_EVENTS);
write_events = false;
} else if (UNLIKELY(conn->flags & CONN_SUSPENDED_BY_TIMER)) {
/* CONN_WRITE_EVENTS shouldn't be flipped in this case. */
events = EPOLLERR | EPOLLRDHUP;
} else if (conn->flags & CONN_MUST_READ) {
write_events = true;
} else {
bool should_resume_coro = (yield_result == CONN_CORO_MAY_RESUME);
if (should_resume_coro)
conn->flags |= CONN_SHOULD_RESUME_CORO;
else
conn->flags &= ~CONN_SHOULD_RESUME_CORO;
write_events = (conn->flags & CONN_WRITE_EVENTS);
if (should_resume_coro == write_events)
return;
}
if (LIKELY(!events)) {
events = events_by_write_flag[write_events];
conn->flags ^= CONN_WRITE_EVENTS;
}
struct epoll_event event = {.events = events, .data.ptr = conn};
int fd = lwan_connection_get_fd(dq->lwan, conn);
if (UNLIKELY(epoll_ctl(epoll_fd, EPOLL_CTL_MOD, fd, &event) < 0))
lwan_status_perror("epoll_ctl");
}
Changes in parsers
RFC822 (Date headers)
Unhappy with the performance of strptime(), I came up with a parser that fits the theme of the rest of the HTTP parser in Lwan quite well: by using string switch statements, the new time parser is around 12x faster than the generic one from the C library.
int lwan_parse_rfc_time(const char in[static 30], time_t *out)
{
/* This function is used instead of strptime() because locale
* information can affect the parsing. Instead of defining
* the locale to "C", use hardcoded constants. */
struct tm tm;
const char *str = in;
STRING_SWITCH(str) {
case STR4_INT('S','u','n',','): tm.tm_wday = 0; break;
case STR4_INT('M','o','n',','): tm.tm_wday = 1; break;
case STR4_INT('T','u','e',','): tm.tm_wday = 2; break;
case STR4_INT('W','e','d',','): tm.tm_wday = 3; break;
case STR4_INT('T','h','u',','): tm.tm_wday = 4; break;
case STR4_INT('F','r','i',','): tm.tm_wday = 5; break;
case STR4_INT('S','a','t',','): tm.tm_wday = 6; break;
default: return -EINVAL;
}
str += 5;
tm.tm_mday = parse_2_digit_num(str, ' ', 1, 31);
if (UNLIKELY(tm.tm_mday < 0))
return -EINVAL;
str += 3;
STRING_SWITCH(str) {
case STR4_INT('J','a','n',' '): tm.tm_mon = 0; break;
case STR4_INT('F','e','b',' '): tm.tm_mon = 1; break;
case STR4_INT('M','a','r',' '): tm.tm_mon = 2; break;
case STR4_INT('A','p','r',' '): tm.tm_mon = 3; break;
case STR4_INT('M','a','y',' '): tm.tm_mon = 4; break;
case STR4_INT('J','u','n',' '): tm.tm_mon = 5; break;
case STR4_INT('J','u','l',' '): tm.tm_mon = 6; break;
case STR4_INT('A','u','g',' '): tm.tm_mon = 7; break;
case STR4_INT('S','e','p',' '): tm.tm_mon = 8; break;
case STR4_INT('O','c','t',' '): tm.tm_mon = 9; break;
case STR4_INT('N','o','v',' '): tm.tm_mon = 10; break;
case STR4_INT('D','e','c',' '): tm.tm_mon = 11; break;
default: return -EINVAL;
}
str += 4;
tm.tm_year = parse_int(strndupa(str, 4), -1);
if (UNLIKELY(tm.tm_year < 0))
return -EINVAL;
tm.tm_year -= 1900;
if (UNLIKELY(tm.tm_year < 0 || tm.tm_year > 1000))
return -EINVAL;
str += 5;
tm.tm_hour = parse_2_digit_num(str, ':', 0, 23);
str += 3;
tm.tm_min = parse_2_digit_num(str, ':', 0, 59);
str += 3;
tm.tm_sec = parse_2_digit_num(str, ' ', 0, 59);
str += 3;
STRING_SWITCH(str) {
case STR4_INT('G','M','T','\0'):
tm.tm_isdst = -1;
*out = timegm(&tm);
if (UNLIKELY(*out == (time_t)-1))
return -EINVAL;
return 0;
default:
return -EINVAL;
}
}
Wojciech Muła wrote a better version using SIMD instructions after learning about this routine; it was then improved by Kendall Willets to use a different technique (a perfect hash table). Both are faster than the version used in Lwan, but I decided to keep mine because not only is it fast enough, it's also more maintainable; it also serves as a pretty good example of the string switch trick.
(To complement this change, the function that converts a time_t into the same string that lwan_parse_rfc_time() parses has been written as well, and it's as efficient as I could make it, even going to the effort of reducing the number of divisions needed to convert the integers into strings from 6 to just 1.)
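To illustrate the kind of trick involved in shaving off divisions (a sketch with my own names, not necessarily what Lwan does), a 200-byte lookup table converts any value in [0, 99] into two ASCII digits without a single division or modulo:

```c
/* Two digits per entry: "00", "01", ..., "99". */
static const char two_digits[200] =
    "00010203040506070809"
    "10111213141516171819"
    "20212223242526272829"
    "30313233343536373839"
    "40414243444546474849"
    "50515253545556575859"
    "60616263646566676869"
    "70717273747576777879"
    "80818283848586878889"
    "90919293949596979899";

/* Write `v` (0..99) as two ASCII characters; no division, no modulo. */
void format_2_digits(char out[2], unsigned v)
{
    out[0] = two_digits[v * 2];
    out[1] = two_digits[v * 2 + 1];
}
```

Since day, hour, minute, and second are all two-digit fields, most of an RFC822 date can be emitted this way.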
Configuration parser & Template parser
Both the configuration file parser and the template parser were written without paying attention to anything related to compiler theory; they were just ad-hoc parsers without a proper lexer, looking at one line at a time. Both had a lot of workarounds and were generally hard to extend and debug. They needed to be rewritten, but I really didn't want to use a parser generator (I really don't like them).
For both of these parsers, I used the same technique as in the JSON parser I wrote for the Zephyr project: a state machine (where the variable that holds the state actually holds a pointer to the function that handles that state), and a ring buffer. I really like this technique, as it's often easier to grasp than many of the parser generators out there. (There's the added benefit that no new build dependencies or build system changes are required.)
Here's how one of these state functions might look; this one doesn't use the ring buffer, and only consumes bytes from the input:
static void *lex_comment(struct lexer *lexer)
{
/* lex_config() returns 'lex_comment' when 'next(lexer)' returns '#' */
while (iscomment(next(lexer)))
;
/* Current character isn't a comment, back it up: next state should be
* able to read it and determine which state is going to process the
* next token. */
backup(lexer);
/* lex_config() handles the main state */
return lex_config;
}
And here's how the "get me a new token" function that the parser calls looks:
static bool lex_next(struct lexer *lexer, struct lexeme **lexeme)
{
/* To end the state machine, a state-handling function returns NULL,
* ending this loop. */
while (lexer->state) {
/* If a state handling function emits one or more tokens, for as
* long as lex_next() is called, instead of calling the function to
* handle the current state, items from the ring buffer are popped
* instead. */
if (lexeme_buffer_consume(&lexer->buffer, lexeme))
return true;
/* Run the state machine for the current state, and update it if
* necessary. */
lexer->state = lexer->state(lexer);
}
/* Exhaust the ring buffer until there's nothing left. */
return lexeme_buffer_consume(&lexer->buffer, lexeme);
}
The configuration file parser got a few nice features, like the ability to expand environment variables (or use a default value provided in the configuration file), multiline strings, and other minor changes and bugfixes that would have been too complicated to implement in the previous bespoke one-line-at-a-time thing.
The template parser also got significantly more robust, handling some corner cases that were just impossible before. (No significant changes have been made to the runtime portion, although a few tweaks here and there were made over the years.) Unrelated to the parser changes, a template can now be "compiled" with an option that won't allocate and copy a new string for each text fragment, but instead uses text that's already in memory somewhere; this is useful for things such as the file serving module or the default response generator (used when a template has not been supplied by the user in the configuration file), where the strbuf structs used by the template mechanism can just point into rodata.
Readahead & Madvise
The Linux sendfile() system call doesn't take flags like the FreeBSD variant, so it's impossible to tell it not to block if data isn't in core yet. The best one can do, as far as I can tell, is to make sure that the data is already there when it is invoked. That's why Lwan now has a low-priority thread (which also has low I/O priority on Linux) that calls readahead() on file chunks as they're being served.
This thread takes commands through a pipe, where only the write side is non-blocking. (Failure to write to that pipe isn't an issue, as this is merely an optimization, so this is one of the rare cases in Lwan where a syscall error isn't handled.) As commands are received, that thread is free to block (or, more specifically, to wait for as long as necessary) to load the contents into core, so that hopefully the thread calling sendfile() won't block. This optimization was only possible because the cache in the file serving module keeps the files open.
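A minimal sketch of that command pipe (the names and the command layout here are my own; error handling trimmed):

```c
#define _GNU_SOURCE /* for readahead() */
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* A command posted by a worker thread to the low-priority thread. */
struct ra_cmd {
    int fd;    /* file kept open by the file serving cache */
    off_t off; /* chunk about to be served */
    size_t len;
};

/* Best effort: the write side is O_NONBLOCK, so if the pipe is full the
 * request is simply dropped; readahead is only an optimization. */
bool post_readahead(int pipe_wr, int file_fd, off_t off, size_t len)
{
    struct ra_cmd cmd = {.fd = file_fd, .off = off, .len = len};

    return write(pipe_wr, &cmd, sizeof(cmd)) == (ssize_t)sizeof(cmd);
}

/* Runs in the low-priority thread; the read side blocks, and this thread
 * is free to wait for as long as it takes to fault the pages in. */
void readahead_loop(int pipe_rd)
{
    struct ra_cmd cmd;

    while (read(pipe_rd, &cmd, sizeof(cmd)) == (ssize_t)sizeof(cmd))
        (void)readahead(cmd.fd, cmd.off, cmd.len); /* errors ignored on purpose */
}
```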
For the cases where files are served from a memory-mapped blob, the problem can be even more apparent; there's no way to say "dereference this memory but please don't block" on any of the supported platforms. Your only bet is to memory map the file, give the OS a hint to pre-fault the pages, and lock them in memory. But the madvise() and mlock() combo used for this eventually runs into problems similar to the readahead() syscall: while they might not exactly block, they'll take some time to process whatever they need to process. So the same thread that performs readahead() will also try to bring those pages into core and keep them there for as long as the file is in the file serving cache instance.
Even with a thread performing readahead(), it's possible for an I/O thread to block (or be stuck in an operation that takes more time than absolutely necessary), increasing latency when handling the other connections assigned to it. This is why I've been thinking for quite a while now about how to introduce a work-stealing scheduler to Lwan, with a watchdog thread to determine whether worker threads are making any kind of progress; but that's something for the future, I guess.
(Recent advancements in things like io_uring and the AIO subsystem may change how these things are done in Lwan. PSI can be used to avoid doing work when the system is under pressure. I have not investigated those in depth so far.)
Coroutine changes
There have been a few minor tweaks in the coroutine implementation. Nothing groundbreaking, but worth mentioning anyway.
The data pointer has been removed from the coro struct, as it can simply be passed as a parameter to the function implementing the coroutine. This required the trampoline function for x86-64 to be written in assembly, since one of the registers used for parameter passing is not in the list of callee-saved registers (and is thus not saved/restored by the context-swapping routine); but otherwise, everything seems to be working as expected.
Deferred callbacks are now stored in an array rather than a linked list. This makes defer cheaper to use (fewer calls to malloc() per coro_defer() call), which has become an important aspect of how resources are cleaned up in Lwan. To reduce heap allocations further, the array is initially allocated inline with the coroutine struct and moves to the heap only if necessary; this cuts two round trips to the memory allocator per connection in the usual case (this optimization is available for all array structs if desired; more on this below).
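The inline-then-heap idea looks roughly like this (a sketch with made-up names; Lwan's actual struct differs):

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_CAP 8

struct defer {
    void (*func)(void *);
    void *data;
};

/* The first INLINE_CAP entries live inside the owning struct, so the
 * common case needs no heap allocation at all. */
struct defer_array {
    struct defer inline_storage[INLINE_CAP];
    struct defer *items; /* points at inline_storage or at the heap */
    size_t count, capacity;
};

void defer_array_init(struct defer_array *a)
{
    a->items = a->inline_storage;
    a->count = 0;
    a->capacity = INLINE_CAP;
}

bool defer_array_append(struct defer_array *a, void (*func)(void *), void *data)
{
    if (a->count == a->capacity) {
        size_t new_cap = a->capacity * 2;
        struct defer *heap;

        if (a->items == a->inline_storage) {
            /* First spill: copy the inline elements out to the heap. */
            heap = malloc(new_cap * sizeof(*heap));
            if (heap)
                memcpy(heap, a->inline_storage, a->count * sizeof(*heap));
        } else {
            heap = realloc(a->items, new_cap * sizeof(*heap));
        }
        if (!heap)
            return false;

        a->items = heap;
        a->capacity = new_cap;
    }

    a->items[a->count++] = (struct defer){.func = func, .data = data};
    return true;
}
```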
The following table shows the number of malloc() calls for 100,000 requests to a "Hello, World" handler and the impact of inlining deferred calls, for both keep-alive and close connections (with 1000 concurrent connections):
| Connection | Malloc Calls Before | Malloc Calls After |
|---|---|---|
| Keep-Alive | 1,031 | 931 |
| Close | 300,731 | 200,728 |
An embarrassing fix in the coroutine implementation has been in how the stack is aligned on x86-64. In some situations, Lwan would crash (with a SIGSEGV, no less) on instructions that were not supposed to cause that kind of signal (e.g. it would crash while Lua was converting a number into a string). Tools such as Valgrind, sanitizers, and debuggers weren't much help in pinpointing the location. It turns out that the way the stack pointer for coroutines was aligned was incorrect: had sprintf() (which Lua uses to convert floating-point numbers to strings) been implemented using x87, this would have worked fine; however, SSE requires aligned memory, so things were crashing on such a trivial operation. The stack is now aligned on a 16-byte boundary, adjusted to an 8-byte boundary right before the trampoline routine is called. This one-line change took me a lot more hours than I care to admit, but it's now finally fixed. It was the first time I used rr, and it's now an integral part of my toolbelt.
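For reference, the alignment fix amounts to something like this (a sketch; the actual adjustment lives in Lwan's coroutine setup code):

```c
#include <stdint.h>

/* The x86-64 psABI wants RSP 16-byte aligned at the call site; after the
 * call instruction pushes the 8-byte return address, RSP % 16 == 8 at
 * function entry. The coroutine's stack top is therefore aligned down to
 * 16 bytes and then offset by 8, mimicking the state right after a call. */
uintptr_t align_coro_stack(uintptr_t stack_top)
{
    return (stack_top & ~(uintptr_t)0xf) - 8;
}
```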
Sending responses
For some responses, the writev() system call was used so that the response headers and body could be sent over the wire with a single syscall. However, the kernel has to copy the I/O vector array, validate it, and then perform the write operation; this has a relatively high cost, and if one is trying to send a small response, it might not pay off. I hadn't considered this cost, and always wondered why performance wasn't better when writev() was used to build responses (e.g. using multiple struct iovec elements pointing to header names and values instead of building the response header in memory and sending that). I was surprised, then, to learn that by reusing the buffer designated for the response headers, writing the body there (given there was enough space), and using send() instead, the RPS rate increased by ~10%.
void lwan_response(struct lwan_request *request, enum lwan_http_status status)
{
/* ... */
char headers[DEFAULT_HEADERS_SIZE];
/* ... */
char *resp_buf = lwan_strbuf_get_buffer(response->buffer);
const size_t resp_len = lwan_strbuf_get_length(response->buffer);
if (sizeof(headers) - header_len > resp_len) {
/* writev() has to allocate, copy, and validate the response vector,
* so use send() for responses small enough to fit the headers
* buffer. On Linux, this is ~10% faster. */
memcpy(headers + header_len, resp_buf, resp_len);
lwan_send(request, headers, header_len + resp_len, 0);
} else {
struct iovec response_vec[] = {
{.iov_base = headers, .iov_len = header_len},
{.iov_base = resp_buf, .iov_len = resp_len},
};
lwan_writev(request, response_vec, N_ELEMENTS(response_vec));
}
}
To increase the performance of pipelined requests, Lwan will also call send() with the MSG_MORE flag. This should result in fewer TCP segments being sent over the wire, significantly improving the RPS rate (~20% higher, if I'm not mistaken). This is akin to setting the TCP_CORK socket option, but not only is it more portable, it also saves system calls; nonetheless, the flag that controls whether the I/O wrappers will set MSG_MORE still mentions the cork flag. (Linux has TCP autocorking, which in theory should make this optimization useless, but in my limited testing it didn't help that much.) The writev() system call, which is still used for responses too large to fit in the response headers buffer, doesn't take a flags parameter; however, it's possible to use sendmsg() to the same effect (it takes the same I/O vector and has the same behavior as writev() when it comes to short writes on non-blocking sockets).
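A sketch of such a wrapper (hypothetical name; Lwan's real I/O wrappers also deal with yielding on EAGAIN):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* writev() doesn't take flags, but sendmsg() accepts the same I/O vector
 * and honors flags such as MSG_MORE, so it can stand in for writev() on
 * sockets. */
ssize_t writev_with_flags(int fd, struct iovec *iov, int iovcnt, int flags)
{
    struct msghdr msg = {
        .msg_iov = iov,
        .msg_iovlen = iovcnt,
    };

    return sendmsg(fd, &msg, flags);
}
```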
Adding to the "I've been adopting lookup tables whenever I can" comment above: to determine whether a response has a body, instead of a table, the flags in the request struct that encode the HTTP verb also encode this information: if their least significant bit is set, the response should contain a body; otherwise, only headers will be sent. That went from an array lookup to a mere mask: still a table, just encoded differently.
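In sketch form (made-up constants; the real flag values are Lwan's own):

```c
#include <stdbool.h>

/* If the HTTP method constants are chosen so that the least significant
 * bit means "the response carries a body", the lookup table disappears:
 * checking becomes a single mask. */
enum method {
    METHOD_HEAD    = (0 << 1) | 0, /* LSB 0: headers only */
    METHOD_GET     = (1 << 1) | 1, /* LSB 1: body follows */
    METHOD_POST    = (2 << 1) | 1,
    METHOD_OPTIONS = (3 << 1) | 0,
};

bool response_has_body(enum method method)
{
    return method & 1;
}
```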
CPU topology aware thread pinning / scheduling
One of the things Lwan has done since the beginning is allocate a big chunk of memory to contain information about client connections. Indexed by the file descriptor, looking up an element is quick and efficient; however, this also makes false sharing possible (the struct is exactly 32 bytes on x86-64, so two fit in a cache line), so connections are scheduled to threads in a way that avoids it.
Previously, however, this scheduling assumed that the CPU topology of my personal laptop was the gold standard. On anything different, it would probably make the problem even worse. (I'm not kidding when I say that this project is but a toy.)
Now, at least on x86-64 Linux (as it relies on information exposed by sysfs rather than messing around with cpuid and the like), Lwan reads the CPU topology and uses that information not only to pre-schedule all connections correctly (shaving off a little bit of the work necessary whenever one is accepted), but also to pin each thread to the correct logical CPU.
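As an example of the sysfs format involved (a sketch; Lwan's parser handles more than this), files such as /sys/devices/system/cpu/cpu0/topology/thread_siblings_list contain strings like "0,4" or "0-3", which a tiny helper can decode:

```c
#include <stdio.h>

/* Parse the first one or two CPU numbers from a sysfs siblings list,
 * which looks like "0,4" (pair) or "0-3" (range); a single number means
 * the CPU has no hyperthread sibling. Returns how many numbers were
 * found. */
int parse_siblings(const char *list, int out[2])
{
    int first, second;

    if (sscanf(list, "%d,%d", &first, &second) == 2 ||
        sscanf(list, "%d-%d", &first, &second) == 2) {
        out[0] = first;
        out[1] = second;
        return 2;
    }
    if (sscanf(list, "%d", &first) == 1) {
        out[0] = out[1] = first;
        return 1;
    }
    return 0;
}
```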
Continuous Fuzz-Testing
Writing your own parser, especially for something connected to a network socket, should always raise suspicions. Parsers can be tricky to get right, and it's very easy to accidentally trigger a footgun when writing one in a language like C.
Nevertheless, I still do it; this project is but a hobby to me, and running with scissors is part of why I enjoy it. As careful as I am (or would like to think I am), it's still developed while I'm tired from work or after I've had a few beers; in other words, I needed to test the project methodically. I had used fuzz-testers before, but never let them run overnight or for extended periods of time; my main machine is a laptop, and overheating is an issue.
Much to my surprise, then, the folks at Google accepted my weekend project into OSS-Fuzz. So far, it has found a few issues, which were promptly fixed; and although it's still hard at work, generating some heat somewhere, the parser has proven to be pretty resilient. I'm pretty happy with the results.
This, of course, only tests the request parser, and there's more to a web server than that; some of it isn't that easy to fuzz-test (at least not in an automated way). Most of the code being fuzzed is the more important (and network-facing) request parsing code; coverage for this is around 15%, which is pretty decent, all things considered. The configuration file parser is also being fuzzed, although that work has just recently started (I don't have a lot of information on it yet). Time permitting, I'll add more fuzzers to the mix.
As I write this, over 56 trillion tests have been executed in the past 30 days (counting only the tests with libFuzzer, although AFL is also used), which is nothing short of amazing.
Portability
The astute reader, or at least one who has been following Lwan for a while, might have noticed that portability has been mentioned a few times in this article. This wasn't a concern a few years ago, but since then Lwan has been ported to BSD systems as well (mostly by implementing epoll on top of kqueue, and providing a sendfile I/O wrapper that works regardless of the underlying system), although it has only been tested on FreeBSD, macOS, and OpenBSD. (Recent OpenBSD releases might require some tinkering, as they require the stack pointer to be within pages mapped with a specific flag.)
You can read more about how portability has been achieved by reading my blog post on using the non-standard #include_next preprocessor directive, which saved me from writing abstraction layers.
Declaring new Lua metamethods
Another change that might improve the comfort of people using Lwan with Lua scripts is that it's now easier to add new metamethods to the request table.
One just declares a new C function with the LWAN_LUA_METHOD macro,
and, during startup, Lwan will attach that function to the request
metatable as a metamethod. (One can call
lwan_lua_get_request_from_userdata() to obtain a
struct lwan_request * from the first parameter in the C
implementation of one of these metamethods.)
LWAN_LUA_METHOD(say)
{
struct lwan_request *request = lwan_lua_get_request_from_userdata(L);
size_t response_str_len;
const char *response_str = lua_tolstring(L, -1, &response_str_len);
lwan_strbuf_set_static(request->response.buffer, response_str,
response_str_len);
lwan_response_send_chunk(request);
return 0;
}
This macro works by adding a struct to a certain section in the executable, so new application-specific metamethods can be linked together with Lwan without having to modify Lwan itself.
#define LWAN_LUA_METHOD(name_) \
static int lwan_lua_method_##name_(lua_State *L); \
static const struct lwan_lua_method_info \
__attribute__((used, section(LWAN_SECTION_NAME(lwan_lua_method)))) \
lwan_lua_method_info_##name_ = {.name = #name_, \
.func = lwan_lua_method_##name_}; \
static int lwan_lua_method_##name_(lua_State *L)
All functions in lwan-lua.c are now implemented this way, serving as examples of how to use this feature.
Declaring Handlers and Modules
To make declaring a handler function in Lwan more portable, safer, and easier, a macro similar to LWAN_LUA_METHOD() has been provided: LWAN_HANDLER(). It declares a handler function and a struct lwan_handler_info in a specific section of the executable. This makes it easier for the configuration file reader to find handlers regardless of how the platform exports symbols, makes it impossible to refer to an arbitrary exported symbol as a handler from the configuration file, and slightly simplifies declaring a handler function (without requiring, for instance, changing tables to make the configuration file reader happy).
Defining a handler with this macro is trivial:
LWAN_HANDLER(brew_coffee)
{
/* Placeholder handler to force the linker to define __start_lwan_handler and
* __stop_lwan_handler. */
return HTTP_I_AM_A_TEAPOT;
}
As is looking it up (the linker does the job of registering each handler):
__attribute__((no_sanitize_address))
static void *find_handler(const char *name)
{
extern const struct lwan_handler_info SECTION_START(lwan_handler);
extern const struct lwan_handler_info SECTION_END(lwan_handler);
const struct lwan_handler_info *handler;
for (handler = __start_lwan_handler; handler < __stop_lwan_handler;
handler++) {
if (streq(handler->name, name))
return handler->handler;
}
return NULL;
}
A similar feature has been provided for modules, making it even possible
to list them from the Lwan command-line (lwan -m and lwan -H to
list modules and handlers, respectively). However, instead of specifying
a function as with LWAN_HANDLER(), one specifies a
struct lwan_module with LWAN_REGISTER_MODULE().
static const struct lwan_module module = {
.create = serve_files_create,
.create_from_hash = serve_files_create_from_hash,
.destroy = serve_files_destroy,
.handle_request = serve_files_handle_request,
.flags = HANDLER_PARSE_ACCEPT_ENCODING,
};
LWAN_REGISTER_MODULE(serve_files, &module);
In all cases, the only symbol that ends up being visible is the
associated _info struct; the handler function and module struct are
not exported.
Changes to the String Buffer
An HTTP/1.x server is essentially a program that transforms strings into
strings over a network connection, so some sort of facility to create
them efficiently is often desired. In Lwan, this is struct strbuf,
which has seen some trivial changes over the years, mainly to reduce the
amount of memory it needs (as every request struct has to carry one).
It initially had two fields, used and capacity, where
capacity would be always the next power of two after used (and
the buffer would be reallocated accordingly). The capacity field has
now been removed; it’s now derived from used, as it’s cheap to align
to the next power of two to calculate it whenever needed.
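Since capacity is always the next power of two after used, it can be recomputed on demand with a few shifts. Here's a minimal sketch of that computation; the function name is illustrative, not Lwan's actual helper:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Lwan's actual code): derive the capacity
 * from the used count by rounding up to the next power of two, so a
 * separate capacity field doesn't need to be stored. */
size_t align_to_next_power_of_2(size_t n)
{
    if (n < 2)
        return 2; /* illustrative minimum capacity */
    n--;         /* so exact powers of two map to themselves */
    n |= n >> 1; /* smear the highest set bit downwards... */
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
#if SIZE_MAX > UINT32_MAX
    n |= n >> 32; /* only needed on 64-bit size_t */
#endif
    return n + 1; /* ...then bump to the next power of two */
}
```

Since the buffer is only ever grown to power-of-two sizes, this recomputation always agrees with what a stored capacity field would have contained.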
Each connection used to trigger the allocation of a buffer for its
associated request struct strbuf, even if it might never be used
(e.g. for streaming requests, such as file serving). A strbuf is now
initialized pointing to an empty static string (with the STATIC flag
set); this delays the allocation until it's truly needed. The
following table builds on the results from the coroutine optimizations,
optimizing the base case even further:
| Connection | Malloc calls before | Malloc calls after |
|---|---|---|
| Keep-Alive | 931 | 831 |
| Close | 200,731 | 100,728 |
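The STATIC-flag trick can be sketched in a few lines. This is illustrative only; the struct layout and names below are made up and don't match Lwan's actual lwan_strbuf internals:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch: a fresh buffer points at a shared empty static
 * string, and only allocates when first written to. */
struct sbuf {
    char *data;
    size_t used;
    bool is_static;
};

static char empty_str[] = "";

void sbuf_init(struct sbuf *s)
{
    s->data = empty_str; /* no malloc() yet */
    s->used = 0;
    s->is_static = true;
}

bool sbuf_append(struct sbuf *s, const char *str, size_t len)
{
    char *copy = malloc(s->used + len + 1);
    if (!copy)
        return false;
    memcpy(copy, s->data, s->used);
    memcpy(copy + s->used, str, len);
    copy[s->used + len] = '\0';
    if (!s->is_static)
        free(s->data); /* never free the static backing store */
    s->data = copy;
    s->used += len;
    s->is_static = false;
    return true;
}
```

A connection that never writes a response body (a sendfile()-served request, say) never pays for the allocation at all, which is where the halved malloc count in the "Close" row comes from.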
Hash Table Changes
The hash table has also seen some changes, most notably the ability to rehash (although the heuristic that determines whether rehashing is required might need some work). The hash value is kept alongside each bucket element, slightly alleviating the cost of rehashing.
Another simple change that has been implemented is that, when an entry is removed from a bucket, instead of defragmenting that bucket, the last element is copied on top of the element being removed. Since order in a bucket isn’t important, this made removing elements quite a bit more efficient (and given that the hash table is an integral part of the cache subsystem, this is important to keep the write lock locked for the minimum amount of time.)
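The swap-with-last removal is a one-liner. Here's a sketch with an illustrative bucket layout (not Lwan's actual structures):

```c
/* Illustrative bucket layout; Lwan's real hash table differs. */
struct entry {
    unsigned hash;
    int key;
    int value;
};

struct bucket {
    struct entry entries[8];
    unsigned count;
};

/* Order within a bucket doesn't matter, so removal is O(1):
 * overwrite the victim with the last entry instead of shifting
 * everything after it down. */
void bucket_remove_at(struct bucket *b, unsigned idx)
{
    b->entries[idx] = b->entries[--b->count];
}
```

(Removing the last element degenerates into a harmless self-assignment, so no special case is needed.)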
I’ve experimented with other techniques to implement a hash table, including Robin Hood hashing, but I’m keeping it this way for the moment. It’s something I want to revisit someday. (I also played with hashing functions, including using AES-NI instructions instead of the CRC32C from SSE4.2, but never got as far as integrating those experiments into Lwan.)
Gracefully Closing Sockets
The other day I received a bug report where pages served by Lwan and
loaded by W3M would take a long time to load. Not only is this unusual
because someone is actually using Lwan, but also because someone is
actually using W3M in 2019. Nevertheless, this was an interesting bug
with a simple fix: W3M implements only HTTP/1.0, and it won't start
parsing and displaying the results until the server closes the
connection (which would only happen after the keep-alive timeout was
reached, roughly 15s). Lwan now closes the connection as soon as it's
done processing a request, if the connection isn't marked as
keep-alive. Easy to spot using something like strace.
Closing a TCP socket, however, isn’t as simple as just calling
close(): there might be some enqueued packets not yet acknowledged
by the peer, so the usual solution to this is to call
shutdown(fd, SHUT_WR) to stop any kind of transmission and wait on a
loop calling read() until it returns 0 (signaling that the peer
has closed the connection), at which point it’s safe to just call
close() and consider that the connection has been closed.
However, from my testing, the scenario where one actually needs to wait
on a read() loop isn’t that common, especially with the big fat
pipes that are common these days; so, in order to minimize the number of
system calls made in the happy path, Lwan now checks if there are any
pending bytes to be sent/acknowledged by the peer before proceeding with
the usual method. This should equate to 2 system calls to close a
connection (the ioctl() and close()), rather than at least 3
(shutdown() + read() + close()) in the happy path.
static void graceful_close(struct lwan *l,
struct lwan_connection *conn,
char buffer[static DEFAULT_BUFFER_SIZE])
{
int fd = lwan_connection_get_fd(l, conn);
while (TIOCOUTQ) {
/* This ioctl isn't probably doing what it says on the tin; the details
* are subtle, but it seems to do the trick to allow gracefully closing
* the connection in some cases with minimal system calls. */
int bytes_waiting;
int r = ioctl(fd, TIOCOUTQ, &bytes_waiting);
if (!r && !bytes_waiting) /* See note about close(2) below. */
return;
if (r < 0 && errno == EINTR)
continue;
break;
}
if (UNLIKELY(shutdown(fd, SHUT_WR) < 0)) {
if (UNLIKELY(errno == ENOTCONN))
return;
}
for (int tries = 0; tries < 20; tries++) {
ssize_t r = read(fd, buffer, DEFAULT_BUFFER_SIZE);
if (!r)
break;
if (r < 0) {
switch (errno) {
case EINTR:
continue;
case EAGAIN:
coro_yield(conn->coro, CONN_CORO_WANT_READ);
continue;
default:
return;
}
}
coro_yield(conn->coro, CONN_CORO_YIELD);
}
/* close(2) will be called when the coroutine yields with CONN_CORO_ABORT */
}
(On platforms where that ioctl() isn’t available, the usual method
is used instead. TIOCOUTQ is Linux-specific, but it’s defined to
0 or the equivalent value in other OSes with some #include_next
magic, so that the while() loop there works as a loop to both handle
the ioctl() call being interrupted and check if that ioctl is
available on that platform.)
MIME Type table improvements
Lwan contains an internal MIME Type table based on the public domain
table made for the Apache httpd project. The same file is used by
mimegen, a program built and used only during build time, that
generates a header file containing data suitable to be searched with
bsearch(). (The data in the header file is also compressed, saving a
few dozen kilobytes in the final executable.)
The table was initially laid out as an array of
struct { const char *extension; const char *mime_type; }, all
pointing to different positions within a character array. In other
words, not only could a mere 4 entries fit in a cache line (assuming
x86-64 here, with 8-byte pointers and 64-byte cache lines), but each
access also meant pulling in additional cache lines to reach the
character array.
With some trivial changes, this has been significantly improved: instead
of having a single table with extension+type, two (in-sync) tables are
provided. Search happens only in the first table; once an item is found
there, its position within the first table is used as an index in the
second table. This greatly reduces the cache pressure (each item in the
first table is fixed at 8 characters, so double the cache density for
bsearch() to breeze through quickly).
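The two in-sync tables can be sketched like this. The entries below are illustrative; the real tables are generated at build time by mimegen:

```c
#include <stdlib.h>
#include <string.h>

/* Fixed 8-byte keys in one dense, bsearch()-friendly array; the MIME
 * type lives at the same index in a second array. */
static const char extensions[][8] = {"css", "gif", "html", "png"};
static const char *const types[] = {"text/css", "image/gif",
                                    "text/html", "image/png"};

static int cmp_ext(const void *a, const void *b)
{
    return strncmp((const char *)a, (const char *)b, 8);
}

const char *lookup_mime(const char *ext)
{
    char key[8] = {0}; /* zero-pad so 8-byte compares work */

    strncpy(key, ext, sizeof(key) - 1);

    const char (*found)[8] =
        bsearch(key, extensions, sizeof(extensions) / sizeof(extensions[0]),
                sizeof(extensions[0]), cmp_ext);

    /* The index in the first table selects the entry in the second. */
    return found ? types[found - extensions] : "application/octet-stream";
}
```

Because the search only ever touches the dense array of 8-byte keys, each cache line holds 8 candidates instead of 4 pointer pairs that each redirect elsewhere.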
The function that looks up a MIME type given a file name hasn't changed much since 2014:
const char *
lwan_determine_mime_type_for_file_name(const char *file_name)
{
char *last_dot = strrchr(file_name, '.');
if (UNLIKELY(!last_dot))
goto fallback;
STRING_SWITCH_L(last_dot) {
case STR4_INT_L('.','j','p','g'): return "image/jpeg";
case STR4_INT_L('.','p','n','g'): return "image/png";
case STR4_INT_L('.','h','t','m'): return "text/html";
case STR4_INT_L('.','c','s','s'): return "text/css";
case STR4_INT_L('.','t','x','t'): return "text/plain";
case STR4_INT_L('.','j','s',0x20): return "application/javascript";
}
if (LIKELY(*last_dot)) {
char *extension, *key = last_dot + 1;
extension = bsearch(key, uncompressed_mime_entries, MIME_ENTRIES, 8,
compare_mime_entry);
if (LIKELY(extension))
return mime_types[(extension - (char*)uncompressed_mime_entries) / 8];
}
fallback:
return "application/octet-stream";
}
This idea is far from novel; in fact, it’s common in video-game development and is known as data-oriented design.
(I investigated the possibility of using gperf here instead, but decided against it as this would require two build-time programs instead of one for this feature.)
Accepting Clients
While it’s possible to wake a worker thread blocked on epoll_wait()
by adding a file descriptor to its watched set (by watching EPOLLOUT
instead of EPOLLIN, even though you want to read from the socket
before sending a response, as counter-intuitive as this might sound), I
found that this isn’t the best approach. (Don’t know why yet, though.)
Until recently, Lwan was using a pipe to send the file descriptor number
from the main thread (which accepts the connection) to the worker thread
that would forever own it. This worked well, but meant that every new
connection would require at least two system calls that were unrelated
to actually handling the connections: accept4() and write().
While pipes aren’t exactly slow (they’re just buffers in the kernel
anyway), it’s still a lot of overhead to write 4 bytes to another thread
in the same process.
Recent versions of Lwan use a different approach: a lock-free
single-producer/single-consumer queue and an eventfd are used to
queue file descriptors until the "herd" has passed (or the queue gets
full). (While the main thread is accepting connections, it's handling a
"herd"; once accept4() returns an EAGAIN error, the herd is
gone and the main thread can proceed.) The eventfd is then used to wake
up the worker thread ("nudging" in Lwan jargon), at which point it
proceeds to add the file descriptors to its epoll set, create the
associated coroutines, and so on.
In lwan-thread.c, we define a function that tries adding a new
client to a worker thread a few times, dropping the connection if that
worker thread is somehow hosed even after repeated attempts at nudging
it:
void lwan_thread_add_client(struct lwan_thread *t, int fd)
{
for (int i = 0; i < 10; i++) {
bool pushed = spsc_queue_push(&t->pending_fds, fd);
if (LIKELY(pushed))
return;
/* Queue is full; nudge the thread to consume it. */
lwan_thread_nudge(t);
}
lwan_status_error("Dropping connection %d", fd);
/* FIXME: send "busy" response now, even without receiving request? */
close(fd);
}
And in lwan.c, you can see how the coroutine that handles incoming
connections batches each incoming herd and minimizes the number of
nudges needed to wake up the worker threads:
static ALWAYS_INLINE int schedule_client(struct lwan *l, int fd)
{
struct lwan_thread *thread = l->conns[fd].thread;
lwan_thread_add_client(thread, fd);
/* Connections are pre-scheduled, but we need a thread index, not a
* pointer to a struct lwan_thread. */
return (int)(thread - l->thread.threads);
}
/* Using -1, 0, and 1 for enumeration values allows you to test them
* using only comparisons with 0. */
enum herd_accept { HERD_MORE = 0, HERD_GONE = -1, HERD_SHUTDOWN = 1 };
struct core_bitmap {
uint64_t bitmap[4]; /* 256 processors should be enough for everybody */
};
static ALWAYS_INLINE enum herd_accept
accept_one(struct lwan *l, struct core_bitmap *cores)
{
int fd = accept4((int)main_socket, NULL, NULL, SOCK_NONBLOCK | SOCK_CLOEXEC);
if (LIKELY(fd >= 0)) {
int core = schedule_client(l, fd);
cores->bitmap[core / 64] |= UINT64_C(1)<<(core % 64);
return HERD_MORE;
}
switch (errno) {
case EAGAIN:
return HERD_GONE;
case EBADF:
case ECONNABORTED:
case EINVAL:
if (main_socket < 0) {
lwan_status_info("Signal 2 (Interrupt) received");
} else {
lwan_status_info("Main socket closed for unknown reasons");
}
return HERD_SHUTDOWN;
default:
lwan_status_perror("accept");
return HERD_MORE;
}
}
static int accept_connection_coro(struct coro *coro, void *data)
{
struct lwan *l = data;
struct core_bitmap cores = {};
while (coro_yield(coro, 1) & ~(EPOLLHUP | EPOLLRDHUP | EPOLLERR)) {
enum herd_accept ha;
do {
ha = accept_one(l, &cores);
} while (ha == HERD_MORE);
if (UNLIKELY(ha > HERD_MORE))
break;
/* A thread bitmap is maintained: accept_one() will set the nth bit
* to signify that the nth thread needs to be nudged. This loop
* will then quickly go through every set bit in that bitmap and
* nudge the appropriate thread. */
for (size_t i = 0; i < N_ELEMENTS(cores.bitmap); i++) {
for (uint64_t c = cores.bitmap[i]; c; c ^= c & -c) {
size_t core = (size_t)__builtin_ctzl(c);
lwan_thread_nudge(&l->thread.threads[i * 64 + core]);
}
}
memset(&cores, 0, sizeof(cores));
}
return 0;
}
On platforms other than Linux, a pipe is used to the same effect; it just doesn't scale as well (it requires two file descriptors per worker thread instead of just one, allocates a larger kernel buffer for no purpose whatsoever, etc.).
Weirdly enough, although adding the client sockets directly to the
worker thread's epoll set with an EPOLLOUT event (rather than using
this queue+eventfd mechanism) reduced the number of system calls, the
throughput with non-keep-alive connections was measurably reduced (and
not just within the noise). I don't know why, and haven't investigated
this yet.
(The current approach limits the number of worker threads that Lwan can spawn, as threads are only nudged when connections have been assigned to them. The limit is currently 256 threads, which is fine for many systems today – and certainly more cores than any machine I have access to.)
Sample programs
FreeGeoIP
One of the first sample applications that I've written for Lwan was an
implementation of the freegeoip.net service. It's been working fine
so far, serving a few hundred thousand requests per day (down from a
few million per day), using around 3MB of memory. It's also pretty
stable: the server was recently rebooted to update the kernel and,
before this happened, I observed that the service had been chugging
along for over a year.
I’m pretty pleased with this, especially if one considers that the original service that this has been cloned from caused a lot of maintenance headache and was eventually abandoned.
Clock
A newer sample is the clock application. This generates a never-ending GIF file, served with chunked encoding, that draws the current time in a variety of styles: something that resembles a 7-segment display; xdaliclock; and a Tetris-like animation (where falling blocks are rotated until they fit and form digits).
This is but a hack, so it doesn’t work in all browsers (it’s known to be broken on Safari for instance), but in supported ones, it’s a cheap way to make animations without JavaScript, or to stream content from a server.
Look at those melting digits! No JavaScript or CSS required.
(And, of course, the “never-ending” aspect isn’t actually correct. Bots would try to download the GIFs, without any kind of timeout. I’ve seen bots trying to download those for days. It now forces the page to be reloaded every hour and limits each GIF to a little bit more than that. Implement timeouts when writing crawlers, people.)
Proxy Protocol
Putting Lwan behind a proxy is a common scenario to work around the lack
of two features: virtual hosts and TLS. Both are somewhat trivial to
implement, but I haven't gotten around to supporting either because,
in my use case, I'm perfectly happy to stick a
Varnish cache and a TLS
terminator in front of it. In order to allow
functions such as lwan_request_get_remote_address() to return the
correct client address (instead of, say, ::1), Lwan has to be aware
that it's being proxied.
A popular protocol for this, pioneered by HAProxy, is aptly named PROXY protocol. An implementation for both versions 1 and 2 has been contributed to Lwan by Malthe Borch as one of the first open source contributions to Lwan; thank you very much! (This is disabled by default because it should only be used if Lwan is known to be behind a reverse proxy. Inadvertently enabling it would allow anyone to easily spoof the client IP address to request handlers.)
(I did investigate using kTLS, but didn’t go that far at the time as it wasn’t part of mainline Linux kernel. I might give it a try someday now that it is, though.)
WebSockets
Lwan has also gained the ability to function as a WebSockets server. The protocol is trivial to implement (wonky handshaking and suboptimal framing notwithstanding), but defining an API that's usable from C is quite challenging. I've tried a few combinations but never got to the point where I found something I liked; I have some ideas to try, but they're still pretty fuzzy in my head.
The program below illustrates the current state of the API: while it seems straightforward enough for an endpoint to check if a WebSocket connection upgrade was requested by the client, and trivial to send stuff over the wire, there are a few drawbacks with the current situation:
- PING requests will only be processed by lwan_response_websocket_read(); this example will never respond to a PING packet and might be disconnected by a client.
- Both the read and the write primitives are blocking; it's not possible to write something that both waits for commands and occasionally sends data over the wire. It should be possible to, for instance, write an echo server, but not to implement something like socket.io.
There are currently no tests for WebSockets either; testing is performed manually with the WebSockets inspection tool in web browsers, but automated tests would be preferred. I suspect that reading from a WebSocket is broken after a lot of unrelated changes in the main loop.
LWAN_HANDLER(ws)
{
/* Requesting an upgrade will send an appropriate response if the
* request contained a valid handshake, returning HTTP code 101 in that
* case. Any other error code won't generate a default response,
* so the handler can return at this point.
*
* Having a separate function to upgrade a WebSocket connection
* (as opposed to having a flag in a handler that would try to do
* this automatically) is useful to allow people to write, for instance,
* authorization code to explicitly upgrade a connection only if
* certain checks passes. */
enum lwan_http_status status = lwan_request_websocket_upgrade(request);
if (status != HTTP_SWITCHING_PROTOCOLS)
return status;
while (true) {
/* Similar to chunked encoding and server-sent events, the response
* string buffer contains the data that is going to the wire, or
* data that was received from the wire. The strbuf is reset every
* time a response is sent or received. */
lwan_strbuf_printf(response->buffer, "Some random integer: %d", rand());
lwan_response_websocket_write(request);
lwan_request_sleep(request, 1000);
}
return HTTP_OK;
}
Rewrite Module
Inspired by Apache httpd’s mod_rewrite, a module with similar
functionality has been implemented in Lwan. Instead of a bespoke syntax,
though, it can be set up using the same configuration file syntax;
albeit more verbose, it’s significantly easier to understand.
It is based on Lua's pattern-matching engine, chosen not only because it's powerful enough, but also because it behaves like a DFA, which makes it a lot harder for an unbounded backreference to DoS the server (it doesn't support backreferences at all).
This is how it looks in the configuration file:
# Instantiate the "rewrite" module to respond in "/pattern".
rewrite /pattern {
# Match /patternfoo..., and redirect to a new URL.
pattern foo/(%d+)(%a)(%d+) {
redirect to = /hello?name=pre%2middle%3othermiddle%1post
}
# Match /patternbar/... and rewrite as /hello...
pattern bar/(%d+)/test {
rewrite as = /hello?name=rewritten%1
}
# Use Lua to determine where the client should redirect to
pattern lua/redir/(%d+)x(%d+) {
expand_with_lua = true
redirect to = '''
function handle_rewrite(req, captures)
local r = captures[1] * captures[2]
return '/hello?name=redirected' .. r
end
'''
}
# Use Lua to determine how to rewrite the request
pattern lua/rewrite/(%d+)x(%d+) {
expand_with_lua = true
rewrite as = """function handle_rewrite(req, captures)
local r = captures[1] * captures[2]
return '/hello?name=rewritten' .. r
end"""
}
}
Without expand_with_lua set (or set to false), the expansion
rule is trivial: %n will expand to the n-th capture; everything else
will be copied verbatim. When set, a handle_rewrite() function has
to be defined; the req parameter contains the same metamethods
available for handlers in the lua module, and captures is a
table containing the pattern matches.
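The non-Lua expansion rule is simple enough to sketch in a few lines. This is illustrative only; Lwan's actual implementation differs:

```c
#include <string.h>

/* Sketch of the trivial expansion rule: "%n" expands to the n-th
 * capture, everything else is copied verbatim. */
void expand(char *out, size_t out_len, const char *pattern,
            const char *const captures[], size_t n_captures)
{
    size_t o = 0;

    for (const char *p = pattern; *p && o + 1 < out_len; p++) {
        if (*p == '%' && p[1] >= '1' && p[1] <= '9') {
            size_t idx = (size_t)(p[1] - '1');

            if (idx < n_captures) {
                size_t len = strlen(captures[idx]);

                if (o + len >= out_len)
                    break;
                memcpy(out + o, captures[idx], len);
                o += len;
                p++; /* skip the capture digit */
                continue;
            }
        }
        out[o++] = *p; /* verbatim copy */
    }

    out[o] = '\0';
}
```

For example, with captures {"42", "x"}, the template "/hello?name=pre%2mid%1" expands to "/hello?name=prexmid42", matching the %n semantics described above.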
As a measure against bad configuration, URLs can't be rewritten more than 4 times (otherwise, a 500 Internal Server Error response is generated instead).
(This module prompted a change in the configuration file parser that allows a section to be "isolated": while the file is being read, upon a section start line, one can ask the configuration reader to isolate that section. This creates a proxy configuration struct with a view of only that particular section. That isolated configuration object is passed to a module, which can read it as usual without having to worry about overreading the main configuration.)
Snippets
Non-boolean predicates
C lacks option types, but sometimes you can get creative with what you have. For instance, I like how elegant this small piece of code, which determines the temporary directory, turned out:
static const char *is_dir(const char *v)
{
struct stat st;
if (!v)
return NULL;
if (*v != '/')
return NULL;
if (stat(v, &st) < 0)
return NULL;
if (!S_ISDIR(st.st_mode))
return NULL;
if (!(st.st_mode & S_ISVTX)) {
lwan_status_warning(
"Using %s as temporary directory, but it doesn't have "
"the sticky bit set.",
v);
}
return v;
}
static const char *
get_temp_dir(void)
{
const char *tmpdir;
tmpdir = is_dir(secure_getenv("TMPDIR"));
if (tmpdir)
return tmpdir;
tmpdir = is_dir(secure_getenv("TMP"));
if (tmpdir)
return tmpdir;
tmpdir = is_dir(secure_getenv("TEMP"));
if (tmpdir)
return tmpdir;
tmpdir = is_dir("/var/tmp");
if (tmpdir)
return tmpdir;
tmpdir = is_dir(P_tmpdir);
if (tmpdir)
return tmpdir;
return NULL;
}
The is_dir() predicate returns its input parameter if it turns out
that its condition holds (or NULL if it doesn't), and it is used both
to drive the conditions in get_temp_dir() and as the return value
for that function. I don't think this could get any cleaner.
(I was actually going to write a blog post on how to create temporary files safely – which these functions are part of – but Lennart Poettering beat me to it.)
Boolean flags to bitmask without branching
Sometimes, it's necessary to convert a set of boolean flags into a bit mask. Something like this:
if (flag1) flags |= SOME_BIT_MASK1;
if (flag2) flags |= SOME_BIT_MASK2;
In order to get rid of those branches, one can extend the bool type
into typeof(flags), and shift it by log2(mask). Lwan does this
with some flags:
#define REQUEST_FLAG(bool_, name_) \
((enum lwan_request_flags)(((uint32_t)lwan->config.bool_) \
<< REQUEST_##name_##_SHIFT))
static_assert(sizeof(enum lwan_request_flags) == sizeof(uint32_t),
"lwan_request_flags has the same size as uint32_t");
Which is then used like so:
enum lwan_request_flags flags =
REQUEST_FLAG(proxy_protocol, ALLOW_PROXY_REQS) |
REQUEST_FLAG(allow_cors, ALLOW_CORS);
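Outside of Lwan's macros, the same trick can be sketched in a few lines; the flag names below are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flags: each mask is 1 shifted by a known amount, so a
 * bool widened to uint32_t and shifted by log2(mask) lands exactly on
 * its bit -- no branch required. */
enum {
    FLAG_A_SHIFT = 0,
    FLAG_B_SHIFT = 1,
    FLAG_A = UINT32_C(1) << FLAG_A_SHIFT,
    FLAG_B = UINT32_C(1) << FLAG_B_SHIFT,
};

uint32_t make_flags(bool a, bool b)
{
    /* bool is guaranteed to be 0 or 1, so the shift either produces
     * the mask or 0; OR-ing the results builds the full bit mask. */
    return ((uint32_t)a << FLAG_A_SHIFT) | ((uint32_t)b << FLAG_B_SHIFT);
}
```

Because bool in C is guaranteed to hold only 0 or 1, the widened value shifted into position is exactly the mask (or zero), which is what makes the branchless form safe.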
include_next and portability
One of the decisions I took early on while writing Lwan was to only support Linux, and think about portability later; this decision was influenced by the way the OpenBSD project approaches portability.
In retrospect, this was a good decision: it avoided many of the pitfalls associated with writing abstractions too early in the game. It also made the code cleaner: the abundance of C preprocessor usage, common in some portable code, hinders legibility and maintainability. Of course, this decision made porting to other operating systems challenging.
I was content with this decision – until people began asking for BSD and Mac
ports. With the exception of some system calls (e.g. epoll, or the Linux
sendfile variant), porting shouldn't hold many surprises. Still, I wanted to
keep the code largely #ifdef-free, so I had to find a way to make that happen.
While reading the GCC manual, I found out about an extension – one that also happens to be implemented by Clang – that fit this scenario perfectly: wrapper headers. It's a C preprocessor extension that includes the next file in the include lookup path. With this extension, it's possible to write our own substitute header files, named after standard header files:
#include_next <stdlib.h> /* Include stdlib.h from system includes */
#ifndef MISSING_STDLIB_H_
#define MISSING_STDLIB_H_
#if !defined(HAVE_MKOSTEMP)
int mkostemp(char *tmpl, int flags);
#endif
#if !defined(HAVE_REALLOCARRAY)
void *reallocarray(void *optr, size_t nmemb, size_t size);
#endif
#endif /* MISSING_STDLIB_H_ */
Have it in a directory named, say, “missing”, and modify the header lookup path
so it is looked up first by the compiler. This is easily accomplished in CMake
by specifying an include directory with the BEFORE option:
include_directories(BEFORE src/lib/missing)
(This just ensures that src/lib/missing will be passed before any other -I
argument to the compiler, regardless of the order any other
include_directories() macro is invoked. Your build system might differ, this
is copied straight from Lwan’s.)
Then it's just a matter of implementing these functions in terms of other
functions available in the system, and code using them will be none the wiser: a
#include <stdlib.h> line will include our wrapper header, which in turn
includes the system's stdlib.h header; it then might define, as in this
example, additional prototypes, based on what the build system could determine
during the configuration phase.
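For instance, a fallback reallocarray() might be implemented along these lines (a sketch; Lwan's actual implementation in its missing/ directory may differ, and the HAVE_REALLOCARRAY guard name follows the wrapper header above):

```c
#include <errno.h>
#include <stdlib.h>

/* Compiled only when the configuration phase didn't find the real
 * function in the system's libc. */
#if !defined(HAVE_REALLOCARRAY)
void *reallocarray(void *optr, size_t nmemb, size_t size)
{
    /* Reject multiplications that would overflow size_t; this check is
     * the whole reason reallocarray() exists over plain realloc(). */
    if (size && nmemb > (size_t)-1 / size) {
        errno = ENOMEM;
        return NULL;
    }
    return realloc(optr, nmemb * size);
}
#endif
```

Callers keep writing reallocarray(p, n, sz) everywhere; on systems that already have it, the guard compiles this fallback away entirely.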
This way, most #ifdefs are hidden away in a single file, making it a lot easier
to maintain and read the code. No application-specific abstraction layer with
quirky semantics; just the familiar quirkiness from POSIX.
One of the things I'm particularly proud of is the miniature epoll(7) implementation on top of kqueue (available in BSD systems). I considered moving Lwan to an abstraction library (such as libevent or libuv) just for this, but was able to keep using its event-handling loop as is. Not only do I understand 100% of it, it was a worthwhile learning experience. At ~120 lines of C, this epoll implementation is much easier to wrap my head around than the thousands of lines of code in those libraries.
More on string switch in C
Talking about uncommon programming tricks the other day, the subject of switching on strings in C came up.
If you follow this blog, you know it's actually possible to do with a little bit of ingenuity; in fact, it's one of the tricks Lwan uses to parse HTTP requests. I didn't spend much time in that blog post explaining why it is faster, so I'm rectifying that now.
In order to understand why it’s so fast, let me step aside for a moment and
show a function every C programmer should be able to write: strlen().
size_t strlen(const char *s) {
size_t len = 0;
while (*s != '\0') {
len++;
s++;
}
return len;
}
Style issues aside, this is pretty much the simplest way to implement this function – and maybe the one that generates the slowest code. There are many reasons for this, so let's explore some of them.
One of them is that CPUs can fetch more than a single byte at a time
from memory (or cache), and it takes roughly the same time to fetch
8 bits as it does to fetch 32 bits. People who write C libraries know
this, so the version your operating system provides is most likely
going to exploit it. Let's rewrite the strlen() function,
then:
size_t strlen(const char *s) {
size_t len = 0;
while (true) {
uint32_t v = *(uint32_t *)s;
if ((v & 0xff) == '\0') return len;
if (((v >> 8) & 0xff) == '\0') return len + 1;
if (((v >> 16) & 0xff) == '\0') return len + 2;
if (((v >> 24) & 0xff) == '\0') return len + 3;
len += 4;
s += 4;
}
}
There is a big issue in this code: it is invoking undefined behavior. The explanation of why this is illegal in C will follow, but for now, let’s just assume that the integer cast from a character pointer is valid.
With that assumption in mind, what this program is doing is in fact very simple: it is reducing the amount of expensive operations (fetching things from memory), and is increasing the amount of cheap operations (masking bits and comparing integers). In fact, that’s a recurring theme whenever you’re trying to optimize any algorithm: the computer can’t run a piece of code faster, but you can make the computer run smarter code that produces the same output.
This program, however, will most likely crash on some platforms, or be really slow on others. The reason is that it is trying to read a pointer that is not aligned. Some CPU designers decided to increase the complexity and perform more work behind the scenes to make this work, and some decided it wasn’t worth the trouble and will just generate an exception. The major problem, however, is that this is precisely the undefined behavior I was talking about. So, let’s fix this function, by modifying it slightly so that the fast path operates on aligned pointers:
static inline bool is_ptr_aligned(const void *ptr) {
uintptr_t p = (uintptr_t)ptr;
/* Assuming a 32-bit machine with 4-byte alignment for
* integers. Executing "AND (n - 1)" is equivalent to "MOD
* (n)", without an expensive division operation; this is
* true for every (n), as long as (n) is a power of 2.
* Compilers can do this optimization on constant values,
* but I prefer to be explicit. */
return (p & 0x3) == 0;
}
size_t strlen(const char *s) {
size_t len = 0;
/* Read one byte at a time until the pointer is aligned. */
while (!is_ptr_aligned(s)) {
if (*s == '\0')
return len;
s++;
len++;
}
/* Pointer is aligned, try the faster version now. */
while (true) {
uint32_t v = *(uint32_t *)s;
if ((v & 0xff) == '\0') return len;
if (((v >> 8) & 0xff) == '\0') return len + 1;
if (((v >> 16) & 0xff) == '\0') return len + 2;
if (((v >> 24) & 0xff) == '\0') return len + 3;
len += 4;
s += 4;
}
}
There is a still faster way: improve how the NUL byte is found in a word. The excellent "Bit Twiddling Hacks" page (one of my favorite websites) has a method to find out if a word contains a NUL byte; it doesn't tell which byte is NUL, but we don't need to know that in the fast path:
size_t strlen(const char *s) {
size_t len = 0;
uint32_t v;
/* Read one byte at a time until the pointer is aligned. */
while (!is_ptr_aligned(s)) {
if (*s == '\0')
return len;
s++;
len++;
}
/* Pointer is aligned, try the faster version now. */
while (true) {
v = *(uint32_t *)s;
/* Use a fast bit twiddling hack to find if the
* next 4 bytes in the string has a 0 byte. If
* it does, find out which byte it is. */
if ((v - 0x01010101) & ~v & 0x80808080)
break;
len += 4;
s += 4;
}
if ((v & 0xff) == '\0') return len;
if (((v >> 8) & 0xff) == '\0') return len + 1;
if (((v >> 16) & 0xff) == '\0') return len + 2;
return len + 3;
}
Another thing to consider is that these functions aren't endian neutral. A decent implementation would ensure that it works both on little-endian and big-endian machines. A decent programmer would even build and test these; I didn't.
Yet another thing to consider is that hand-tuned assembly versions, most likely written to make use of vector instructions, are the ones your computer is executing at this very instant; but they all draw from these same ideas: read memory less often, compare in bulk.
Also, there are most likely better ways to write these functions, even without vector instructions. But that's beside the point of explaining why the string switch trick works so well.
Now, this kind of optimization happens on pratically all string handling functions
in the C standard library. And functions that perform substring comparison,
such as strncmp(), are no exception.
When faced with the necessity to check for a bunch of strings, the idiomatic C way of doing so would be the following:
if (!strncmp(s, "GET ", 4))
    return HTTP_METHOD_GET;
if (!strncmp(s, "POST ", 5))
    return HTTP_METHOD_POST;
if (!strncmp(s, "DELETE ", 7))
    return HTTP_METHOD_DELETE;
if (!strncmp(s, "HEAD ", 5))
    return HTTP_METHOD_HEAD;
/* ... */
return HTTP_METHOD_UNKNOWN;
It’s not difficult to imagine that each invocation of strncmp() has to do things similar to what our toy strlen() implementation did: align the pointer (which is slow) before starting the fast path. In this case things are even worse: if the pointer isn’t aligned, the fast path may never even kick in, because the strings being compared are barely longer than the word size of this computer!
So, to recap, what the string switch does is the following:
static inline uint32_t string_to_uint32(const char *s) {
    uint32_t v;

    memcpy(&v, s, sizeof(v));
    return v;
}

#define STRING_SWITCH(s) switch (string_to_uint32(s))

#define MCC(a, b, c, d) ((a) << 24 | (b) << 16 | (c) << 8 | (d))

enum {
    METHOD_GET = MCC('G', 'E', 'T', ' '),
    METHOD_POST = MCC('P', 'O', 'S', 'T'),
    METHOD_DELETE = MCC('D', 'E', 'L', 'E'),
    METHOD_HEAD = MCC('H', 'E', 'A', 'D'),
    /* ... */
};

STRING_SWITCH(s) {
case METHOD_GET:
    return HTTP_METHOD_GET;
case METHOD_POST:
    return HTTP_METHOD_POST;
case METHOD_DELETE:
    return HTTP_METHOD_DELETE;
case METHOD_HEAD:
    return HTTP_METHOD_HEAD;
/* ... */
default:
    return HTTP_METHOD_UNKNOWN;
}
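Pieced together into a compilable sketch, it looks like the following. Note the caveat from the strlen() discussion applies here too: string_to_uint32() reads bytes in memory order, so the multichar constant must be built to match the host’s endianness. This sketch assumes a little-endian host (the MCC_LE name and the enum values are mine):

```c
#include <stdint.h>
#include <string.h>

static inline uint32_t string_to_uint32(const char *s)
{
    uint32_t v;

    memcpy(&v, s, sizeof(v));
    return v;
}

/* On a little-endian host, the first character of the string lands in
 * the low byte of the register, so the constant is built low-to-high.
 * A portable version would pick the shift order with a byte-order check. */
#define MCC_LE(a, b, c, d)                                            \
    ((uint32_t)(a) | (uint32_t)(b) << 8 | (uint32_t)(c) << 16 |       \
     (uint32_t)(d) << 24)

enum http_method {
    HTTP_METHOD_GET,
    HTTP_METHOD_POST,
    HTTP_METHOD_UNKNOWN,
};

/* s must have at least 4 readable bytes, which a request line does. */
static enum http_method parse_method(const char *s)
{
    switch (string_to_uint32(s)) {
    case MCC_LE('G', 'E', 'T', ' '):
        return HTTP_METHOD_GET;
    case MCC_LE('P', 'O', 'S', 'T'):
        return HTTP_METHOD_POST;
    default:
        return HTTP_METHOD_UNKNOWN;
    }
}
```

Compile this with optimizations on and the whole function boils down to a couple of immediate compares; no loops, no calls.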
The good thing about the switch statement in C is that it is maybe the
highest level statement in the language: the compiler can get really
creative in how its code is generated. It’s not uncommon for it to generate
jump tables or even binary searches. So this implementation would actually
be faster than the various calls to strncmp() because:
Comparing integers is dirt cheap.
The compiler knows what memcpy() does, so on architectures where unaligned memory access carries no penalty (any Intel Core CPU after Sandy Bridge, for instance), it’s very likely to compile down to a plain old MOV instruction when the size is small and known at compile time.
Even if the compiler didn’t know what memcpy() does, it would fill a register only once, potentially with expensive byte-by-byte copies because of unaligned pointer access, and then proceed to just compare integers.
There is less function call overhead; especially nice since this is most likely not going through the PLT.
The compiler can reorder the comparisons as it sees fit, often producing very tight code.
These kinds of micro-optimizations don’t necessarily have to be completely unreadable and full of magical constants.
Coreboot & LUKS
My laptop is a 6-year-old ThinkPad X220. Although it’s almost falling apart from years of constant abuse, I don’t see myself replacing it anytime soon: it’s easy to repair, has a great keyboard, and is a very dependable machine.
And it’s supported by Coreboot. Replacing the proprietary firmware with it is trivial: I followed the instructions in this blog post and they worked out of the box. (I also went the extra mile and flashed the firmware after passing it through me_cleaner.)
Flashing the serial flash using a Raspberry Pi 3. Yes, I need to clean up this computer.
The major difference from my previous setup is that my SSD had hardware-based full disk encryption. I ended up disabling this for two reasons: first, this isn’t very secure (the key will remain in the disk RAM for as long as power is supplied); second, I was not sure if Coreboot supported this. So I disabled encryption prior to flashing the new firmware.
But keeping a hard drive unencrypted on a laptop isn’t good practice. I decided to use LUKS instead.
However, instead of using SeaBIOS
as the payload and have a standard bootloader, I opted to go through a
slightly different route: have a custom-built Linux inside the ROM, open the
/boot partition with LUKS, and
kexec the current
vmlinuz/initrd.
Compared to the usual setup of using SeaBIOS as a payload, this reduces boot time by cutting out the middleman. With the ability to boot from external devices removed, it’s also arguably more secure. The in-ROM Linux has only the bare minimum: no network subsystem, only the necessary file systems, the bare minimum of drivers built in, USB limited to HID devices, and so on; the compressed kernel is ~1.7MiB, with room to shrink. The in-ROM initrd is also quite minimal, containing just one file.
The only file is a hacked version of
cryptsetup that acts as a primitive init, creating /proc, /dev (and mounting
these two), and /boot, decrypting /boot, and performing kexec. It’s
statically linked with musl libc.
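The sequence that hacked cryptsetup performs can be sketched in shell for readability (the real thing is a single static C binary; the device names and paths below are made up for illustration):

```shell
# Hypothetical sketch of the in-ROM init's job; names are illustrative.
mount -t proc proc /proc
mount -t devtmpfs dev /dev
cryptsetup open /dev/sda1 boot         # prompts for the LUKS passphrase
mount /dev/mapper/boot /boot
kexec -l /boot/vmlinuz --initrd=/boot/initrd.img --reuse-cmdline
kexec -e                               # jump into the real kernel
```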
Flashing this requires opening the laptop, and I’m planning to do this next weekend when replacing the USB ports. However, the setup works very well under QEMU.
This blog post isn’t meant as a tutorial – feel free to contact me if you have questions or ideas on how to improve this. If you end up using something similar to this idea, I’d love to know as well.
Parsing JSON
There are many libraries out there to parse JSON files. It might seem a futile attempt, then, to write yet another one. However, when you’re working on an RTOS where memory is golden and the alternatives don’t look that great, you have to do something about it.
Recently I wrote a JSON parser for a project at work. This parser uses constant memory, regardless of the amount of data it’s working with, and deserializes directly to a C struct. It is similar, in spirit, to the JSON parser in the Golang standard library, which encodes and decodes data based on a tagged struct.
The lexer is the usual state machine, where the state itself is a function pointer to a function that handles that particular state.
I’ve been using this technique for a while, and I’ve found it a very clean and efficient way of describing state machines, especially for lexers.
I began using it after a coworker wrote a parser for a DSL with it – and he got the idea from the – you guessed it – Golang template package. (There’s a nice talk by Rob Pike about it; I recommend it not only for the lexing goodness, but also for the tips on how to evolve a concurrent design.)
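The core of the technique fits in a few lines. A minimal sketch (all names here are mine, not the parser’s): each state is a function that consumes some input and returns the next state, and a one-member struct sidesteps C’s inability to declare a function type that returns its own type:

```c
#include <ctype.h>
#include <stddef.h>

struct lexer {
    const char *pos;
    int numbers; /* toy output: how many number tokens were seen */
};

/* A state is a function returning the next state (or NULL to halt). */
struct state {
    struct state (*fn)(struct lexer *);
};

static struct state state_default(struct lexer *l);
static struct state state_number(struct lexer *l);

static struct state state_default(struct lexer *l)
{
    if (*l->pos == '\0')
        return (struct state){NULL}; /* end of input: halt */
    if (isdigit((unsigned char)*l->pos))
        return (struct state){state_number};
    l->pos++; /* skip anything that isn't a number */
    return (struct state){state_default};
}

static struct state state_number(struct lexer *l)
{
    while (isdigit((unsigned char)*l->pos))
        l->pos++;
    l->numbers++; /* one number token fully consumed */
    return (struct state){state_default};
}

static int count_numbers(const char *input)
{
    struct lexer l = {input, 0};

    /* The driver loop is the entire scheduler: run the current
     * state, then follow whatever state it hands back. */
    for (struct state s = {state_default}; s.fn;)
        s = s.fn(&l);
    return l.numbers;
}
```

What makes this clean is that each state function reads like a description of that state’s grammar rule, and adding a state never touches the driver loop.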
The parser implementation itself is nothing to write home about. However, by using the same idea used in Lwan’s mustache template engine to obtain the variables, it manages to do some things that are not common in JSON parsers written in C:
It will accept only values of known types for a particular key.
It will save the decoded value directly in a struct field.
It won’t try to decode the same field twice.
The first point is crucial when working with data received from the network, which is precisely what I’m dealing with. It avoids problems such as type confusion, and moves the responsibility of checking types into the library rather than onto its user.
By saving the decoded value directly into a struct field, it does use a predictable amount of memory. This is good, as it’s not going to balloon out of control, or require some guesswork to know beforehand how many tokens are going to be necessary to deserialize some values. The C compiler already knows exactly how many bytes a struct needs.
Some fields might be optional in a JSON blob. So the parser uses a bitmask to mark which fields have been decoded (and returns that, so that the library user can efficiently test if a value has been deserialized or not). Since it was easy to do, the library refuses to decode a key that has been deserialized before.
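The bitmask bookkeeping can be sketched independently of the parser (the helper name below is made up): each descriptor index owns one bit of a 32-bit mask, set on a successful decode and checked to reject duplicate keys, which is also why the parser tops out at 32 fields:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns false if this field was already decoded. */
static bool mark_decoded(uint32_t *seen, unsigned field_index)
{
    uint32_t bit = UINT32_C(1) << field_index;

    if (*seen & bit)
        return false; /* key decoded before: refuse to decode it twice */
    *seen |= bit;
    return true;
}
```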
So, a typical use is the following:
/* First, define a struct to hold the values. */
struct values {
    char *some_string;
    int some_int;
    bool some_bool;
};

/* Then, define a descriptor for that struct. */
static const struct json_descr values_descr[] = {
    FIELD(struct values, some_string, JSON_TOK_STRING),
    FIELD(struct values, some_int, JSON_TOK_NUMBER),
    FIELD(struct values, some_bool, JSON_TOK_TRUE),
};

/* (FIELD is just a macro that saves the offsetof() of
 * each struct member so that a pointer can be produced
 * afterwards.) */

/* It's now just a matter of parsing the JSON. */
struct values values;
int32_t ret = json_parse(serialized, strlen(serialized),
                         values_descr, ARRAY_SIZE(values_descr),
                         &values);

/* Bits 0, 1, and 2 of ret will be set if some_string,
 * some_int, and some_bool have been successfully
 * deserialized. */
Another thing that could be done – but hasn’t been implemented yet – is the opposite: using the descriptor and a struct to produce JSON-encoded data. This has many advantages over the usual JSON libraries, which require generating a JSON tree in memory just to serialize it afterwards.
And although I’m quite happy with this code, there are still some limitations that I’ll address whenever I have the need.
Mainly, there’s no way to parse nested objects or arrays. I’ve written code to do this, but those changes haven’t gotten any fuzz-testing action yet, so I’m holding them off until my living room heater^Wcomputer has worked on the problem for at least a week.
Another limitation – not a deal breaker for an embedded OS – is the lack of floating-point numbers (only integers for now). Parsing floating point is trickier than it sounds, and there’s no strtod() in Zephyr’s minimal libc.
As a minor issue to work around, there’s the return value: this limits the number of fields to be parsed to 32; that should be plenty for most uses. However, in the unlikely event that’s not sufficient, this can be worked around by having multiple descriptors.
And, finally, there’s the JSON encoding part that I’ve mentioned already.
The major problem with this piece of code, one that can’t be fixed by writing more code, is that I can’t use it with Lwan for licensing reasons: although it is open source, as part of the Zephyr project it is licensed under the Apache 2 license, which is incompatible with the GPLv2+ used by Lwan (I would have to bump it to [L]GPLv3).
(For those that might ask how fast is it: it’s fast enough. I didn’t measure, I didn’t compare, and I don’t really care: it’s readable, maintainable, and does the job.)