include_next and portability

One of the decisions I took early on while writing Lwan was to only support Linux, and think about portability later; this decision was influenced by the way the OpenBSD project approaches portability.

In retrospect, this was a good decision: this avoided many of the pitfalls associated in writing abstractions too early in the game. It also made the code cleaner: the abundance of C preprocessor usage, common in some portable code, hinders legibility and maintainability. Of course, this decision made it challenging to port it to other operating systems.

I was content with this decision – until people began asking for BSD and Mac ports. With the exception of some system calls (e.g. epoll, or the Linux sendfile variant), porting shouldn’t be surprising. Ideally, having the code largely #ifdef free would be ideal, so I had to find a way to make this happen.

While reading the GCC manual, I found out about an extension – that also happens to be implemented by Clang – that fit perfectly this scenario: wrapper headers. It’s a C preprocessor extension that includes the next file in the include lookup path. With this extension, it’s possible to write our own substitute header files, named after standard header files:

#include_next <stdlib.h> /* Include stdlib.h from system includes */

#ifndef MISSING_STDLIB_H_
#define MISSING_STDLIB_H_

#if !defined(HAVE_MKOSTEMP)
int mkostemp(char *tmpl, int flags);
#endif

#if !defined(HAVE_REALLOCARRAY)
void *reallocarray(void *optr, size_t nmemb, size_t size);
#endif

#endif /* MISSING_STDLIB_H_ */

Have it in a directory named, say, “missing”, and modify the header lookup path so it is looked up first by the compiler. This is easily accomplished in CMake by specifying an include directory with the BEFORE option:

include_directories(BEFORE src/lib/missing)

(This just ensures that src/lib/missing will be passed before any other -I argument to the compiler, regardless of the order any other include_directories() macro is invoked. Your build system might differ, this is copied straight from Lwan’s.)

missing directory tree

Then it’s just the matter of implementing these functions in terms of other functions available in the system, and code using it will be none the wise: a #include <stdlib.h> line will include our wrapper header, which in turn will include the system’s stdlib.h header; it then might define, in this example, additional prototypes, based on what the build system could determine during the configuration phase.

This way, most #ifdefs are hidden away in a single file, making it a lot easier to maintain and read the code. No application-specific abstraction layer with quirky semantics; just the familiar quirkiness from POSIX.

One of the things I’m particular proud of is the miniature epoll(7) implementation on top of kqueue (available in BSD systems). I considered moving Lwan to use an abstraction library (such as libevent or libuv) just for this, but was able to keep using its event-handling loop as is. Not only I understand 100% of it, it was a worthwhile learning experience. With ~120 lines of C code, this epoll implementation is easier to wrap my head around than the thousands of lines of code from those libraries.

More on string switch in C

Talking about uncommon programming tricks the other day, the subject of switching on strings in C appeared on the table.

If you follow this blog, you know it’s something actually possible to do with a little bit of ingenuity; in fact, it’s one of the things that I use in Lwan to parse the HTTP requests. I didn’t spend too much time in that blog post to explain why it is faster, so I’m rectifying this now.

In order to understand why it’s so fast, let me step aside for a moment and show a function every C programmer should be able to write: strlen().

size_t strlen(const char *s) {
    size_t len = 0;

    while (*s != '\0') {
        len++;
        s++;
    }

    return len;
}

Style issues aside, this is pretty much the simplest way to implement this function. And, maybe the one that generates the slowest code. There are many reasons here, so let’s explore some of them.

One of them is that CPUs are able to fetch more than a single byte at a time from memory (or cache). And it takes roughly the same time for it to fetch 8 bits than it takes to fetch 32 bits. People that write C libraries know about this fact, so that the version that your operating system provides is most likely going to exploit this. Let’s rewrite the strlen() function, then:

size_t strlen(const char *s) {
    size_t len = 0;

    while (true) {
        uint32_t v = *(uint32_t *)s;

        if ((v & 0xff) == '\0') return len;
        if (((v >> 8) & 0xff) == '\0') return len + 1;
        if (((v >> 16) & 0xff) == '\0') return len + 2;
        if (((v >> 24) & 0xff) == '\0') return len + 3;

        len += 4;
        s += 4;
    }
}

There is a big issue in this code: it is invoking undefined behavior. The explanation of why this is illegal in C will follow, but for now, let’s just assume that the integer cast from a character pointer is valid.

With that assumption in mind, what this program is doing is in fact very simple: it is reducing the amount of expensive operations (fetching things from memory), and is increasing the amount of cheap operations (masking bits and comparing integers). In fact, that’s a recurring theme whenever you’re trying to optimize any algorithm: the computer can’t run a piece of code faster, but you can make the computer run smarter code that produces the same output.

This program, however, will most likely crash on some platforms, or be really slow on others. The reason is that it is trying to read a pointer that is not aligned. Some CPU designers decided to increase the complexity and perform more work behind the scenes to make this work, and some decided it wasn’t worth the trouble and will just generate an exception. The major problem, however, is that this is precisely the undefined behavior I was talking about. So, let’s fix this function, by modifying it slightly so that the fast path operates on aligned pointers:

static inline bool is_ptr_aligned(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;

    /* Assuming a 32-bit machine with 4-byte alignment for
     * integers.  Executing "AND (n - 1)" is equivalent to "MOD
     * (n)", without an expensive division operation; this is
     * true for every (n), as long as (n) is a power of 2.
     * Compilers can do this optimization on constant values,
     * but I prefer to be explicit. */
    return (p & 0x3) == 0;
}

size_t strlen(const char *s) {
    size_t len = 0;

    /* Read one byte at a time until the pointer is aligned. */
    while (!is_ptr_aligned(s)) {
        if (*s == '\0')
            return len;
        s++;
        len++;
    }

    /* Pointer is aligned, try the faster version now. */
    while (true) {
        uint32_t v = *(uint32_t *)s;

        if ((v & 0xff) == '\0') return len;
        if (((v >> 8) & 0xff) == '\0') return len + 1;
        if (((v >> 16) & 0xff) == '\0') return len + 2;
        if (((v >> 24) & 0xff) == '\0') return len + 3;

        len += 4;
        s += 4;
    }
}

There is still an even faster way, that is to improve the way the NUL byte is found in a word. The excellent (and one of my favorite websites) “Bit Twiddling Hacks” web page has a method to find out if a word contains a NUL byte; it doesn’t tell which byte is the NUL byte, but we don’t need to know that in the fast path:

size_t strlen(const char *s) {
    size_t len = 0;
    uint32_t v;

    /* Read one byte at a time until the pointer is aligned. */
    while (!is_ptr_aligned(s)) {
        if (*s == '\0')
            return len;
        s++;
        len++;
    }

    /* Pointer is aligned, try the faster version now. */
    while (true) {
        v = *(uint32_t *)s;

        /* Use a fast bit twiddling hack to find if the
         * next 4 bytes in the string has a 0 byte. If
         * it does, find out which byte it is. */
        if ((v - 0x01010101) & ~v & 0x80808080)
            break;

        len += 4;
        s += 4;
    }

    if ((v & 0xff) == '\0') return len;
    if (((v >> 8) & 0xff) == '\0') return len + 1;
    if (((v >> 16) & 0xff) == '\0') return len + 2;

    return len + 3;
}

Another thing to consider in these functions is that they’re not endian neutral. A decent implementation would ensure that it would work both on little-endian and on big-endian machines. A decent programmer would even build and test these; I didn’t.

Yet another thing to consider is that hand-tuned assembly versions, most likely written to make use of vector instructions, are the ones that your computer are executing at this very instant; but they all draw from these very same ideas: read memory less often, compare in bulk.

Also, there are most likely better ways to write these functions, even without vector instructions. But this is besides the point of explaining why the string switch trick works so well.

Now, this kind of optimization happens on pratically all string handling functions in the C standard library. And functions that perform substring comparison, such as strncmp(), are no exception.

When faced with the necessity to check for a bunch of strings, the idiomatic C way of doing so would be the following:

if (!strncmp(s, "GET ", 4))
    return HTTP_METHOD_GET;
if (!strncmp(s, "POST ", 5))
    return HTTP_METHOD_POST;
if (!strncmp(s, "DELETE ", 7))
    return HTTP_METHOD_DELETE;
if (!strncmp(s, "HEAD ", 5))
    return HTTP_METHOD_HEAD;
/* ... */
return HTTP_METHOD_UNKNOWN;

It’s not difficult to imagine that each invocation of strncmp() would have to do things that are similar to what our toy strlen() implementation had to do: align the pointer (which is slow) before it could start the fast path. But, in this case, things are even worse, because, if the pointer isn’t aligned, it might not even get to the point where the fast path will make sense, because the strings it is comparing against are very close to the alignment of the word size for this computer!

So, to recap, what the string switch does is the following:

static inline uint32_t string_to_uint32(const char *s) {
    uint32_t v;

    memcpy(&v, s, sizeof(v));

    return v;
}

#define STRING_SWITCH(s) switch (string_to_uint32(s))
#define MCC(a, b, c, d) ((a) << 24 | (b) << 16 | (c) << 8 | (d))
enum {
    METHOD_GET = MCC('G', 'E', 'T', ' '),
    METHOD_POST = MCC('P', 'O', 'S', 'T'),
    METHOD_DELETE = MCC('D', 'E', 'L', 'E'),
    METHOD_HEAD = MCC('H', 'E', 'A', 'D'),
    /* ... */
};

STRING_SWITCH(s) {
case METHOD_GET:
    return HTTP_METHOD_GET;
case METHOD_POST:
    return HTTP_METHOD_POST;
case METHOD_DELETE:
    return HTTP_METHOD_DELETE;
case METHOD_HEAD:
    return HTTP_METHOD_HEAD;
/* ... */
default:
    return HTTP_METHOD_UNKNOWN;
}

The good thing about the switch statement in C is that it is maybe the highest level statement in the language: the compiler can get really creative in how its code is generated. It’s not uncommon for it to generate jump tables or even binary searches. So this implementation would actually be faster than the various calls to strncmp() because:

  1. Comparing integers is dirt cheap.
  2. The compiler knows what memcpy() does, so it’s very likely that on architectures where unaligned memory access is fine and there’s no performance penalty (any Intel Core CPU after Sandy Bridge for instance), it’ll be just a regular old MOV instruction when the size is small and known at compile time.
  3. Even if the compiler didn’t know what memcpy() does, it would only fill a register once, by doing potentially expensive byte-by-byte copies because of unaligned pointer access, and then proceed to just compare integers.
  4. There is less function call overhead; specially nice since this is most likely not going through the PLT.
  5. The compiler can reorder the comparisons as it see fit, often producing very tight code.

These kinds of micro-optimizations don’t necessarily have to be completely unreadable and full of magical constants.

Coreboot & LUKS

My laptop is a 6-year old ThinkPad X220. Although it’s almost falling apart from years of constant abuse, I don’t see myself replacing it anytime soon: it’s easy to repair, has a great keyboard, and is a very dependable machine.

And it’s supported by Coreboot. Substituting the proprietary firmware with it is very trivial: I followed the instructions on this blog post and they worked out of the box. (I also went the extra mile and flashed the firmware after passing it through me_cleaner.)

flashing X220 bios

Flashing the serial flash using a Raspberry Pi 3. Yes, I need to clean up this computer.

The major difference from my previous setup is that my SSD had hardware-based full disk encryption. I ended up disabling this for two reasons: first, this isn’t very secure (the key will remain in the disk RAM for as long as power is supplied); second, I was not sure if Coreboot supported this. So I disabled encryption prior to flashing the new firmware.

But keeping a hard drive unencrypted on a laptop isn’t good practice. I decided to use LUKS instead.

However, instead of using SeaBIOS as the payload and have a standard bootloader, I opted to go through a slightly different route: have a custom-built Linux inside the ROM, open the /boot partition with LUKS, and kexec the current vmlinuz/initrd.

Compared to the usual setup of using SeaBIOS as a payload, this setup reduces boot time by cutting the middlemen. With the ability to boot from external devices removed, it’s also arguably more secure. The in-ROM Linux has only the bare minimum: no network subsystem, only necessary filesystems, bare minimum drivers are built-in, USB is limited to HID devices, etc; the compressed kernel has ~1.7MiB with room to shrink. The in-ROM initrd is also quite minimal, containing just one file.

The only file is a hacked version of cryptsetup that acts as a primitive init, creating /proc, /dev (and mounting these two), and /boot, decrypting /boot, and performing kexec. It’s statically linked with musl libc.

Flashing this requires opening the laptop, and I’m planning to do this next weekend when replacing the USB ports. However, the setup works very well under QEMU.

This blog post isn’t meant as a tutorial – feel free to contact me if you have questions or ideas on how to improve this. If you end up using something similar to this idea, I’d love to know as well.

Parsing JSON

There are many libraries out there to parse JSON files. It might be a futile attempt, then, to write yet another one. However, when you’re working on a RTOS where memory is golden, and the alternatives don’t look that great, you got to do something about it.

Recently I wrote a JSON parser for a project at work. This parser uses constant memory, regardless of the amount of data it’s working with, and deserializes directly to a C struct. Similar, in spirit, to the JSON parser that’s part of the Golang standard library, that encodes and decodes data based on a tagged structure.

The lexer is the usual state machine, where the state itself is a function pointer to a function that handles that particular state.

I’ve been using this technique for a while, and I found that it’s a very clean and efficient way of describing state machines, specially for lexers.

I began using it after a coworker wrote a parser for a DSL using it – and he got the idea from the – you guessed – Golang template package. (There’s a nice talk by Rob Pike about it – I recommend this talk not only for the lexing goodness, but also for the tips on how to evolve a concurrent design.)

The parser implementation itself is nothing to write home about. However, by using the same idea used in Lwan’s mustache template engine to obtain the variables, it manages to do some things that are not common in JSON parsers written in C:

  • It will accept only values of known types for a particular key.
  • It will save the decoded value directly in a struct field.
  • It won’t try to decode the same field twice.

The first point is crucial when working with data received from the network, which is precisely the kind of thing I’m dealing with. This avoids problems such as type confusion and such, and moves the responsibility of checking the types to the library rather than the user of the library.

By saving the decoded value directly into a struct field, it does use a predictable amount of memory. This is good, as it’s not going to balloon out of control, or require some guesswork to know beforehand how many tokens are going to be necessary to deserialize some values. The C compiler already knows exactly how many bytes a struct needs.

Some fields might be optional in a JSON blob. So the parser uses a bitmask to mark which fields have been decoded (and returns that, so that the library user can efficiently test if a value has been deserialized or not). Since it was easy to do, the library refuses to decode a key that has been deserialized before.

So, a typical use is the following:

/* First, define a struct to hold the values. */
struct values {
    char *some_string;
    int some_int;
    bool some_bool;
};

/* Then, define a descriptor for that struct. */
static const struct json_descr values_descr[] = {
    FIELD(struct values, some_string, JSON_TOK_STRING),
    FIELD(struct values, some_int, JSON_TOK_NUMBER),
    FIELD(struct values, some_bool, JSON_TOK_TRUE),
};
/* (FIELD is just a macro that saves the offsetof()
 * each struct member so that a pointer can be produced
 * afterwards.)  */

/* It's now just a matter of parsing the JSON now. */
struct values values;
int32_t ret = json_parse(serialized, strlen(serialized),
    values_descr, ARRAY_SIZE(values_descr), &values);

/* Bits 0, 1, and 2 of ret will be set if some_string,
 * some_int, and some_bool have been successfully
 * deserialized.  */

Another thing that could be done – but that has not been implemented yet, is to do the opposite as well: the descriptor and a struct to produce JSON-encoded data. This has many advantages over the usual JSON libraries that require generating a JSON tree in memory just to serialize it afterwards.

And although I’m quite happy with this code, there are still some limitations that I’ll address whenever I have the need.

Mainly, there’s no way to parse nested objects or arrays. I’ve written code to do this but these changes haven’t gotten any fuzz-testing action so I’m holding them off it until my living room heater^Wcomputer has worked on the problem for at least a week.

Another one that’s not a deal breaker for an embedded OS is the lack of floating pointing numbers (only integers for now). Parsing floating point is trickier than it sounds, and there’s no strtod() in Zephyr’s minimal libc.

As a minor issue to work around, there’s the return value: this limits the number of fields to be parsed to 32; that should be plenty for most uses. However, in the unlikely event that’s not sufficient, this can be worked around by having multiple descriptors.

And, finally, there’s the JSON encoding part that I’ve mentioned already.

The major problem with this piece of code, that can’t be fixed by writing more code, is that I can’t use it with Lwan due to licensing reasons: although it is open source, part of the Zephyr project, it is licensed under the Apache 2 license, which is incompatible with the GPL2+ used by Lwan (would have to bump it to [L]GPL3).

(For those that might ask how fast is it: it’s fast enough. I didn’t measure, I didn’t compare, and I don’t really care: it’s readable, maintainable, and does the job.)

Infect to Protect

Bandwagons

I’m not one to jump on each and every bandwagon I see. Sometimes that’s a good decision, sometimes it’s better to just wait and see where they go before taking any action.

Containers are one of those ideas that, while promising and intriguing, were quite clumsy in the beginning, so I ignored them for a good while. It’s sufficiently mature now; so much so that’s quite difficult to ignore them. Time to investigate them again.

Now, most of my work revolve around writing embedded software that runs on bare metal; containers don’t really solve any work-related problem I have. For personal usage, package management is more than sufficient to install programs. However, the sandbox aspect of containers are quite interesting and it’s something I’d like to know more about.

There are many articles around the web explaining how containers on Linux work. Some get out of their way to explain in depth all the machinery necessary to make them work, so there’s no need to repeat it here.

But, in sum: almost all of the kernel side of things was already present before containers were actually a thing: cgroups, system call filters, etc. Containers (and their runtimes) only make them so simple to use it’s transparent for the user.

I usually have a hard time understanding things that I cannot build, so I decided to build a toy container runtime. It’s crude and it’s a far cry from what any industrial-strength container runtime is capable of, but it’s not only a start, it’s implemented in a way that makes things a lot easier for the user.

Virulent tutorials

Before I go into details on how my contraption works, a little bit of background. I’ve been using Linux for over 18 years, and began my forays in C about 14 years ago.

Around that time, a pretty interesting HOWTO explaining how to create viruses for ELF binaries came out. It explained not only various methods of infecting an ELF executable, but also methods to detect them. Suffice to say, I couldn’t understand a thing back then. A few months ago, though, a conversation in the local hackerspace brought up that tutorial; I could now finally not only understand the techniques but put them to use.

One of the techniques explained in the HOWTO involves finding some unused space in an ELF segment that’s also executable, writing shellcode to that area, rewiring the executable’s entry point to point to the shell code, and modifying the shell code so that it points to the original entry point. It’s all quite Rube Goldberg-ey, but it’s actually quite simple.

This way, a chunk of code can be executed every time that program starts, without altering the size of the program. The perfect crime.

Perfect crime

Dual use technology

By now, you’ve most likely connected the dots: the idea is to use the very same technique, originally designed for viruses, to create a program that transforms any program into a sandboxed version of itself.

The prototype I wrote is very elementary; the only thing it does is limiting, just once, which system calls a program can execute.

Sort of a less-powerful version of OpenBSD’s pledge(2) (née tame(2)), which can be repeatedly called to reduce the amount of privileges a process has. Useful, for instance, in cases where a configuration file has to be read before processing user-supplied work. That BSD version has been sprinkling calls to pledge() in almost all of the programs in the base install (which is easier for a BSD system, since everything is kept under the same roof.)

But, unlike pledge(2), this thing can be applied to binaries that have been already built. No source code modifications are necessary. If your distribution can withstand the stench, “infected” binaries could be a thing in the default installation.

Filtering the system calls

Any respectable container runtime will perform a lot of tasks to sandbox a process and their children. So, for a proof of concept, I decided to do just the bare minimum: limit system calls using Seccomp-BPF.

Seccomp is a set of features present in the Linux kernel, since the 2.6.x days, that allows restricting what a program can do, system call-wise. The original intent was to do not permit any other system calls excepting those to end the program, and read and write to already-opened file descriptors. In some scenarios, this is perfectly acceptable. For others, there’s the seccomp-BPF extension.

BPF stands for Berkeley Packet Filter. A famous use of BPFs is in the tcpdump program, where rules such as “only give me back TCP fragments with the RST flag set” can be passed to the kernel; packets that don’t match the filter are not copied back to the userland, reducing a lot of the chatter between the two lands.

Obviously, this must be extremely performant, since kernel time must be conserved at all costs (the kernel is there only to serve userland, after all). Linux has many ways to speed up BPF programs, including an in-kernel JIT compiler. Some restrictions are in place that wouldn’t allow BPF programs to take an infinite amount of time to execute, and this blog post is a good introductory reading material on the subject.

Another, slightly less famous use of BPFs is with the seccomp-BPF extension. Instead of filtering network packets, processes can, for instance, pick which system calls they’re allowed to perform. And that’s precisely what’s necessary for my proof of concept.

Scripting like a kid

There are many ways to skin a cat. I decided to take a look how other programs were doing their sandboxes, and eventually found one that seemed easy enough to copy the technique from.

Unfortunately, writing shellcodes in C isn’t that easy, specially if you don’t know which C library a program was linked with (or if it were linked to a C library in the first place). Luckily, all the shellcode has to do is make two system calls, which is straightforward to do with a little bit of assembly.

The first call will forbid the process from getting more privileges. The second call will actually copy the BPF program to the kernel side.

The first call is painless: just set a few registers, invoke the syscall, done.

The other one takes a little bit more work. A few things helped: I’ve used nasm, which is a macro assembler, and wrote a few macros that let me write BPF programs as if they were standard x86-64 instructions.

The remaining issue is that a pointer to the BPF program must be passed to the call to prctl(), and the shellcode must be relocatable. A common trick to perform in these scenarios is to employ the fact that, on x86, when a call instruction is made, the return address (i.e. the address of the byte right after the call instruction) is pushed to the stack:

    ; …
    jmp push_bpf_addr
apply_filter:
    pop rdx     ; rdx points to the BPF program
    ; …
push_bpf_adr:
    call apply_filter
bpf:
    bpf_stmt ; …
    bpf_jump ; …
    sc_allow ; …
    ; …
bpf_end:

The bpf label doesn’t point to any x86 instruction: it contains only macros that expands to the definitions of struct sock_filter as defined in linux/filter.h. To copy the BPF program to the kernel, the prctl() call expects a struct sock_fprog, which contains the BPF program length (in number of struct sock_filter elements), and a pointer to the base of that array. Since there’s no way to know where this code is gong to land in memory beforehand, this trick comes in handy: after the call apply_filter instruction, the top of the stack now contains the base address of that array.

Now that I had a way to write the shellcode, it was just the matter of shoehorning it into the executable.

Hacking time

Scoring a goal

For the proof of concept, I was initially going to write the infection program in Python, as I usually do for throwaway code. However, I wasn’t successful in finding a working ELF library that would let me dump the modified executable.

I was too lazy to actually fix or write support for that, so I kept looking for alternatives and ended up finding the ELFkickers suite from the always excellent Muppet Labs. It includes an “infect” program that does exactly what says in the tin: it takes in an executable file, and produces another executable file that creates a setuid shell before continuing to the original program. Exactly what one would expect from a program with nefarious purposes.

So I substituted the original shellcode for the one I’ve just assembled, and now I had a proof of concept. Which of course didn’t work the first few tries. In fact, it took a long while to get it right.

Debugging the contraption with gdb

The GNU Debugger is indeed very powerful, but ease of use (compared to the Turbo Debugger I used to use in the DOS days) is not it’s strong suit. I’m not used to using it to debug programs without access to source, and this was a good opportunity to learn a few things.

Since the infection program modifies the ELF entry point, setting a breakpoint on main() won’t actually work. But this is easily solvable: just use readelf(1) to find where the new entry point is, and set a breakpoint to that:

$ gcc -o hello hello.c
$ readelf -h hello | grep Entry
  Entry point address: 0x400490
$ ./infect hello
$ readelf -h hello | grep Entry
  Entry point address: 0x4007bc
$ gdb ./hello

(gdb) break *0x4007bc
Breakpoint 1 at 0x4007bc

From now on, it’s just the usual execute-inspect-modify-reassemble-reinfect loop until it works. Although it’s no td, I’m certainly glad GDB has layouts that displays both the disassembly and the registers.

Step-by-step debugging

Watching the magic happen

The hello program is very short and the call to socket(2) doesn’t make much sense there. It’s just a way to test what’s going to happen when the filter is in place, without the need to modify the program to test this assumption. (Lots of things happens when executing a simple program such as this.)

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
            printf("no socket created\n");
    } else {
            int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
            printf("created socket, fd=%d\n", fd);
    }
    return 0;
}

Executing the program before infecting it gives the following output, as expected:

$ ./hello
no socket created
$ ./hello 1
created socket, fd = 3

Indeed, if the program is executed under strace, it all goes exactly like it’s supposed to be:

$ strace ./hello
execve("./hello", ["./hello"], [/* 58 vars */]) = 0
…
write(1, "no socket created\n", 18no socket created
)     = 18
exit_group(0)                           = ?
+++ exited with 0 +++

And, with a command-line argument, so the socket is created:

…
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
…
write(1, "created socket, fd = 3\n", 23created socket, fd = 3
) = 23
exit_group(0)                           = ?
+++ exited with 0 +++

However, the magic happens after the “infected” binary is executed. First, without creating a socket:

…
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)  = 0
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, {len=30, filter=0x400824}) = 0
…
write(1, "no socket created\n", 18no socket created
)     = 18
exit_group(0)                           = ?
+++ exited with 0 +++

Notice the calls to prctl(), very similar to the ones found in the previously-mentioned commit. And then the program executes as usual. Now, if an argument is passed, the program will attempt to create a socket:

…
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)  = 0
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, {len=30, filter=0x400824}) = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 41
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_call_addr=0x7f2d01aa19e7, si_syscall=__NR_socket, si_arch=AUDIT_ARCH_X86_64} ---
+++ killed by SIGSYS (core dumped) +++
[1]    27536 invalid system call (core dumped)  strace ./hello 1

And Seccomp kicks in and kills the program with a SIGSYS signal. As expected. It’s alive!

It's alive!

Next steps

The prototype works. But there are a few things that must be considered before even considering this idea for anything.

System call whitelist

The list of system calls is still hardcoded within the shellcode. That’s not optimal. Maintaining a list such as this for each and every program will most likely be so boring nobody is going to do that.

I can think of three possible ways of coming up with this list.

The first would be doing the same thing pledge(2) does: allowing a very restrict set of system calls at first, with some limitations, and then providing a few sets of calls per set of features a program might use: stdio, inet, tty, etc. The nice thing about this is that the filters are more fine grained; it’s not just a whitelist of system calls. (The man page has more details.)

The second way would involve running the program under strace(1) and record which system calls the program makes from a few runs. If the test coverage for each run is sufficiently high, this will work very reliably; this isn’t always the case, so the mileage may vary. Also, for certain large, complicated programs, stracing it all automatically could prove to be a challenge.

Another way would be the following: Grab a list of undefined symbols a program uses, and find them in the shared libraries it links to. Then scan the executable and the libraries for sequences like mov eax, 57; syscall (for the oldschool fork(2) syscall on x86-64) or mov rdi, 57; call syscall@plt. This is still not foolproof, since not necessarily a system call number (loaded into eax) will be hardcoded within a program or shared library.

There’s a fourth idea, as well, which involves both doing the automated static analysis on the binary and running strace to catch “runaway” syscalls. This can get quite complicated and it’s unlikely I’ll get it correct in the first few tries (and, yet, the same shortcomings will apply in the end.)

For me, though, these experiments are all about the hunt, not about the treasure. So the tried and true approach that pledge(2) uses won’t be used at first.

Filter optimization

Another thing that might be a problem is: on x86-64, Linux has hundreds of system calls. (329 according to sys/syscall.h at the moment I write this.)

Even if the JIT for BPFs is quite efficient, doing a linear search before each and every system call will certainly be a bottleneck. Also, BPF programs are limited in size, and a large whitelist that’s implemented the same way as the prototype will limit the possibility for more fine-grained filters. Things like “the socket(2) call is allowed only for UNIX-domain sockets”, rather than allowing whatever call to socket(2) would be impractical.

Since each syscall is identified by a number, a simple bitmap could be used to implement the whitelist. This will also free up some space in the BPF program for more detailed whitelisting for certain syscalls (for instance, only allowing certain family of sockets to be created).

After a quick read of networking/filter.txt, this seems doable by using an algorithm such as this, which will reduce the number of comparisons as the number of acceptable system calls increases:

        if syscall_number < 32:
                if bitmask_0 & 1<<syscall_number: goto accept
        if syscall_number < 64:
                syscall_number -= 32
                if bitmask_1 & 1<<syscall_number: goto accept
        if syscall_number < 96:
                syscall_number -= 64
                if bitmask_2 & 1<<syscall_number: goto accept
        …
        if syscall_number < 352:
                syscall_number -= 320
                if bitmask_10 & 1<<syscall_number: goto accept
        return SECCOMP_RET_KILL
accept:
        return SECCOMP_RET_ACCEPT

(Some of the if syscall_number < N blocks could be changed to syscall_number -= M if their respective bitmask is 0.)

Or maybe just a bloom filter instead of a series of bitmaps. I’ll have to experiment.

Getting a larger vessel

Containers, of course, are not just about restricting which system calls a program is allowed to perform. There are many things that can and must be considered before even calling this a container runtime, or really consider that this is in fact sandboxing anything. Learning about namespaces, cgroups and virtual machines are certainly on the list of things to learn about.

Conclusion

While the prototype I built isn’t practical and is of very limited use, I find the idea of sandboxed programs without the need for specialized runtimes very enticing.

Programs can be still packaged the way they have been packaged in the past decades, without throwing away some of the sandboxing benefits that containers provide, all the while not introducing new concepts for users.

Of course, something like this – even if properly implemented – won’t be a replacement for containers. Specially if one considers their role as packets ready for deployment, which have a lot of value for devops personnel.

The code, as usual, is open source, and available from this Git repository.