Locale-neutral case-insensitve string equality comparison 2022-09-15

After reading Daniel Stenberg's post on case-insensitive string comparison in cURL, I decided to take a stab at implementing something like this in Lwan for the same reasons. (Lwan used to use strcasecmp() before using the one described in this blog post.)

While this version isn't going to break any performance records – it's branchy and compares one character at a time – it's fairly compact and does not require tables to be built during startup like the OpenSSL version, mentioned in Daniel's blog.

It does use a very compact (32-bytes) table, hardcoded within isalpha_neutral(), though:

static inline int isalpha_neutral(char c)
{
    /* Use this instead of isalpha() from ctype.h because they consider
     * the current locale.  This assumes ASCII and CHAR_BIT == 8, which
     * are tested for in debug builds.  */
    static const unsigned char upper_table[32] = {
        0, 0, 0, 0, 0, 0, 0, 0, 254, 255, 255, 7, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0,   0,   0,   0, 0, 0, 0, 0,
    };
    unsigned char uc = (unsigned char)c;
    return upper_table[uc >> 3] & 1 << (uc & 7);
}

The core of the algorithm is nothing out of the ordinary: a linear scan through each character in both strings, checking which bits differ between characters.

We check if the character is an alphabetic character if and only if the 5th bit value differ between characters. That's because, in ASCII, the 5th bit differs between the upper and lower case versions of the same character; this significantly reduces calls to isalpha_neutral().

bool strcaseequal_neutral(const char *a, const char *b)
{
    for (;;) {
        char ca = *a++;
        char cb = *b++;

        /* See which bits are different in either character */
        switch (ca ^ cb) {
        case 0: /* ca and cb are the same: advance */
            if (ca == '\0') {
                /* If `ca` is 0 here, then cb must be 0 too, so we don't
                 * need to check both.  */
                return true;
            }
            continue;
        case 32: /* Only 5th bit is set: advance if either are uppercase
                  * ASCII characters, but differ in case only */
            /* If either is an uppercase ASCII character, then move on */
            if (isalpha_neutral(ca) || isalpha_neutral(cb))
                continue;
            /* Fallthrough */
        default:
            return false;
        }
    }
}

I did not bother comparing the performance against stracasecmp_l() from glibc, as this function is only used in non-performance critical parts of the code. Since most strings that are compared in Lwan are short (less than 16 characters), I haven't bothered making a SIMD version of this code as well.

It's also important to note that this only compares two strings against equality; it's not a direct substitute for strcasecmp() and locale-aware variants.

Note

To ensure that this function keeps working even if tweaks are made over the years, all debug builds are built with the following function, that is executed during startup (__attribute__((constructor))):

#ifndef NDEBUG
__attribute__((constructor)) static void test_strcaseequal_neutral(void)
{
    /* Sanity checks for isalpha_neutral() */
    static_assert(CHAR_BIT == 8, "sane CHAR_BIT value");
    static_assert('*' == 42, "ASCII character set");
    static_assert('0' == 48, "ASCII character set");
    static_assert('a' == 97, "ASCII character set");

    /* Tests for strcaseequal_neutral() */
    assert(strcaseequal_neutral("LWAN", "lwan") == true);
    assert(strcaseequal_neutral("LwAn", "lWaN") == true);
    assert(strcaseequal_neutral("SomE-HeaDer", "some-header") == true);

    assert(strcaseequal_neutral("SomE-HeaDeP", "some-header") == false);
    assert(strcaseequal_neutral("LwAN", "lwam") == false);
    assert(strcaseequal_neutral("LwAn", "lWaM") == false);
}
#endif

This trick of using an __attribute__((constructor)) function to test functions is something I'm using more and more in Lwan. Maybe it warrants its own blog post after cleaning it up a little bit as this practice isn't common in C programs?

Feel free to use this code in your program under the terms of the MIT license.

If you liked this post, consider getting me a coffee!