Alignment Bug

September 26, 2019

The Issue

For the past few years, I've been using jemalloc to allocate memory in my experiments, partly because of the usefulness of arena allocation. Because these arenas allocate memory in multiple-megabyte chunks (or extents, as they're called in jemalloc), this allocation does not necessarily happen each time you request memory from the arena. Instead, jemalloc allocates an extent only when needed, then holds onto that extent until the arena is freed or until some other heuristic determines that it should be freed.

However, for my research, I need to be able to bind all arena memory to a specific NUMA node; that is, for every extent allocation, I need to call mbind on that range of addresses. To do something like this, jemalloc provides the extent_hook interface. Briefly, this feature allows you to define a set of function pointers that will be used for all memory operations in the arena: allocating, deallocating, committing, etc. Notably, we define a function called sa_alloc, which allocates a new extent to an arena.

sa_alloc accepts quite a few arguments, including the size of the extent that it should allocate (size) and the alignment, in bytes, that the extent's starting address must satisfy (alignment). The size is handled easily: simply call mmap and pass it the correct size. The alignment is a little trickier, but this is how the original code did it. I've removed the error handling and other distracting elements. This code was not originally written by me, so I've added my own comments:


/* Do the initial allocation */
ret = mmap(new_addr, size, PROT_READ | PROT_WRITE, mmflags, sa->fd, sa->size);

/* Check if the allocation meets the alignment */
if (alignment == 0 || ((uintptr_t) ret)%alignment == 0) {
  goto success;
}

/* If it's not aligned properly, unmap and try again with a larger size */
munmap(ret, size);
size += alignment;
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, mmflags, sa->fd, sa->size);

/* Chop off and unmap the excess */
n = (uintptr_t) ret;
m = n + alignment - (n%alignment);
munmap(ret, m-n);
ret = (void *) m;

/* Finally, call 'mbind' on the new extent */
success:
  mbind(ret, size, mpol, nodemaskp, maxnode, MPOL_MF_MOVE);

For several years, this worked for a variety of applications and never threw any errors. Recently, however, it began failing in one particular application, AMG, and even there only rarely and in certain situations. Specifically, the final mbind call would return Bad address, the function would fail, and the runtime library would fail to allocate to an arena. I should also note that while perror prints out Bad address, the error isn't necessarily caused by the address argument itself. Poking through the do_mbind function in the Linux kernel, it seems that a good number of other conditions can produce that same EFAULT error.

The Search

To begin with, I didn't even consider the possibility that this code could be flawed. After all, it succeeded for several years before this, and I'd never seen it throw an error in the past. Nearly every day, I would run experiments that required this block of code, yet only now was it failing. After quickly printing out the arguments to mbind (they looked entirely ordinary), I started my search with what I considered to be more error-prone parts of the codebase.

Because the error only happened every so often, my first thought was a race condition: something, I thought, must be interfering with this allocation in some way, perhaps changing some of the heap-allocated structures from which I get some of the arguments to mbind. However, all of these variables are protected by an arena-wide mutex, so it's not possible for two threads to allocate to a single arena simultaneously.

The next thing that I considered was that something had gone awry with onlining some of the memory nodes. As I described in my previous post, I had some difficulty with some of the blocks of memory on my system being ZONE_MOVABLE. This being a recent problem, coupled with the fact that the error occurred more consistently when memory spilled onto other NUMA nodes, convinced me that mbind wasn't able to bind to certain regions of one of the NUMA nodes. That theory, however, was quickly debunked by simply binding to different nodes, which resulted in the same issue.

The Solution

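Finally, I took a harder look at the alignment code: were there rare situations in which it could fail, depending on some race condition? Printing out m, n, ret, etc., I suddenly realized the issue: after failing to get the correct alignment, the code adds alignment to size. Then, after adjusting the pointer and unmapping the excess at the front, size still holds that padded value. This then gets passed to mbind, which expects a fully mapped block of size bytes starting at ret. However, because the first m-n bytes of the allocation were unmapped, that range now runs m-n bytes past the end of the mapping (and alignment bytes past the extent that was actually requested). If whatever lies just past the mapping happens to be mapped as well, the call goes through; if there's a hole, mbind fails. The size needs to be updated to reflect what's actually left.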
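As a rough sketch of one way to avoid this class of bug (not the code from the runtime library), the padded mapping can be trimmed at both ends so that exactly size bytes remain, and the same size can then be handed to mbind unchanged. The helper name map_aligned is hypothetical, and I'm assuming an anonymous private mapping here rather than the file-backed one in the original, a power-of-two alignment of at least the page size, and a size that is a multiple of the page size:

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper, not the original sa_alloc.
 * Maps exactly `size` bytes at an address that is a multiple of `alignment`.
 * Assumes: `alignment` is a power of two and at least the page size,
 * and `size` is a multiple of the page size. Returns NULL on failure. */
static void *map_aligned(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* Over-allocate so that an aligned start is guaranteed to exist. */
    size_t padded = size + alignment;
    void *raw = mmap(NULL, padded, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    /* Round the start up to the next multiple of `alignment`. */
    uintptr_t start = (uintptr_t) raw;
    uintptr_t aligned = (start + alignment - 1) & ~((uintptr_t) alignment - 1);

    size_t head = aligned - start;        /* excess before the extent */
    size_t tail = padded - head - size;   /* excess after the extent  */

    /* Unmap both ends; what remains is exactly [aligned, aligned + size). */
    if (head != 0)
        munmap(raw, head);
    if (tail != 0)
        munmap((void *) (aligned + size), tail);

    return (void *) aligned;
}
```

Since the local size is never modified, a later mbind(ret, size, ...) covers only pages that are still mapped. A smaller change to the original code would be to keep the padded mapping and pass size - (m-n) to mbind instead.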

Now, why did it take so long for this issue to finally crop up? How could I run experiments for several years without experiencing this issue even once? I think that in order for this issue to actually occur, a perfect storm of conditions must be true:

The application must request a large amount of memory aligned by a size larger than the page size.
The application must be extremely heavily threaded.
The application must use lots of allocations and deallocations from those threads.