This week, I hit a particularly subtle issue that took me several days to find and fix. Although the fix was simple, diagnosing the problem was difficult. This post includes a short tutorial on how to use Intel's AEP DIMMs, as well as descriptions of some issues that I encountered while using them.
I'm on a system with some Intel Optane Persistent Memory DIMMs, which provide large capacities of non-volatile memory. From here on out, I'll refer to them as "AEP", short for "Apache Pass", the codename of this first generation of the DIMMs. The three most common ways of making this memory available to the operating system are:

1. Memory Mode, in which the AEP transparently serves as a large pool of volatile memory, with the DDR acting as a cache in front of it.
2. App Direct mode, in which the AEP is exposed as DAX devices (or through a DAX-aware filesystem) that applications access explicitly.
3. As system RAM, in which the AEP is onlined as one or more ordinary NUMA nodes.
For my research, I'm particularly interested in the third option, since most of my tools assume that memory tiers will be accessible as NUMA nodes, and onlining a set of AEP DIMMs as a NUMA node would allow me to use those tools without modification.
In general, these are the steps to go into NUMA mode. First, create an App Direct region on a socket (here, socket 0) with ipmctl:
ipmctl create -goal -socket 0 PersistentMemoryType=AppDirect
Reboot so that the goal takes effect; creating the region is all that you need from ipmctl. However, you need ndctl and daxctl to bind this region to the appropriate kernel driver.
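Before moving on, it's worth sanity-checking that the region was actually created after the reboot. ipmctl's show -region subcommand should list an AppDirect region for socket 0:

ipmctl show -region

Next, build and install ndctl and daxctl from source: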
git clone https://github.com/pmem/ndctl.git
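A minimal build-and-install sketch, assuming the autotools flow that ndctl releases of this era (such as v66) use; the configure flags mirror the ones in the project's README, and your distribution's libdir may differ:

cd ndctl
./autogen.sh
./configure CFLAGS='-g -O2' --prefix=/usr --sysconfdir=/etc --libdir=/usr/lib64
make
sudo make install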
Once they're installed, run daxctl migrate-device-model, which sets up the modprobe configuration for the newer DAX device model; this should create /etc/modprobe.d/daxctl.conf. If it doesn't, then you didn't globally install daxctl to your system (it hardcodes some paths, so running this command from the source directory doesn't work). Reboot again.
Next, create a namespace on the region that you just created with ipmctl, using ndctl. List the available regions, then create a devdax-mode namespace on one of them (here, a single region):
ndctl list -R
ndctl create-namespace --region region0 -m dax
Make sure that the namespace isn't using the deprecated dax_pmem_compat driver, and that the kmem module is available, by running:

lsmod | grep "dax_pmem_compat"
lsmod | grep kmem

The first command should come back empty; if it doesn't, revisit the daxctl migrate-device-model step above.
The next step is where I was confused. Initially, my version of the daxctl command didn't include any commands for this part: that is, from here on out, you had to manually bind the namespace's device to the proper kernel driver. To do this manually (again, for a single memory region):
First, find the name of the DAX device; these are named dax0.0, dax1.0, etc. This command will list the namespaces, and within each should be the device that it's associated with:

ndctl list -N
Next, unbind the device from the device_dax driver:

echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

Then bind it to the kmem driver:

echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
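As a quick sanity check (this is plain sysfs behavior, nothing AEP-specific), the device should now appear under the kmem driver's sysfs directory:

ls /sys/bus/dax/drivers/kmem

If the bind worked, dax0.0 shows up in that listing.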
Once that's done, check numastat -m to see if you've got a newly-created NUMA node. If it shows the capacity that you expect, you're done.
However, I once encountered a situation in which this wasn't the case. Upon checking numastat, the node had a size of 0 bytes. Internet searches didn't give any hints as to why. As it turns out, getting the newest version of daxctl (which includes the online-memory, offline-memory, and reconfigure-device commands) fixed my issue, but with a caveat.
Using a new enough daxctl, the last three steps can be replaced by a simpler:
daxctl reconfigure-device --mode=system-ram all
This will unbind the device from the old driver and rebind it to the new driver. Crucially, though, it also onlines the memory regions, which is what I was missing when I encountered the 0-size NUMA node issue. Using this new method, the node now shows up with the appropriate capacity.
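To confirm that the reconfiguration took effect, daxctl list (available in the same versions that ship reconfigure-device) prints each device's mode, which should now read system-ram rather than devdax:

daxctl list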
On my system, I have two NUMA nodes of DDR, 0 and 1. Each of these is 96GB. Node 2 is the AEP on socket 0, while node 3 is the AEP on socket 1. So: to bind an application to the memory on socket 1 (while preferring DDR and spilling onto AEP), I do:
numactl --preferred=1 numactl --membind=1,3 --cpunodebind=1 ./a.out
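To see this node layout for yourself, numactl --hardware prints each node with its size and CPU list; the AEP nodes show up with memory but no CPUs:

numactl --hardware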
For applications that use less than approximately 200GB of RAM, this works just fine. However, upon scaling them to use more than 200GB, I start encountering issues: the OOM killer suddenly kills my process, and if I set vm.overcommit_memory to 2 (thus forcing malloc to return NULL rather than allocating more memory than is available), the applications fail to allocate memory above ~200GB.
Searching for this issue doesn't return many results, either, and I can't seem to figure out why those allocations are failing. If I ignore the DDR node, binding only to the AEP, all allocations succeed, and I can scale my application to use a peak RSS of more than 700GB. However, immediately upon preferring the DDR and spilling onto one of the AEP nodes, the kernel OOMs upon reaching around 200GB, despite there being nearly 600GB of free memory available on node 3.
Nearly giving up, I finally check each of the memory regions that make up the AEP NUMA nodes. For node 3, checking the first gigabyte of memory looks like:
cat /sys/bus/node/devices/node3/memory1000/state
The value was online. However, upon checking valid_zones in the same directory, it seems that the memory is in ZONE_MOVABLE, not ZONE_NORMAL.
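Rather than checking one gigabyte at a time, a small loop over the node's sysfs entries shows the zone of every block at once (a sketch; node3 matches my system's layout, so adjust as needed):

for zones in /sys/bus/node/devices/node3/memory*/valid_zones; do
    echo "${zones}: $(cat ${zones})"
done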
Reading up on this explains why I could manually bind to that specific node and use its full capacity, but not fault memory onto it. Since this memory was onlined as ZONE_MOVABLE, the most that the kernel can access is the minimum of what the other nodes have: that is, the more-than-700GB node of AEP can only have ~96GB faulted onto it, so I get an OOM when I allocate beyond roughly 200GB of memory (96GB of DDR, plus 96GB of AEP which I fault onto). This also explains why binding directly to that node succeeds: userspace applications can still use the full capacity of the node just fine, and numastat -m shows the full amount.
Searching further, I find out why the nodes were ZONE_MOVABLE: the daxctl command onlines them that way, by doing e.g.

echo online_movable > /sys/bus/node/devices/node3/memory1000/state

to each of the memoryXXXX directories for a particular node. While this is fine for an application that binds directly to that node, a subtle issue is that the node cannot be fully faulted onto if it has more capacity than the minimum of all of your other NUMA nodes (as will usually be the case for AEP).
The first and simplest solution is to modify daxctl. I chose this one, as it was the quickest to implement. For release version v66, I edited line 1095 of daxctl/lib/libdaxctl.c, changing online_movable to online, then recompiled, rebooted, and re-onlined the memory with this new version.
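If you'd rather not rely on the line number (it shifts between versions), the same one-word change can be made with sed; this assumes online_movable appears only in that one place in the file, so it's worth confirming with grep first:

grep -n online_movable daxctl/lib/libdaxctl.c
sed -i 's/online_movable/online/' daxctl/lib/libdaxctl.c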
For those who want to use the manual method, there are two more solutions. First, you can simply write a script that manually onlines each of the gigabytes of memory in your AEP regions: it would essentially iterate over each of the memoryXXXX directories in /sys/bus/node/devices/nodeX/ and echo online > state for each one, as sketched below. This would most likely be the simplest solution if you try the manual method above but end up with a 0-size NUMA node (and don't want to use daxctl).
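A minimal sketch of such a script, assuming the node number is passed as its only argument (it skips blocks that are already online, since re-onlining those just produces an error):

#!/bin/bash
# Online every offline memory block of the given NUMA node.
node=$1
for state in /sys/bus/node/devices/node${node}/memory*/state; do
    if [ "$(cat ${state})" = "offline" ]; then
        echo online > "${state}"
    fi
done

Run it as root once per AEP node, e.g. with 3 as the argument for my node 3.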
The solution that would be easier in the long term, though, is to enable the kernel configuration option CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, which automatically onlines newly-hotplugged memory. The memory would then be immediately onlined into ZONE_NORMAL upon being bound to the kmem driver.
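To check whether your running kernel already has this option enabled (assuming your distribution ships its kernel config under /boot, as most do):

grep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE /boot/config-$(uname -r)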