PUMA is a smart memory allocator that manages data in a NUMA-aware way. It exposes an interface to execute a kernel on the data in parallel, automatically ensuring that each core which runs the kernel accesses primarily local memory. It also provides an optional (primitive) time-based load balancer which can adapt workloads to cases where some cores may be less powerful or have more to do per kernel invocation than others.
PUMA consists of several parts:
- A NUMA-aware dynamic memory allocator for homogeneous elements;
- An allocator for thread-local static data which cannot be freed individually. This allocator uses pools located on the NUMA domain to which the core associated with the thread belongs;
- A parallel iteration interface which applies a kernel to all elements in a PUMA set;
- A balancer which changes the thread with which each block of data is associated in order to even out kernel runtime across cores.
Compilation of PUMA requires a simple make invocation in the PUMA root directory. The make targets are as follows:
- all: Build PUMA and docs and run unit tests
- doc: Build documentation with doxygen
- no_test: Build PUMA without running unit tests
- clean: Clear the working tree
- docs_clean: Clear all built documentation
PUMA relies on the following:
- libNUMA
- A C99-compatible compiler
- Valgrind (optional)
- OpenMP (optional)
- Doxygen (optional, for building documentation)
The following are configuration options for public use. For options which are either enabled or disabled, 1 enables and 0 disables.
- PUMA_NODEPAGES: The number of pages to allocate per chunk in the per-thread chunk list. Default: 2
- OPENMP: Enable OpenMP. If disabled, we use PUMA's pthread-based thread pooling solution (experimental). Default: enabled
- STATIC_THREADPOOL: If enabled and we are not using OpenMP, we share one thread pool amongst all instances of struct pumaSet. Default: disabled
- BINDIR: Where we place the built shared library. Default: {pumadir}/bin
- VALGRIND: Whether we build with Valgrind support. Default: enabled
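Assuming these options are passed as variables on the make command line (the usual pattern for Makefile-based builds; the project Makefile is authoritative), a build invocation might look like this:

make no_test OPENMP=1 STATIC_THREADPOOL=0 PUMA_NODEPAGES=4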
The following are configuration options for use during PUMA development. They may severely hurt performance and should never be enabled for performance-critical builds.
- DEBUG: Enable assertions. Default: disabled
The following features are planned for future development:
- MPI support
- Further language bindings (Python, Fortran, C++)
- Windows support
- Better userspace NUMA profiling tooling (could build on Valgrind or LLDB)
struct pumaSet* createPumaSet(size_t elementSize, size_t numThreads, char* threadAffinity);

Creates a new struct pumaSet.

Arguments:

- elementSize: Size of each element in the set.
- numThreads: The number of threads we want to run the pumaSet on.
- threadAffinity: An affinity string specifying the CPUs to which to bind threads. Can contain numbers separated either by commas or dashes. “i-j” means bind to every CPU from i to j inclusive. “i,j” means bind to i and j. Formats can be mixed: for example, “0-3, 6, 10, 12, 15, 13” is valid. If NULL, binds each thread to the CPU whose number matches the thread (tid 0 == cpu 0 :: tid 1 == cpu 1 :: etc.). If non-NULL, must specify at least as many CPUs as there are threads.
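For example, a set of doubles managed by four threads pinned to CPUs 0-3 could be created as follows (a minimal sketch; error handling omitted):

#include "puma.h"

int main(void)
{
    /* Four threads, pinned to CPUs 0-3 via the affinity string. */
    struct pumaSet* set = createPumaSet(sizeof(double), 4, "0-3");

    /* ... populate the set and run kernels (see below) ... */

    destroyPumaSet(set);
    return 0;
}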
void destroyPumaSet(struct pumaSet* set);

Destroys and frees memory from the struct pumaSet.

size_t getNumElements(struct pumaSet* set);

Returns the total number of elements in the struct pumaSet.
typedef size_t (splitterFunc)(void* perElemBalData, size_t numThreads,
        void* extraData);

Signature for a function which, given an element, the total number of threads and, optionally, a void pointer, specifies the thread with which to associate the element.

Arguments:

- perElemBalData: Per-element data passed into pumallocManualBalancing() which enables the splitter to choose the placement of the associated element.
- numThreads: The total number of threads in use.
- extraData: Optional extra data, set by calling pumaSetBalancer().
void pumaSetBalancer(struct pumaSet* set, bool autoBalance, splitterFunc* splitter, void* splitterExtraData);

Sets the balancing strategy for a struct pumaSet.

Arguments:

- set: The set for which to set the balancing strategy.
- autoBalance: Whether to automatically balance the set across threads prior to each kernel run.
- splitter: A pointer to a function which determines the thread with which to associate new data when pumallocManualBalancing() is called.
- splitterExtraData: A void pointer passed to the splitter function each time it is called.
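As an illustration, a hypothetical splitter (not part of the library) which interprets the per-element balancing data as an index and distributes elements round-robin might look like this:

/* Hypothetical splitter: treat balData as a pointer to an element index
 * and assign elements to threads cyclically. */
static size_t roundRobinSplitter(void* perElemBalData, size_t numThreads,
        void* extraData)
{
    size_t index = *(size_t*)perElemBalData;
    return index % numThreads;
}

/* Register it: no automatic balancing, no extra splitter data. */
pumaSetBalancer(set, false, &roundRobinSplitter, NULL);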
void* pumalloc(struct pumaSet* set);

Adds an element to the struct pumaSet and returns a pointer to it. The new element is associated with the CPU on which the current thread is running.

void* pumallocManualBalancing(struct pumaSet* set, void* balData);

Adds an element to the struct pumaSet and returns a pointer to it. Passes balData to the set's splitter function to determine the CPU with which to associate the new element.

void* pumallocAutoBalancing(struct pumaSet* set);

Adds an element to the struct pumaSet and returns a pointer to it. Automatically associates the new element with the CPU with the fewest elements.

void pufree(void* element);

Frees the specified element from its set.
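Continuing the sketch above (assuming set exists and the hypothetical roundRobinSplitter has been registered), elements can then be placed via the splitter:

/* Allocate 100 doubles, letting the splitter choose each placement. */
for(size_t i = 0; i < 100; ++i)
{
    double* elem = (double*)pumallocManualBalancing(set, &i);
    *elem = (double)i;
}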
struct pumaExtraKernelData
{
    void* (*extraDataConstructor)(void* constructorData);
    void* constructorData;
    void (*extraDataDestructor)(void* data);
    void (*extraDataThreadReduce)(void* data);
    void (*extraDataReduce)(void* retValue, void* data[],
            unsigned int nThreads);
    void* retValue;
};

A descriptor of functions which handle extra data for kernels to pass into runKernel().

Members:

- extraDataConstructor: A per-thread constructor for extra data which is passed into the kernel.
- constructorData: A pointer to any extra data which may be required by the constructor. May be NULL.
- extraDataDestructor: A destructor for data created with extraDataConstructor().
- extraDataThreadReduce: A finalisation function which is run after the kernel on a per-thread basis. Takes the per-thread data as an argument.
- extraDataReduce: A global finalisation function which is run after all threads have finished running the kernel. Takes retValue, an array of the extra data for all threads and the number of threads in use.
- retValue: A pointer to a return value for use by extraDataReduce. May be NULL.
void initKernelData(struct pumaExtraKernelData* kernelData,
        void* (*extraDataConstructor)(void* constructorData),
        void* constructorData,
        void (*extraDataDestructor)(void* data),
        void (*extraDataThreadReduce)(void* data),
        void (*extraDataReduce)(void* retValue, void* data[],
                unsigned int nThreads),
        void* retValue);

Initialises kernelData. Any or all of the arguments after kernelData may be NULL. Any NULL functions are set to dummy functions which do nothing.

extern struct pumaExtraKernelData emptyKernelData;

A dummy descriptor for extra kernel data. Causes NULL to be passed to the kernel in place of extra data.
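For kernels which require no extra data, pass emptyKernelData directly; for example (assuming a kernel myKernel and a populated set):

runKernel(set, myKernel, &emptyKernelData);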
typedef void (*pumaKernel)(void* element, void* extraData);

The type signature for kernels which are to be run over a PUMA set.

Arguments:

- element: The current element in our iteration.
- extraData: Extra information specified by our extra data descriptor.
void runKernel(struct pumaSet* set, pumaKernel kernel, struct pumaExtraKernelData* extraDataDetails);

Applies the given kernel to all elements in a struct pumaSet.

Arguments:

- set: The set containing the elements to which we want to apply our kernel.
- kernel: A pointer to the kernel to apply.
- extraDataDetails: A pointer to the structure specifying the extra data to be passed into the kernel.
void runKernelList(struct pumaSet* set, pumaKernel kernels[],
        size_t numKernels, struct pumaExtraKernelData* extraDataDetails);

Applies the given kernels to all elements in a struct pumaSet. Kernels are applied in the order in which they are specified in the array.

Arguments:

- set: The set containing the elements to which we want to apply our kernels.
- kernels: An array of kernels to apply.
- numKernels: The number of kernels to apply.
- extraDataDetails: A pointer to the structure specifying the extra data to be passed into the kernels.
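As a sketch, two hypothetical kernels (not part of the library) can be applied in order to each element of a populated set:

/* Hypothetical kernels: double each element, then shift it by one. */
static void scaleKernel(void* element, void* extraData)
{
    *(double*)element *= 2.0;
}

static void shiftKernel(void* element, void* extraData)
{
    *(double*)element += 1.0;
}

pumaKernel kernels[] = {&scaleKernel, &shiftKernel};
runKernelList(set, kernels, 2, &emptyKernelData);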
void* pumallocStaticLocal(size_t size);

Allocates thread-local storage which resides on the NUMA domain to which the CPU executing the function belongs.

Arguments:

- size: The number of bytes we want to allocate.

void* pumaDeleteStaticData(void);

Deletes all static data associated with the current thread.
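For instance, a worker thread might allocate a NUMA-local scratch buffer once and reuse it for the lifetime of the program (a sketch; the buffer size is arbitrary):

/* Hypothetical per-thread scratch buffer of 256 doubles, resident on the
 * local NUMA domain. It is freed only when pumaDeleteStaticData() is
 * called from this thread. */
double* scratch = (double*)pumallocStaticLocal(256 * sizeof(double));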
In lieu of the traditional “Hello World” introductory program, we present a PUMA-based program which generates a large set of random numbers between 0 and 1 and uses the reduction mechanism of PUMA to calculate the set’s standard deviation.
In order to calculate the standard deviation, we require three things: a kernel, a constructor for the per-thread data and a reduction function. In the constructor, we use the pumallocStaticLocal() function to allocate a static variable on a per-thread basis which resides in memory local to the core to which each thread is pinned.
This interface for allocating thread-local data is only intended to be used for static data whose lifespan extends to the end of the program. It is possible to delete all static data related to a thread, but it is more sensible to simply reuse the allocated memory each time we need similarly-sized data on a thread. This requires the use of pthread keys in order to retrieve the allocated pointer each time it is needed.
// puma.h contains all of the puma public API declarations we need.
#include "puma.h"

#include <math.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <getopt.h>

pthread_key_t extraDataKey;
pthread_once_t initExtraDataOnce = PTHREAD_ONCE_INIT;

static void initialiseKey(void)
{
    pthread_key_create(&extraDataKey, NULL);
}
struct stdDevExtraData
{
    double sum;
    double squareSum;
    size_t numElements;
};

static void* extraDataConstructor(void* constructorData)
{
    (void)pthread_once(&initExtraDataOnce, &initialiseKey);

    void* stdDevExtraData = pthread_getspecific(extraDataKey);

    if(stdDevExtraData == NULL)
    {
        stdDevExtraData = pumallocStaticLocal(sizeof(struct stdDevExtraData));
        pthread_setspecific(extraDataKey, stdDevExtraData);
    }

    return stdDevExtraData;
}
static void extraDataReduce(void* voidRet, void* voidData[],
        unsigned int nThreads)
{
    double* ret = (double*)voidRet;
    double sum = 0;
    double squareSum = 0;
    size_t numElements = 0;

    for(unsigned int i = 0; i < nThreads; ++i)
    {
        struct stdDevExtraData* data = (struct stdDevExtraData*)voidData[i];
        numElements += data->numElements;
        sum += data->sum;
        squareSum += data->squareSum;
    }

    double mean = sum / numElements;

    // The standard deviation is the square root of the variance.
    *ret = sqrt(squareSum / numElements - mean * mean);
}
static void stdDevKernel(void* voidNum, void* voidData)
{
    double num = *(double*)voidNum;
    struct stdDevExtraData* data = (struct stdDevExtraData*)voidData;
    data->sum += num;
    data->squareSum += num * num;
    ++data->numElements;
}

static void staticDestructor(void* arg)
{
    (void)arg;
    pumaDeleteStaticData();
}
Prior to running the kernel, we must actually create the struct pumaSet which contains our data; to do this, we specify the size of our elements, the number of threads we wish to use and, optionally, a string detailing which cores we want to pin threads to. We must also seed the random number generator and read the arguments:
static void printHelp(char* invocationName)
{
    printf("Usage: %s -t numThreads -e numElements [-a affinityString]\n"
            "\tnumThreads: The number of threads to use\n"
            "\tnumElements: The number of numbers to allocate\n"
            "\taffinityString: A string which specifies which cores to run on.\n",
            invocationName);
}

int main(int argc, char** argv)
{
    int numThreads = 1;
    int numElements = 1000;
    char* affinityStr = NULL;

    /*
     * Get command line input for the affinity string, the number of threads
     * and the number of elements.
     */
    int c;
    while((c = getopt(argc, argv, "e:a:t:h")) != -1)
    {
        switch(c)
        {
            case 't':
                numThreads = atoi(optarg);
                break;
            case 'e':
                numElements = atoi(optarg);
                break;
            case 'a':
                affinityStr = optarg;
                break;
            case 'h':
                printHelp(argv[0]);
                break;
        }
    }

    struct pumaSet* set =
            createPumaSet(sizeof(double), numThreads, affinityStr);

    srand(time(NULL));
From here, we can use the pumalloc call to allocate space within set for each number:
    for(int i = 0; i < numElements; ++i)
    {
        double* num = (double*)pumalloc(set);
        *num = (double)rand() / RAND_MAX;
    }
We then use initKernelData() to create the extra data to be passed into our kernel. From there, we call runKernel() to invoke our kernel and get the standard deviation of the set.
    struct pumaExtraKernelData kData;
    double stdDev = -1;
    initKernelData(&kData, &extraDataConstructor, NULL, NULL, NULL,
            &extraDataReduce, &stdDev);

    runKernel(set, stdDevKernel, &kData);

    printf("Our set has a standard deviation of %f\n"
            "Also, Hello World!\n", stdDev);
Finally, we clean up after ourselves by destroying our set and all our static data. The static data destructor destroys data on a per-thread basis, so we must call the destructor from all threads in our pool. To do this, we use the executeOnThreadPool() function from pumathreadpool.h.
    executeOnThreadPool(set->threadPool, staticDestructor, NULL);
    destroyPumaSet(set);

    return 0;
}
In order to compile this tutorial, use a command along the following lines:

gcc -pthread -std=c99 <file>.c -o stddev -I<PUMA inc dir> -L<PUMA bin dir> -lpuma -lm
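With the binary built, a sample invocation (flag values are arbitrary) might be:

./stddev -t 4 -e 1000000 -a "0-3"

As a sanity check, for uniformly distributed values in [0, 1] the reported standard deviation should approach 1/sqrt(12) ≈ 0.2887 as the element count grows.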
PUMA is released under the three-clause BSD licence. We chose this rather than a copyleft licence like GPL or LGPL in order to allow anyone to use PUMA with absolute freedom aside from the inclusion of a short copyright notice.