Use a principled, and consistent, implementation of freelists. #132
Replies: 15 comments 2 replies
-
FWIW, I had tried similar idea but I couldn't optimize it enough. |
Beta Was this translation helpful? Give feedback.
-
Could this be done as a feature of pymalloc? pymalloc already has a number of the features you're looking for, particularly mapping from size to size-class-specific metadata. |
Beta Was this translation helpful? Give feedback.
-
@methane how did it compare? As long as it was no slower, it would be worth it just for the reduction in code size. @kmod, it is a bit arbitrary whether the free lists go inside the allocator or on top of it. |
Beta Was this translation helpful? Give feedback.
-
Cool idea! A quick look at Include/internal/pycore_interp.h shows the following freelists:
So all of these would change to use the new mechanism, right? What is the likelihood that any of these share a size? Other questions:
|
Beta Was this translation helpful? Give feedback.
-
I updated my legacy global-freelist with current main branch. |
Beta Was this translation helpful? Give feedback.
-
Inspired from Sam's nogil effort, I started to play with https://github.com/microsoft/mimalloc as replacement for libc's malloc and pymalloc. It comes with builtin freelist support and freelist sharding. If we move to to mimalloc, then we get lock-free freelists for free (three times free for free)! Pablo ran benchmarks for me:
The first case is slightly slower than vanilla Python. It is a surprisingly good result for a first attempt without any tuning. |
Beta Was this translation helpful? Give feedback.
-
Is not it already done at low level in obmalloc? |
Beta Was this translation helpful? Give feedback.
-
Yes. obmalloc has "freelist" in pool. It is very similar to "freelist" in mimalloc.
So "mimalloc has freelist" is not enough reason to say "Python don't need freelist before malloc/free". |
Beta Was this translation helpful? Give feedback.
-
Prototype version of a more principled approach to free lists shows a 1%speed up |
Beta Was this translation helpful? Give feedback.
-
Extending this beyond ints and floats is a pain. Adopting an allocator, such as |
Beta Was this translation helpful? Give feedback.
-
How much impact that replace pymalloc with mimalloc on the binary size? |
Beta Was this translation helpful? Give feedback.
-
Following up on my earlier comment that "Extending this beyond ints and floats is a pain"... The block to using freelists universally is the complication of handling Python objects instead of chunks of memory. We need a free function that takes the size of the object as well as the object. This is more error-prone than a free function that just takes the object. |
Beta Was this translation helpful? Give feedback.
-
Answering Eric's questions in #132 (comment):
Speed. We can inline allocation and de-allocation, and for fixed sized objects that means we can access the freelist with a single memory read.
Experimentation, or statistics. The current sizes have not be finely tuned, or tuned at all AFAICT.
By allocating chunks of memory instead of Python objects, all types should use freelists.
Different types sharing freelists should improve performance as it will lead to fewer allocations and better locality.
If we allocate memory, not objects, then all we need to do is free the memory in the freelists.
All classes should use free lists implicitly. |
Beta Was this translation helpful? Give feedback.
-
We should probably re-assess whether we're going with mimalloc -- if we are, we probably shouldn't put separate effort in this (except getting rid of existing one-off freelists). The work on this seems stalled despite good results (maybe because @tiran has been missing lately). We should also try to guess what the future of the GIL is. If it's going away (nogil, which may or may not happen), we need thread-safe freelists (which mimalloc provides, but which a simple custom scheme doesn't). If we're going to have GIL-per-interpreter (which seems likely) we'll at least have to have a separate set of freelists per interpreter. |
Beta Was this translation helpful? Give feedback.
-
Whether or not we use mimalloc is largely irrelevant. We should still improve the freelists. With better freelists, the allocation performance of the underlying allocator shouldn't matter much. As for the future of the GIL, who knows? But let's not make 3.12 slower based on maybes.
We already do. https://github.com/python/cpython/blob/main/Include/internal/pycore_interp.h#L167 |
Beta Was this translation helpful? Give feedback.
-
We currently have many ad-hoc per class free-lists.
We should consider changing freelists from per-class to per-size.
My informal experiments show that currently about 50% of allocations are served from free lists, whereas with per-size freelists, this increased to about 90%.
With tuning, we should expect a hit rate well over 90%.
Using free lists for blocks of memory, instead of objects, means that there are no interactions between the free lists and the cycle GC, simplifying both. There is a small additional cost of re-assigning the class, but the cost of this will be negligible in most cases.
Design principles
With the above, allocation becomes:
The functions mentioned above must be small and pure so that:
In practice, we will probably inline most of the above function calls, it just helps to think of the parts separately when designing.
Implementation of the freelist.
One possible design of the free lists is:
Each free list consists of three elements.
For alignment reasons the integers should be half the size of a pointer, which limits the capacity to 64k on a 32 bit machine, which is plenty.
The actual list is implemented as a linked list through the blocks of memory themselves, treating the first word of the memory as a pointer to the next entry.
NULL
and the remaining space to zero, no allocations will be from the free list. Should tracemalloc be turned off, the remaining space can be set back to the capacity.freelist->remaining == 0
andfreelist->head == NULL
, respectively.new_entry->next = freelist->head; freelist->head = new_entry; freelist->remaining--;
result = freelist->head; freelist->head = result->next; freelist->remaining++;
By putting the free lists in an array, the mapping from size class to free list is very simple.
Behavior when free list is full or empty; interaction with the malloc implementation.
This simplest option is to call malloc when the list is empty and to call free when the list is full.
Unfortunately this leads to fragmentation and performs poorly when repeatedly allocating or deallocating.
To avoid fragmentation, we should clear the free list when it is full and we need to free another object.
To avoid trashing on alternating malloc/frees we should half-fill the list when it is empty and we need to allocate an object.
Doing bulk free/mallocs reduces the number of free list misses considerably and keeps fragmentation down.
Ideally the malloc implementation would have functions for freeing and allocating multiple objects of the same size, which they generally don't. We could add that feature to
PyMalloc
.Beta Was this translation helpful? Give feedback.
All reactions