Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-126868: Add freelist for compact int objects #126865

Draft
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

eendebakpt
Copy link
Contributor

@eendebakpt eendebakpt commented Nov 15, 2024

We can add freelists for the int object to improve performance. Using the new methods from #121934 the amount of code needed for adding a freelist is quite small. We only implement the freelist for compact ints (e.g. a single digit). For multi-digit int objects adding freelists is more complex (we need a size-based freelist) and the gains are smaller (for very large int objects the allocation is not a significant part of the computation time)

To be done:

[ ] The freelist size was chosen to be 100 (equal to the freelist size of float), but perhaps this can be tuned better
[ ] Re-run pyperformance benchmarks
[ ] The long_dealloc contained special casing to avoid deallocating small ints. These are immortal now (with fixed refcount value), so we can perhaps remove that code

Some references to discussions on freelists

The freelist improves performance of int operations in microbenchmarks:

bench_long: Mean +- std dev: [main_long] 106 ns +- 5 ns -> [pr_long1c] 99.8 ns +- 4.4 ns: 1.07x faster
bench_alloc: Mean +- std dev: [main_long] 210 us +- 6 us -> [pr_long1c] 177 us +- 10 us: 1.19x faster

Benchmark hidden because not significant (1): bench_collatz

Geometric mean: 1.08x faster
Benchmark script
# Quick benchmark for cpython long objects

import pyperf


def collatz(a):
    while a > 1:
        if a % 2 == 0:
            a = a // 2
        else:
            a = 3 * a + 1


def bench_collatz(loops):
    range_it = range(loops)
    t0 = pyperf.perf_counter()
    for ii in range_it:
        collatz(ii)
    return pyperf.perf_counter() - t0


def bench_long(loops):
    range_it = range(loops)
    t0 = pyperf.perf_counter()
    x = 10
    for ii in range_it:
        x = x * x
        y = x // 2
        x = y + ii + x
        if x > 10**10:
            x = x % 1000
    return pyperf.perf_counter() - t0


def bench_alloc(loops):
    range_it = range(loops)
    t0 = pyperf.perf_counter()
    for ii in range_it:
        for kk in range(20_000):
            del kk
    return pyperf.perf_counter() - t0


# %timeit bench_long(1000)

if __name__ == "__main__":
    runner = pyperf.Runner()
    runner.bench_time_func("bench_collatz", bench_collatz)
    runner.bench_time_func("bench_long", bench_long)
    runner.bench_time_func("bench_alloc", bench_alloc)

On the pyperformance test suite (actually, a subset of the suite, not all benchmarks run on my system) shows the percentage of successfull freelist allocations increases significantly

Main:

Allocations from freelist 	2,004,971,371 	39.8%
Frees to freelist 	2,005,350,418 	
Allocations 	3,034,877,938 	60.2%
Allocations to 512 bytes 	3,008,791,812 	59.7%
Allocations to 4 kbytes 	18,648,072 	0.4%
Allocations over 4 kbytes 	7,438,054 	0.1%
Frees 	3,142,033,922

PR

Allocations from freelist 	3,058,347,887 	58.6%
Frees to freelist 	3,058,576,117 	
Allocations 	2,159,771,546 	41.4%
Allocations to 512 bytes 	2,133,373,693 	40.9%
Allocations to 4 kbytes 	18,802,328 	0.4%
Allocations over 4 kbytes 	7,595,525 	0.1%
Frees 	2,267,538,686

@eendebakpt eendebakpt changed the title Draft: Add freelist of compact int objects Draft: gh-126868: Add freelist for compact int objects Nov 15, 2024
@eendebakpt eendebakpt marked this pull request as draft November 15, 2024 12:50
@mdboom
Copy link
Contributor

mdboom commented Nov 15, 2024

I'm running this PR over pyperformance on our benchmarking hardware. It will take ~3 hours.

@mdboom
Copy link
Contributor

mdboom commented Nov 15, 2024

I'm running this PR over pyperformance on our benchmarking hardware. It will take ~3 hours.

Actually, scratch that -- I'll wait until the tests are passing here. That's required for PGO builds.

@eendebakpt
Copy link
Contributor Author

Tests are passing now, but I disabled returning objects to the freelist at a couple of places. Changing PyObject_Free to _PyLong_ExactDealloc on the lines of code marked with "needs to be converted to freelist" introduces refleaks. It seems related to #125323. @fidgetSpinner Do you perhaps have an idea what is going on?

@markshannon
Copy link
Member

What happens exactly when you change these lines:

PyStackRef_CLOSE_SPECIALIZED(left, (destructor)PyObject_Free); // needs to be converted to freelist

to

PyStackRef_CLOSE_SPECIALIZED(left, (destructor)_PyLong_ExactDealloc);

?

@eendebakpt
Copy link
Contributor Author

@markshannon When I change the lines several tests related to refleaks in the CI are failing. For example test_no_memleak, which can be reproduced from the command line with:

python -Xshowrefcount -c  "[x+x*x for x in range(1000)]"

The output should be [0 refs, 0 blocks], but instead I get [982 refs, 0 blocks] (exact numbers depending on which lines I change and the exact code executed).

Copy link
Member

@Fidget-Spinner Fidget-Spinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "leaks" are due to the freelist itself. They are a sign the freelist is working :).
Can you please convert all to _PyLong_ExactDealloc and apply the suggestion below, and tell me how many allocations this removes?

Objects/longobject.c Show resolved Hide resolved
Objects/longobject.c Outdated Show resolved Hide resolved
@Fidget-Spinner
Copy link
Member

@eendebakpt sorry I pushed to your branch as I'm really eager to get benchmark results on this :).

@Fidget-Spinner
Copy link
Member

Fidget-Spinner commented Nov 20, 2024

On my machine using the benchmark script provided above (release build, no PGO, no LTO):

bench_collatz: Mean +- std dev: [without_freelist] 9.41 us +- 0.08 us -> [with_freelist] 9.22 us +- 0.08 us: 1.02x faster
bench_long: Mean +- std dev: [without_freelist] 187 ns +- 1 ns -> [with_freelist] 164 ns +- 2 ns: 1.14x faster
bench_alloc: Mean +- std dev: [without_freelist] 411 us +- 4 us -> [with_freelist] 333 us +- 2 us: 1.24x faster

Geometric mean: 1.13x faster

_Py_SetImmortal(self);
return;
}
}
}

if (PyLong_CheckExact(self)) {
if (_PyLong_IsCompact((PyLongObject *)self)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fidget-Spinner The _PyLong_IsCompact check has already been done in this method, can we move this into the if (pylong && _PyLong_IsCompact(pylong)) part?

Also, can we remove the pylong && part? I think pylong can never be NULL. (PyLong_CheckExact assumes the pointer is not NULL I think)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that sounds good

Copy link
Contributor

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, are compact integers allowed to be signed or not? (you mentioned that we only focus on single digit numbers but the C API considers compact objects as being an implementation detail IIRC). If not, how hard would it be to make them support free lists? (I don't have much knowledge in free lists; for instance it's a mystery to me for how the free list grows)

@@ -26,6 +26,7 @@
#include "pycore_pyerrors.h" // _PyErr_GetRaisedException()
#include "pycore_pystate.h" // _PyInterpreterState_GET()
#include "pycore_range.h" // _PyRangeIterObject
#include "pycore_long.h" // void _PyLong_ExactDealloc(PyLongObject *op);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we align the // (on mobile it seems 1 char off)?

@@ -6615,7 +6642,7 @@ PyTypeObject PyLong_Type = {
0, /* tp_init */
0, /* tp_alloc */
long_new, /* tp_new */
PyObject_Free, /* tp_free */
(freefunc)PyObject_Free, /* tp_free */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we align the .tp_free comment? (maybe tabs and spaces are mixed, hence that's why I see them unaligned on monile). Also, is the cast necessary to avoid UBSan failures? If not, you can just use .tp_free = ... to emphasize the semantic

Comment on lines 3641 to 3643
* we accidentally decref small Ints out of existence. Instead,
* since small Ints are immortal, re-set the reference count.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* we accidentally decref small Ints out of existence. Instead,
* since small Ints are immortal, re-set the reference count.
*/
* we accidentally decref small Ints out of existence. Instead,
* since small Ints are immortal, re-set the reference count.
*/

@@ -6,6 +6,7 @@
#include "pycore_bitutils.h" // _Py_popcount32()
#include "pycore_initconfig.h" // _PyStatus_OK()
#include "pycore_call.h" // _PyObject_MakeTpCall
#include "pycore_freelist.h" // _Py_FREELIST_FREE(), _Py_FREELIST_POP()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment alignment.

@@ -55,6 +55,8 @@ extern void _PyLong_FiniTypes(PyInterpreterState *interp);

/* other API */

PyAPI_FUNC(void) _PyLong_ExactDealloc(PyObject *self);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need it to be exported like that or could you live with a simple extern? If not, can you explain which file needs this export?

Copy link
Member

@Fidget-Spinner Fidget-Spinner Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's needed by the JIT on Windows, as it's used in bytecodes.c. This is a pretty common pattern in the internal C API unfortunately

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For semantic purposes, would it be better to have a macro named differently? (it would be a simple alias but it could help semantics and reviewers)

@@ -42,7 +43,7 @@ static inline void
_Py_DECREF_INT(PyLongObject *op)
{
assert(PyLong_CheckExact(op));
_Py_DECREF_SPECIALIZED((PyObject *)op, (destructor)PyObject_Free);
_Py_DECREF_SPECIALIZED((PyObject *)op, (destructor) _PyLong_ExactDealloc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_Py_DECREF_SPECIALIZED((PyObject *)op, (destructor) _PyLong_ExactDealloc);
_Py_DECREF_SPECIALIZED((PyObject *)op, (destructor)_PyLong_ExactDealloc);

I'm also not sure whether you need the cast. If you need the cast, I'd suggest changing the signature of the drstructor itself.

@eendebakpt eendebakpt changed the title Draft: gh-126868: Add freelist for compact int objects gh-126868: Add freelist for compact int objects Nov 21, 2024
@eendebakpt eendebakpt marked this pull request as ready for review November 21, 2024 11:39
@eendebakpt
Copy link
Contributor Author

Out of curiosity, are compact integers allowed to be signed or not? (you mentioned that we only focus on single digit numbers but the C API considers compact objects as being an implementation detail IIRC). If not, how hard would it be to make them support free lists? (I don't have much knowledge in free lists; for instance it's a mystery to me for how the free list grows)

The Python compact integers include both signed and unsigned integers. The concept is indeed an implementation detail, from the Python sides nothing should be visible. For the freelist implementation in this PR it is convenient that the compacts ints all have the same size and we have fast methods like _PyLong_IsCompact available for testing.

Including all ints in the freelist scheme would require a size based freelist. This is an interesting idea, see for example #101453 or faster-cpython/ideas#612. With that approach freelists between different objects of the same size could be shared.

_PyLong_ExactDealloc(PyObject *self)
{
if (_PyLong_IsCompact((PyLongObject *)self)) {
_Py_FREELIST_FREE(ints, self, PyObject_Free);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised that we didnt check for if it's a smallint cached value here. It should be rejected there

Copy link
Member

@Fidget-Spinner Fidget-Spinner Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind. It's not needed because (theoretically) the refcount of a smallint never hits 0 due to it being immortal, so this code will never be called on it. Can you add an assert to be safe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also means my comment about the small ints was wrong. Could you remove that please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment has been adapted. If the refcount can never become zero, then the code in long_dealloc checking for small ints can be removed (perhaps also with an assert). Since this is something unrelated to the freelists I will make a separate PR for that and rebase this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fidget-Spinner I think your first remark was right after all and we do need to check for smallints (at least in the 32-bit stable ABI build). I will add the check once there is consensus on #127120 (which might very well be the status quo)


// We use _Py_FREELIST_POP_MEM instead of _Py_FREELIST_POP because the new
// reference is created in _PyObject_Init
PyLongObject *v = (PyLongObject *)_Py_FREELIST_POP_MEM(ints);
Copy link
Member

@markshannon markshannon Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you go back to _Py_FREELIST_POP and avoid reinitializing the object on line 237. We don't need to re-assign the ob_type field in objects taken from the freelist. Something like:

    PyLongObject *v = (PyLongObject *)_Py_FREELIST_POP(ints);
    if (v == NULL) {
        v = PyObject_Malloc(sizeof(PyLongObject));
        if (v == NULL) {
            PyErr_NoMemory();
            return NULL;
        }
        _PyObject_Init((PyObject*)v, &PyLong_Type);
    }
    digit abs_x = x < 0 ? -x : x;
    _PyLong_SetSignAndDigitCount(v, x<0?-1:1, 1);
    v->long_value.ob_digit[0] = abs_x;
    return (PyObject*)v;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants