
iobuf: update the allocator schedule #24475

Merged

Conversation

travisdowns
Member

iobuf uses a growing schedule of allocations (growth factor 1.5) for each fragment. This schedule was not aware of the seastar allocation sizes, so it would request, for example, a 66K allocation, which seastar internally rounds up to 128K; for that allocation we therefore waste ~50% of the memory, and the waste is invisible to iobuf.

In this change we make the schedule seastar-allocator aware: we keep the same ~1.5x growth factor but round each step up to the next seastar allocator boundary. At 16K or below, these boundaries are log-linear: every 2^n size, plus 3 evenly spread sizes between consecutive powers of 2. Above 16K, sizes are powers of 2.

This change slightly reduces the total number of steps needed to reach the max size of 128K, since from 32K to 128K we now use doubling steps instead of 1.5x. For large iobufs this results in 1 fewer fragment in total, which is why the expected counts in some tests that assert the exact number of fragments needed to be decreased by 1.

Fixes CORE-8478.
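(For illustration only: a minimal standalone sketch of the boundary rule described above. The function name, the cap, and the rounding helper are assumptions for this sketch, not the actual iobuf implementation.)

// Sketch of the seastar allocator boundaries used by the new schedule:
// below 16K, boundaries are log-linear (every power of two plus 3 evenly
// spaced sizes in between); at 16K and above they are powers of two,
// capped here at the 128K max fragment size.
#include <cstddef>

constexpr std::size_t max_frag = 128 * 1024;

constexpr std::size_t round_up_to_allocator_boundary(std::size_t want) {
    if (want >= 16 * 1024) {
        std::size_t p = 16 * 1024;
        while (p < want && p < max_frag) {
            p *= 2;
        }
        return p;
    }
    // find the power of two at or below `want`...
    std::size_t base = 1;
    while (base * 2 <= want) {
        base *= 2;
    }
    // ...then step up in quarters of that power of two
    std::size_t step = base / 4 ? base / 4 : 1;
    std::size_t boundary = base;
    while (boundary < want) {
        boundary += step;
    }
    return boundary;
}

// e.g. a ~1.5x step that lands on 66K snaps to 128K, while 12K is already a
// small-pool boundary and stays 12K
static_assert(round_up_to_allocator_boundary(66 * 1024) == 128 * 1024);
static_assert(round_up_to_allocator_boundary(12 * 1024) == 12 * 1024);

The patch itself uses a fixed table of sizes (as seen in the snippets discussed below); the sketch above only shows where boundary values such as 12288 or 131072 come from.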

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

12288,
16384,
32768,
65536,
131072});
Member Author

The table is similar to the old one but "zero waste". At <= 16K, the differences are very small, i.e., the entries can be matched up 1:1 with only slight differences to avoid waste.

Above 16K the new table doubles, since the allocator doubles, so sizes like 90K are just waste compared to 128K (i.e., the underlying alloc in both cases is 128K).

Member

I guess we could cut the waste even further by only using small-pool sizes that divide pages cleanly, but it's probably not worth it as the wastage percentage goes to zero once you seriously start using that size.

Member Author

@StephanDollberg that's only 2^n sizes, I think? Small pool always allocates spans, spans are always 2^n, and only powers of 2 can exactly divide other powers of 2 (because the prime factorization of a power of 2 contains only 2s, and so must the factorization of anything that divides it evenly).

So we couldn't do that and preserve 1.5x growth factor.

Member Author

The waste for non-power-of-two sizes can still be very small, however, and it does vary across sizes.
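(To make "varies across sizes" concrete: a rough, illustrative estimate. The 64K span size and the simple "unused tail" model below are assumptions for the sketch, not seastar's actual small-pool layout.)

#include <cstdio>

int main() {
    const unsigned span = 64 * 1024; // assumed span size, for illustration
    const unsigned sizes[] = {10240, 12288, 14336, 24576, 49152};
    for (unsigned s : sizes) {
        // count objects that fit whole in the span; the remainder is tail waste
        unsigned used = (span / s) * s;
        double waste_pct = 100.0 * (span - used) / span;
        std::printf("object size %6u: %4.1f%% of a %uK span unused\n",
                    s, waste_pct, span / 1024);
    }
    return 0;
}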

Member

that's only 2^n sizes

Yeah, don't think it's worth addressing as the wastage should be negligible.

16384,
32768,
65536,
131072,
131072,
Member Author

This test just checks that passing the size at position N in the table results in the size at position N+1 when calling next alloc size. So the first table is a copy/paste from the impl, and the second table is the same as the first with its first entry deleted (shifted up by 1).

I also include an additional row in the table testing the case that 128K maps to 128K unchanged.
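A minimal sketch of that shape (the size values are taken from the snippets above; next_alloc_size and the loop are placeholders rather than the actual test code):

#include <array>
#include <cstdint>

static constexpr auto in = std::to_array<uint32_t>(
  {12288, 16384, 32768, 65536, 131072});
// the same table shifted up by one, plus the final 128K -> 128K row
static constexpr auto out = std::to_array<uint32_t>(
  {16384, 32768, 65536, 131072, 131072});

// for (size_t i = 0; i < in.size(); ++i) {
//     BOOST_REQUIRE_EQUAL(next_alloc_size(in[i]), out[i]);
// }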

@@ -379,38 +379,36 @@ SEASTAR_THREAD_TEST_CASE(iobuf_as_ostream) {
}

SEASTAR_THREAD_TEST_CASE(alloctor_forward_progress) {
-static constexpr std::array<uint32_t, 14> src = {{
+static constexpr auto src = std::to_array<uint32_t>({
Member Author

No counting needed in C++20.
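For reference, the difference in a tiny standalone example (the values here are placeholders; only the declaration style matters):

#include <array>
#include <cstdint>

// Before: the element count (14 in the real test) had to be written out and
// kept in sync with the initializer by hand.
static constexpr std::array<uint32_t, 3> a = {{512, 768, 1024}};

// After (C++20): std::to_array deduces the count from the initializer.
static constexpr auto b = std::to_array<uint32_t>({512, 768, 1024});

static_assert(a.size() == b.size());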

@@ -429,7 +427,7 @@ SEASTAR_THREAD_TEST_CASE(test_next_chunk_allocation_append_temp_buf) {
}
// slow but tha'ts life.
auto distance = std::distance(buf.begin(), buf.end());
-BOOST_REQUIRE_EQUAL(distance, 324);
+BOOST_REQUIRE_EQUAL(distance, 323);
Member Author

For very large iobufs the new schedule results in ~1 fewer fragment since it grows slightly faster in the 32K -> 128K range, and these tests assert the exact fragment count.

travisdowns force-pushed the td-better-iobuf-alloc-schedule branch from 9c0c302 to 7deb38b on December 6, 2024 15:48

@vbotbuildovich
Collaborator

the below tests from https://buildkite.com/redpanda/redpanda/builds/59386#01939ca9-8d6e-41b5-8e82-6f1c9027fafd have failed and will be retried

storage_e2e_single_thread_rpunit


travisdowns merged commit 8d90667 into redpanda-data:dev on Dec 9, 2024 (16 checks passed)