pageserver: reorder upload queue when possible #10218
Conversation
7330 tests run: 6959 passed, 0 failed, 371 skipped (full report).
Flaky tests (2): Postgres 17, Postgres 14.
Code coverage: collected from Rust tests only (full report).
This comment is automatically updated with the latest test results.
11a217e at 2025-01-14T14:45:52.163Z
Added a benchmark measuring this. It isn't clear that we'll hit this number of in-progress operations under realistic conditions, so this is probably fine.
Went over it a few times. Looks correct and the tests are convincing. Perhaps worth a second pair of eyes though.
@skyzh Want to give this another look-over?
Overall LGTM.
The patch is still bottlenecked by two things: flushing the deletion queue (as mentioned in #10283) and reordering index part uploads. These could be addressed in future patches.
The patch allows uploads to skip the queue if they are not yet in index_part, which means that we can upload more aggressively, and that we could end up with files in remote storage that are not referenced in index_part. I'll update the storage scrubber check at some point.
Moving forward, we can of course modify the bypass algorithm to allow more parallelism (i.e., batch index part uploads), but I wonder if things could get easier with the following ideas:
- Having a unique ID for each file (i.e., a monotonically increasing ID for each layer file) so that we don't need to deal with upload-delete races at all.
- Handling meta operations in the upload queue instead of atomic primitives like upload/delete. We only have a few operations in the layer file manager: L0 compaction, freeze/flush, GC. This means that the operations submitted to the upload queue are always of the form "upload layers x N, deletion, update index part". Hence, uploads and deletes never need to be reordered across index_part uploads, because deletes always get appended to the queue after the uploads of each meta operation.
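The first idea above (a unique, monotonically increasing ID per layer file) could look roughly like this. This is an illustrative sketch, not the pageserver's actual naming scheme; the counter and path format are assumptions, and in practice the counter would need to be persisted across restarts:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical process-wide monotonic counter for layer file IDs.
static NEXT_LAYER_ID: AtomicU64 = AtomicU64::new(1);

/// Build a remote path that embeds a fresh unique ID, so a new layer file
/// can never collide with (or race against the deletion of) an old file
/// covering the same key range.
fn new_layer_path(key_range: &str) -> String {
    let id = NEXT_LAYER_ID.fetch_add(1, Ordering::Relaxed);
    format!("{key_range}-{id:016x}.layer")
}
```

With unique paths, an upload and a delete can never target the same object, so the upload-delete ordering constraint disappears entirely.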
Can we also apply the env variable in CI and see if it passes all tests?
Finally getting back to this. TFTRs!
This is already the case today -- we always have to upload the file before we upload the index. This won't become more common with this patch, since we don't delay index uploads more than we did before. If anything it might become less common, because index uploads no longer necessarily have to wait for all in-progress operations to complete first.
It's not really bottlenecked on the deletion queue race -- that race already exists today, but we can more easily fix that race now. And I think we'll have to do something similar to the current deletion queue flushing even if we do address it. See #10248 for index upload coalescing.
Yeah, immutable files are generally a much better idea. I'd be +1 on doing that instead at some point, and making path conflicts an error condition.
I think I'd prefer smaller atomic primitives -- it's a smaller set of cases to deal with, and it's easier to reason about the interactions. That scheme also wouldn't address the motivating case here: layer flushes currently upload layer/index/layer/index, and we want to upload the layers in parallel.
Good idea, I'll open a draft PR for a CI run.
Split this out to a follow-up PR: #10384.
All tests passed on #10385, let's see if this thing flies.
## Problem

With upload queue reordering in #10218, we can easily get into a situation where multiple index uploads are queued back to back, which can't be parallelized. This will happen e.g. when multiple layer flushes enqueue layer/index/layer/index/... and the layers skip the queue and are uploaded in parallel. These index uploads will incur serial S3 roundtrip latencies, and may block later operations.

Touches #10096.

## Summary of changes

When multiple back-to-back index uploads are ready to upload, only upload the most recent index and drop the rest.
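A minimal sketch of that coalescing rule, using illustrative types rather than the actual pageserver queue structures: within any run of adjacent index uploads, each index supersedes the previous one, so only the last needs to reach remote storage.

```rust
#[derive(Clone, Debug, PartialEq)]
enum UploadOp {
    Layer(&'static str),
    Index(u64), // sequence number of the index; illustrative only
}

/// Drop all but the last of any run of adjacent index uploads: a newer
/// index fully supersedes an older one, so uploading the older one is
/// wasted work.
fn coalesce_index_uploads(queue: &mut Vec<UploadOp>) {
    let mut i = 0;
    while i + 1 < queue.len() {
        if matches!(
            (&queue[i], &queue[i + 1]),
            (UploadOp::Index(_), UploadOp::Index(_))
        ) {
            queue.remove(i); // the older index is superseded
        } else {
            i += 1;
        }
    }
}
```

For example, a queue of index(1), index(2), layer, index(3) would collapse to index(2), layer, index(3): index(1) is dropped, while the indexes separated by the layer upload are kept.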
Problem
The upload queue currently sees significant head-of-line blocking. For example, index uploads act as upload barriers, and for every layer flush we schedule a layer and index upload, which effectively serializes layer uploads.
Resolves #10096.
Summary of changes
Allow upload queue operations to bypass the queue if they don't conflict with preceding operations, increasing parallelism.
NB: the upload queue currently schedules an explicit barrier after every layer flush as well (see #8550). This must be removed to enable parallelism. This will require a better mechanism for compaction backpressure, see e.g. #8390 or #5415.
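In outline, the bypass rule could look like the following sketch. The types and the exact conflict rules here are hypothetical simplifications (the real implementation lives in the pageserver's upload queue and also tracks in-progress operations); in particular, this sketch conservatively keeps deletes ordered against all index uploads:

```rust
#[derive(Debug)]
enum QueuedOp {
    UploadLayer(&'static str),
    DeleteLayer(&'static str),
    UploadIndex(Vec<&'static str>), // layer files this index references
}

/// Whether two operations must retain their relative order.
fn conflicts(a: &QueuedOp, b: &QueuedOp) -> bool {
    use QueuedOp::*;
    match (a, b) {
        // Operations on the same file must stay ordered.
        (UploadLayer(x) | DeleteLayer(x), UploadLayer(y) | DeleteLayer(y)) => x == y,
        // A layer upload only orders against an index that references it,
        // mirroring "skip the queue if not yet in index_part".
        (UploadLayer(f), UploadIndex(refs)) | (UploadIndex(refs), UploadLayer(f)) => {
            refs.contains(f)
        }
        // Deletes and index uploads stay ordered in this sketch, as do
        // index uploads among themselves.
        (DeleteLayer(_), UploadIndex(_)) | (UploadIndex(_), DeleteLayer(_)) => true,
        (UploadIndex(_), UploadIndex(_)) => true,
    }
}

/// An operation may bypass the queue if it conflicts with none of the
/// operations ahead of it.
fn can_bypass(op: &QueuedOp, ahead: &[QueuedOp]) -> bool {
    ahead.iter().all(|q| !conflicts(op, q))
}
```

Under these rules, a freshly flushed layer can overtake a queued index upload that doesn't reference it, which is exactly what lets multiple layer/index pairs upload their layers in parallel.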