pageserver: reorder upload queue when possible #10218

Merged 2 commits into main from erik/upload-reorder on Jan 14, 2025

Conversation

@erikgrinaker (Contributor) commented Dec 20, 2024

Problem

The upload queue currently sees significant head-of-line blocking. For example, index uploads act as upload barriers, and for every layer flush we schedule a layer and index upload, which effectively serializes layer uploads.

Resolves #10096.

Summary of changes

Allow upload queue operations to bypass the queue if they don't conflict with preceding operations, increasing parallelism.

NB: the upload queue currently schedules an explicit barrier after every layer flush as well (see #8550). This must be removed to enable parallelism. This will require a better mechanism for compaction backpressure, see e.g. #8390 or #5415.
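
To make the bypass rule concrete, here is a minimal, self-contained Rust sketch. The names (`Op`, `conflicts`, `can_bypass`) are illustrative only, not the pageserver's actual types, and the real conflict check is finer-grained (for instance, an index upload only conflicts with operations on layers that the index references):

```rust
// Illustrative sketch only: an operation may skip ahead in the upload queue
// if it does not conflict with any operation queued or in progress ahead of it.

#[derive(Debug)]
enum Op {
    UploadLayer(String), // layer file name
    DeleteLayer(String),
    UploadIndex, // index_part.json upload
}

fn conflicts(a: &Op, b: &Op) -> bool {
    use Op::*;
    match (a, b) {
        // This sketch conservatively treats an index upload as conflicting with
        // everything; the real check only considers layers the index references.
        (UploadIndex, _) | (_, UploadIndex) => true,
        // Layer operations conflict only when they touch the same file.
        (UploadLayer(x) | DeleteLayer(x), UploadLayer(y) | DeleteLayer(y)) => x == y,
    }
}

/// Returns true if `candidate` may bypass everything currently ahead of it.
fn can_bypass(candidate: &Op, ahead: &[Op]) -> bool {
    ahead.iter().all(|prior| !conflicts(candidate, prior))
}

fn main() {
    let ahead = vec![Op::UploadLayer("000A".into()), Op::UploadLayer("000B".into())];
    // A layer upload for a different file can run in parallel with the queued uploads.
    assert!(can_bypass(&Op::UploadLayer("000C".into()), &ahead));
    // An index upload must wait for the operations ahead of it.
    assert!(!can_bypass(&Op::UploadIndex, &ahead));
    println!("bypass checks passed");
}
```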

github-actions bot commented Dec 20, 2024

7330 tests run: 6959 passed, 0 failed, 371 skipped (full report)


Flaky tests (2): Postgres 17, Postgres 14

Code coverage* (full report)

  • functions: 32.9% (8143 of 24756 functions)
  • lines: 48.2% (68254 of 141718 lines)

* collected from Rust tests only


This comment is automatically updated with the latest test results.
11a217e at 2025-01-14T14:45:52.163Z

@skyzh self-requested a review December 20, 2024 22:56
@erikgrinaker force-pushed the erik/upload-reorder branch 4 times, most recently from fdb6fbb to b032c18 on December 27, 2024 16:38
@erikgrinaker changed the base branch from main to erik/assert-upload-index on December 30, 2024 09:38
@erikgrinaker force-pushed the erik/upload-reorder branch 2 times, most recently from 4eea68f to c09b6a7 on December 30, 2024 14:17
@erikgrinaker force-pushed the erik/upload-reorder branch 9 times, most recently from 91be562 to dad6139 on January 2, 2025 14:46
@erikgrinaker force-pushed the erik/upload-reorder branch 2 times, most recently from 76d75e5 to e207145 on January 3, 2025 11:01
@erikgrinaker (Contributor, Author)

Added a benchmark measuring the next_ready() time for an UploadMetadata operation with a given number of Delete tasks already in progress. It shows a linear cost of 1 ms per 10,000 in-progress operations (per queued task), 85% of which is due to LayerName::hash for IndexPart::layer_metadata lookups. We can estimate that the common case of layer uploads/deletes costs only about 1 ms per 100,000 in-progress operations.

It isn't clear that we'll hit this number of in-progress operations under realistic conditions, so this is probably fine.
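
For intuition on where that linear cost comes from, here is a self-contained sketch with hypothetical types (not the pageserver's actual ones): deciding whether a queued index upload is ready takes one layer_metadata lookup per in-progress operation, so each readiness check is linear in the number of in-progress operations and dominated by hashing layer names.

```rust
use std::collections::{HashMap, HashSet};

type LayerName = String;

struct IndexPart {
    // The real IndexPart maps LayerName to layer metadata; the value type is
    // irrelevant for the cost argument, so a placeholder is used here.
    layer_metadata: HashMap<LayerName, u64>,
}

/// Sketch of a readiness check for a queued index upload: it may start only if
/// none of the layers it references still have an operation in progress.
/// Cost: one hash lookup per in-progress operation, i.e. O(|in_progress|).
fn index_upload_ready(index: &IndexPart, in_progress: &HashSet<LayerName>) -> bool {
    in_progress
        .iter()
        .all(|name| !index.layer_metadata.contains_key(name))
}

fn main() {
    let index = IndexPart {
        layer_metadata: HashMap::from([("000A".to_string(), 1), ("000B".to_string(), 2)]),
    };
    // Layer 000B is referenced by the index and still has an operation in
    // progress, so the index upload is not ready yet.
    let in_progress = HashSet::from(["000B".to_string()]);
    assert!(!index_upload_ready(&index, &in_progress));
}
```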

Base automatically changed from erik/assert-upload-index to main January 3, 2025 16:04
@erikgrinaker marked this pull request as ready for review January 4, 2025 09:08
@erikgrinaker requested a review from a team as a code owner January 4, 2025 09:08
@erikgrinaker requested a review from VladLazar January 4, 2025 09:12
@VladLazar (Contributor) left a comment

Went over it a few times. Looks correct and the tests are convincing. Perhaps worth a second pair of eyes though.

@erikgrinaker (Contributor, Author)

@skyzh Want to give this another look-over?

@skyzh (Member) left a comment


overall LGTM.

The patch is still bottlenecked by two things: flushing the deletion queue (as mentioned in #10283) and reordering index_part uploads. These could be addressed in future patches.

The patch allows uploads to skip the queue if they are not yet referenced in index_part, which means that we can upload more aggressively, and that we would potentially get files in remote storage that are not referenced in index_part. I'll update the storage scrubber check at some point.

Moving forward, we can of course modify the bypass algorithm to allow more parallelism (e.g., batching index_part uploads), but I wonder if things could get easier with the following ideas:

  • Having a unique ID for each file (i.e., a monotonically increasing ID for each layer file), so that we don't need to deal with the upload/delete race at all.
  • Handling meta operations in the upload queue instead of atomic primitives like upload/delete. We only have a few operations in the layer file manager: L0 compaction, freeze/flush, and GC. This means the operations submitted to the upload queue are always of the form "upload layers x N, deletion, update index part". Hence, uploads/deletes never need to be reordered across index_part uploads, because deletes are always appended to the queue after the uploads of each meta operation (see the sketch below).
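
A rough, self-contained sketch of this second idea (all names here are hypothetical, not an existing pageserver API): each meta operation expands into a fixed sequence of primitive steps, so deletes always follow the uploads of the same operation and never need to cross an index upload.

```rust
// Hypothetical sketch of the "meta operation" idea above; none of these names
// exist in the pageserver. Each meta operation expands into primitive steps in
// the fixed order "upload layers x N, deletions, update index part".

type LayerName = String;

enum MetaOp {
    /// Freeze/flush: upload the newly flushed layers.
    Flush { new_layers: Vec<LayerName> },
    /// L0 compaction: upload the compacted layers, delete the inputs.
    Compaction { new_layers: Vec<LayerName>, replaced_layers: Vec<LayerName> },
    /// GC: delete layers that are no longer needed.
    Gc { removed_layers: Vec<LayerName> },
}

/// The primitive steps a meta operation expands into.
#[derive(Debug)]
enum Step {
    UploadLayer(LayerName),
    DeleteLayer(LayerName),
    UploadIndex,
}

fn expand(op: MetaOp) -> Vec<Step> {
    let (uploads, deletes) = match op {
        MetaOp::Flush { new_layers } => (new_layers, vec![]),
        MetaOp::Compaction { new_layers, replaced_layers } => (new_layers, replaced_layers),
        MetaOp::Gc { removed_layers } => (vec![], removed_layers),
    };
    uploads
        .into_iter()
        .map(Step::UploadLayer)
        .chain(deletes.into_iter().map(Step::DeleteLayer))
        .chain([Step::UploadIndex])
        .collect()
}

fn main() {
    let steps = expand(MetaOp::Compaction {
        new_layers: vec!["000C".into()],
        replaced_layers: vec!["000A".into(), "000B".into()],
    });
    // [UploadLayer("000C"), DeleteLayer("000A"), DeleteLayer("000B"), UploadIndex]
    println!("{steps:?}");
}
```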

2 resolved review threads on pageserver/src/tenant/upload_queue.rs
@skyzh (Member) commented Jan 6, 2025

Can we also apply the env variable in CI and see if it passes all tests?

@erikgrinaker (Contributor, Author)

Finally getting back to this. TFTRs!

> we would potentially get files in remote storage that are not referenced in index_part

This is already the case today -- we always have to upload the file before we upload the index. This won't become more common with this patch, since we don't delay index uploads more than we did before. If anything it might become less common, because index uploads no longer necessarily have to wait for all in-progress operations to complete first.

> The patch is still bottlenecked by two things: flushing the deletion queue (as mentioned in #10283) and reordering index_part uploads.

It's not really bottlenecked on the deletion queue race -- that race already exists today, but we can more easily fix that race now. And I think we'll have to do something similar to the current deletion queue flushing even if we do address it.

See #10248 for index upload coalescing.

> Having a unique ID for each file (i.e., a monotonically increasing ID for each layer file), so that we don't need to deal with the upload/delete race at all.

Yeah, immutable files are generally a much better idea. I'd be +1 on doing that instead at some point, and making path conflicts an error condition.

> Handling meta operations in the upload queue instead of atomic primitives like upload/delete

I think I'd prefer smaller atomic primitives -- it's a smaller set of cases to deal with, and it's easier to reason about the interactions.

That scheme also wouldn't address the motivating case here: layer flushes currently upload layer/index/layer/index, and we want to upload the layers in parallel.

> Can we also apply the env variable in CI and see if it passes all tests?

Good idea, I'll open a draft PR for a CI run.

@erikgrinaker (Contributor, Author) commented Jan 14, 2025

> Can we also apply the env variable in CI and see if it passes all tests?
>
> Good idea, I'll open a draft PR for a CI run.

#10385

I'm also going to add a limit for the number of in-progress items here, and set it to the remote storage concurrency (100), since it's pointless to schedule more tasks than this. That will put a bound on the complexity in most common cases.

Split this out to a follow-up PR: #10384.
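
The follow-up may implement the limit differently; purely as an illustration of bounding in-flight work, here is a sketch using tokio's Semaphore (the constant name and where the bound is enforced are assumptions, not the pageserver's actual mechanism).

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Assumed to match the remote storage upload concurrency.
const MAX_IN_PROGRESS: usize = 100;

#[tokio::main]
async fn main() {
    let limiter = Arc::new(Semaphore::new(MAX_IN_PROGRESS));
    let mut handles = Vec::new();
    for i in 0..250u32 {
        // Waits once MAX_IN_PROGRESS tasks are in flight, bounding both the
        // remote storage load and the size of the in-progress set that each
        // readiness check has to scan.
        let permit = limiter.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            // ... perform the actual upload or deletion here ...
            drop(permit); // release the slot when the task completes
            i
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }
}
```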

@erikgrinaker (Contributor, Author)

All tests passed on #10385, let's see if this thing flies.

@erikgrinaker added this pull request to the merge queue Jan 14, 2025
Merged via the queue into main with commit ffaa52f Jan 14, 2025
85 checks passed
@erikgrinaker deleted the erik/upload-reorder branch January 14, 2025 16:32
github-merge-queue bot pushed a commit that referenced this pull request Jan 14, 2025
## Problem

With upload queue reordering in #10218, we can easily get into a
situation where multiple index uploads are queued back to back, which
can't be parallelized. This will happen e.g. when multiple layer flushes
enqueue layer/index/layer/index/... and the layers skip the queue and
are uploaded in parallel.

These index uploads will incur serial S3 roundtrip latencies, and may
block later operations.

Touches #10096.

## Summary of changes

When multiple back-to-back index uploads are ready to upload, only
upload the most recent index and drop the rest.
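
As an illustration of that coalescing rule, here is a minimal sketch with hypothetical types (not the actual upload queue implementation): when a run of consecutive index uploads sits at the front of the ready queue, only the newest one is kept and uploaded.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
enum ReadyOp {
    UploadLayer(&'static str),
    UploadIndex { version: u64 },
}

/// Pops the next operation to execute, collapsing a leading run of index
/// uploads into the single newest one.
fn next_op(queue: &mut VecDeque<ReadyOp>) -> Option<ReadyOp> {
    let mut op = queue.pop_front()?;
    while matches!(op, ReadyOp::UploadIndex { .. })
        && matches!(queue.front(), Some(ReadyOp::UploadIndex { .. }))
    {
        // Drop the older index; the newer one supersedes it.
        op = queue.pop_front().expect("front() was Some");
    }
    Some(op)
}

fn main() {
    use ReadyOp::*;
    let mut queue = VecDeque::from([
        UploadIndex { version: 1 },
        UploadIndex { version: 2 },
        UploadIndex { version: 3 },
        UploadLayer("000A"),
    ]);
    // The three queued indexes collapse into a single upload of the newest one.
    assert_eq!(next_op(&mut queue), Some(UploadIndex { version: 3 }));
    assert_eq!(next_op(&mut queue), Some(UploadLayer("000A")));
}
```
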
Successfully merging this pull request may close these issues: pageserver: improve flush upload queue parallelism.

4 participants