-
Notifications
You must be signed in to change notification settings - Fork 477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proxy release 2025-01-23 #10486
Closed
Closed
Proxy release 2025-01-23 #10486
+4,630
−1,466
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…#10406) ## Problem As part of #8614 we need to pass options to START_WAL_PUSH. ## Summary of changes Add two options. `allow_timeline_creation`, default true, disables implicit timeline creation in the connection from compute. Eventually such creation will be forbidden completely, but as we migrate to configurations we need to support both: current mode and configurations enabled where creation by compute is disabled. `proto_version` specifies compute <-> sk protocol version. We have it currently in the first greeting package also, but I plan to change tag size from u64 to u8, which would make it hard to use. Command is more appropriate place for it anyway.
This should fix the largest source of flakyness of test_nbtree_pagesplit_cycleid. ## Problem #10390 ## Summary of changes By using a guaranteed-flushed LSN, we ensure that PS won't have to wait forever. (If it does wait forever, we know the issue can't be with Compute's WAL)
## Problem When multiple changes are grouped in a merge group to be merged as part of the merge queue, the changes might individually pass `check-codestyle-rust` but not in their combined form. ## Summary of changes - Move `check-codestyle-rust` into a reusable workflow that is called from it's previous location in `build_and_test.yml`, and additionally call it from `pre_merge_checks.yml`. The additional call does not run on ARM, only x86, to ensure the merge queue continues being responsive. - Trigger `pre_merge_checks.yml` on PRs that change any of the workflows running in `pre_merge_checks.yml`, so that we get feedback on those early an not only after trying to merge those changes.
## Problem Threads spawned in `test_tenant_delete_races_timeline_creation` are not joined before the test ends, and can generate `PytestUnhandledThreadExceptionWarning` in other tests. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10419/12805365523/index.html#/testresult/53a72568acd04dbd ## Summary of changes - Wrap threads in ThreadPoolExecutor which will join them before the test ends - Remove a spurious deletion call -- the background thread doing deletion ought to succeed.
Rename the safekeeper scheduling policy "disabled" to "pause". A rename was requested in #10400 (comment), as the "disabled" policy is meant to be analogous to the "pause" policy for pageservers. Also simplify the `SkSchedulingPolicyArg::from_str` function, relying on the `from_str` implementation of `SkSchedulingPolicy`. Latter is used for the database format as well, so it is quite stable. If we ever want to change the UI, we'll need to duplicate the function again but this is cheap.
## Problem gc-compaction needs the partitioning data to decide the job split. This refactor allows concurrent access/computing the partitioning. ## Summary of changes Make `partitioning` an ArcSwap so that others can access the partitioning while we compute it. Fully eliminate the `repartition is called concurrently` warning when gc-compaction is going on. --------- Signed-off-by: Alex Chi Z <[email protected]>
…0422) ## Problem We were comparing serialized configs from the database with serialized configs from memory. If fields have been added/removed to TenantConfig, this generates spurious consistency errors. This is fine in test environments, but limits the usefulness of this debug API in the field. Closes: #10369 ## Summary of changes - Do a decode/encode cycle on the config before comparing it, so that it will have exactly the expected fields.
## Problem Node fills were limited to moving (total shards / node_count) shards. In systems that aren't perfectly balanced already, that leads us to skip migrating some of the shards that belong on this node, generating work for the optimizer later to gradually move them back. ## Summary of changes - Where a shard has a preferred AZ and is currently attached outside this AZ, then always promote it during fill, irrespective of target fill count
## Problem All pageserver have the same application name which makes it hard to distinguish them. ## Summary of changes Include the node id in the application name sent to the safekeeper. This should gives us more visibility in logs. There's a few metrics that will increase in cardinality by `pageserver_count`, but that's fine.
## Problem `test_storage_controller_node_deletion` sometimes failed because shards were moving around during timeline creation, and neon_local isn't tolerant of that. The movements were unexpected because the shards had only just been created. This was a regression from #9916 Closes: #10383 ## Summary of changes - Make this test use multiple AZs -- this makes the storage controller's scheduling reliably stable Why this works: in #9916 , I made a simplifying assumption that we would have multiple AZs to get nice stable scheduling -- it's much easier, because each tenant has a well defined primary+secondary location when they have an AZ preference and nodes have different AZs. Everything still works if you don't have multiple AZs, but you just have this quirk that sometimes the optimizer can disagree with initial scheduling, so once in a while a shard moves after being created -- annoying for tests, harmless IRL.
The extension now supports Postgres 17. The release also seems to be binary compatible with the previous version. Signed-off-by: Tristan Partin <[email protected]>
…pt open while flushing responses (#10386) # Refs - fixes #10309 - fixup of batching design, first introduced in #9851 - refinement of #8339 # Problem `Tenant::shutdown` was occasionally taking many minutes (sometimes up to 20) in staging and prod if the `page_service_pipelining.mode="concurrent-futures"` is enabled. # Symptoms The issue happens during shard migration between pageservers. There is page_service unavailability and hence effectively downtime for customers in the following case: 1. The source (state `AttachedStale`) gets stuck in `Tenant::shutdown`, waiting for the gate to close. 2. Cplane/Storcon decides to transition the target `AttachedMulti` to `AttachedSingle`. 3. That transition comes with a bump of the generation number, causing the `PUT .../location_config` endpoint to do a full `Tenant::shutdown` / `Tenant::attach` cycle for the target location. 4. That `Tenant::shutdown` on the target gets stuck, waiting for the gate to close. 5. Eventually the gate closes (`close completed`), correlating with a `page_service` connection handler logging that it's exiting because of a network error (`Connection reset by peer` or `Broken pipe`). While in (4): - `Tenant::shutdown` is stuck waiting for all `Timeline::shutdown` calls to complete. So, really, this is a `Timeline::shutdown` bug. - retries from Cplane/Storcon to complete above state transitions, fail with errors related to the tenant mgr slot being in state `TenantSlot::InProgress`, the tenant state being `TenantState::Stopping`, and the timelines being in `TimelineState::Stopping`, and the `Timeline::cancel` being cancelled. - Existing (and/or new?) page_service connections log errors `error reading relation or page version: Not found: Timed out waiting 30s for tenant active state. Latest state: None` # Root-Cause After a lengthy investigation ([internal write-up](https://www.notion.so/neondatabase/2025-01-09-batching-deadlock-Slow-Log-Analysis-in-Staging-176f189e00478050bc21c1a072157ca4?pvs=4)) I arrived at the following root cause. The `spsc_fold` channel (`batch_tx`/`batch_rx`) that connects the Batcher and Executor stages of the pipelined mode was storing a `Handle` and thus `GateGuard` of the Timeline that was not shutting down. The design assumption with pipelining was that this would always be a short transient state. However, that was incorrect: the Executor was stuck on writing/flushing an earlier response into the connection to the client, i.e., socket write being slow because of TCP backpressure. The probable scenario of how we end up in that case: 1. Compute backend process sends a continuous stream of getpage prefetch requests into the connection, but never reads the responses (why this happens: see Appendix section). 2. Batch N is processed by Batcher and Executor, up to the point where Executor starts flushing the response. 3. Batch N+1 is procssed by Batcher and queued in the `spsc_fold`. 4. Executor is still waiting for batch N flush to finish. 5. Batcher eventually hits the `TimeoutReader` error (10min). From here on it waits on the `spsc_fold.send(Err(QueryError(TimeoutReader_error)))` which will never finish because the batch already inside the `spsc_fold` is not being read by the Executor, because the Executor is still stuck in the flush. (This state is not observable at our default `info` log level) 6. Eventually, Compute backend process is killed (`close()` on the socket) or Compute as a whole gets killed (probably no clean TCP shutdown happening in that case). 7. Eventually, Pageserver TCP stack learns about (6) through RST packets and the Executor's flush() call fails with an error. 8. The Executor exits, dropping `cancel_batcher` and its end of the spsc_fold. This wakes Batcher, causing the `spsc_fold.send` to fail. Batcher exits. The pipeline shuts down as intended. We return from `process_query` and log the `Connection reset by peer` or `Broken pipe` error. The following diagram visualizes the wait-for graph at (5) ```mermaid flowchart TD Batcher --spsc_fold.send(TimeoutReader_error)--> Executor Executor --flush batch N responses--> socket.write_end socket.write_end --wait for TCP window to move forward--> Compute ``` # Analysis By holding the GateGuard inside the `spsc_fold` open, the pipelining implementation violated the principle established in (#8339). That is, that `Handle`s must only be held across an await point if that await point is sensitive to the `<Handle as Deref<Target=Timeline>>::cancel` token. In this case, we were holding the Handle inside the `spsc_fold` while awaiting the `pgb_writer.flush()` future. One may jump to the conclusion that we should simply peek into the spsc_fold to get that Timeline cancel token and be sensitive to it during flush, then. But that violates another principle of the design from #8339. That is, that the page_service connection lifecycle and the Timeline lifecycles must be completely decoupled. Tt must be possible to shut down one shard without shutting down the page_service connection, because on that single connection we might be serving other shards attached to this pageserver. (The current compute client opens separate connections per shard, but, there are plans to change that.) # Solution This PR adds a `handle::WeakHandle` struct that does _not_ hold the timeline gate open. It must be `upgrade()`d to get a `handle::Handle`. That `handle::Handle` _does_ hold the timeline gate open. The batch queued inside the `spsc_fold` only holds a `WeakHandle`. We only upgrade it while calling into the various `handle_` methods, i.e., while interacting with the `Timeline` via `<Handle as Deref<Target=Timeline>>`. All that code has always been required to be (and is!) sensitive to `Timeline::cancel`, and therefore we're guaranteed to bail from it quickly when `Timeline::shutdown` starts. We will drop the `Handle` immediately, before we start `pgb_writer.flush()`ing the responses. Thereby letting go of our hold on the `GateGuard`, allowing the timeline shutdown to complete while the page_service handler remains intact. # Code Changes * Reproducer & Regression Test * Developed and proven to reproduce the issue in #10399 * Add a `Test` message to the pagestream protocol (`cfg(feature = "testing")`). * Drive-by minimal improvement to the parsing code, we now have a `PagestreamFeMessageTag`. * Refactor `pageserver/client` to allow sending and receiving `page_service` requests independently. * Add a Rust helper binary to produce situation (4) from above * Rationale: (4) and (5) are the same bug class, we're holding a gate open while `flush()`ing. * Add a Python regression test that uses the helper binary to demonstrate the problem. * Fix * Introduce and use `WeakHandle` as explained earlier. * Replace the `shut_down` atomic with two enum states for `HandleInner`, wrapped in a `Mutex`. * To make `WeakHandle::upgrade()` and `Handle::downgrade()` cache-efficient: * Wrap the `Types::Timeline` in an `Arc` * Wrap the `GateGuard` in an `Arc` * The separate `Arc`s enable uncontended cloning of the timeline reference in `upgrade()` and `downgrade()`. If instead we were `Arc<Timeline>::clone`, different connection handlers would be hitting the same cache line on every upgrade()/downgrade(), causing contention. * Please read the udpated module-level comment in `mod handle` module-level comment for details. # Testing & Performance The reproducer test that failed before the changes now passes, and obviously other tests are passing as well. We'll do more testing in staging, where the issue happens every ~4h if chaos migrations are enabled in storcon. Existing perf testing will be sufficient, no perf degradation is expected. It's a few more alloctations due to the added Arc's, but, they're low frequency. # Appendix: Why Compute Sometimes Doesn't Read Responses Remember, the whole problem surfaced because flush() was slow because Compute was not reading responses. Why is that? In short, the way the compute works, it only advances the page_service protocol processing when it has an interest in data, i.e., when the pagestore smgr is called to return pages. Thus, if compute issues a bunch of requests as part of prefetch but then it turns out it can service the query without reading those pages, it may very well happen that these messages stay in the TCP until the next smgr read happens, either in that session, or possibly in another session. If there’s too many unread responses in the TCP, the pageserver kernel is going to backpressure into userspace, resulting in our stuck flush(). All of this stems from the way vanilla Postgres does prefetching and "async IO": it issues `fadvise()` to make the kernel do the IO in the background, buffering results in the kernel page cache. It then consumes the results through synchronous `read()` system calls, which hopefully will be fast because of the `fadvise()`. If it turns out that some / all of the prefetch results are not needed, Postgres will not be issuing those `read()` system calls. The kernel will eventually react to that by reusing page cache pages that hold completed prefetched data. Uncompleted prefetch requests may or may not be processed -- it's up to the kernel. In Neon, the smgr + Pageserver together take on the role of the kernel in above paragraphs. In the current implementation, all prefetches are sent as GetPage requests to Pageserver. The responses are only processed in the places where vanilla Postgres would do the synchronous `read()` system call. If we never get to that, the responses are queued inside the TCP connection, which, once buffers run full, will backpressure into Pageserver's sending code, i.e., the `pgb_writer.flush()` that was the root cause of the problems we're fixing in this PR.
Instead of generating our own request ID, we can just use the one provided by the control plane. In the event, we get a request from a client which doesn't set X-Request-ID, then we just generate one which is useful for tracing purposes. Signed-off-by: Tristan Partin <[email protected]>
## Problem 871e8b3 failed CI on main because a job ran to soon. This was caused by ea84ec3. While `promote-images-dev` does not inherently need `neon-image`, a few jobs depending on `promote-images-dev` do need it, and previously had it when it was `promote-images`, which depended on `test-images`, which in turn depended on `neon-image`. ## Summary of changes To ensure jobs depending `docker.io/neondatabase/neon` images get them, `promote-images-dev` gets the dependency to `neon-image` back which it previously had transitively through `test-images`.
## Problem Initdb observability is poor. ## Summary of changes Add some metrics so we can figure out which part, if any, is slow. Closes #10423
Add an endpoint to obtain the utilization of a safekeeper. Future changes to the storage controller can use this endpoint to find the most suitable safekeepers for newly created timelines, analogously to how it's done for pageservers already. Initially we just want to assign by timeline count, then we can iterate from there. Part of #9011
## Problem Occasionally, we encounter bugs in test environments that can be detected at the point of uploading an index, but we proceed to upload it anyway and leave a tenant in a broken state that's awkward to handle. ## Summary of changes - Validate index when submitting it for upload, so that we can see the issue quickly e.g. in an API invoking compaction - Validate index before executing the upload, so that we have a hard enforcement that any code path that tries to upload an index will not overwrite a valid index with an invalid one.
## Problem Since #9916 , the chaos code is actively fighting the optimizer: tenants tend to be attached in their preferred AZ, so most chaos migrations were moving them to a non-preferred AZ. ## Summary of changes - When picking migrations, prefer to migrate things _toward_ their preferred AZ when possible. Then pick shards to move the other way when necessary. The resulting behavior should be an alternating "back and forth" where the chaos code migrates thiings away from home, and then migrates them back on the next iteration. The side effect will be that the chaos code actively helps to push things into their home AZ. That's not contrary to its purpose though: we mainly just want it to continuously migrate things to exercise migration+notification code.
The other test focus on the external interface usage while the tests added in this PR add some testing around HandleInner's lifecycle, ensuring we don't leak it once either connection gets dropped or per-timeline-state is shut down explicitly.
## Summary Whereas currently we send all WAL to all pageserver shards, and each shard filters out the data that it needs, in this RFC we add a mechanism to filter the WAL on the safekeeper, so that each shard receives only the data it needs. This will place some extra CPU load on the safekeepers, in exchange for reducing the network bandwidth for ingesting WAL back to scaling as O(1) with shard count, rather than O(N_shards). Touches #9329. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Vlad Lazar <[email protected]> Co-authored-by: Vlad Lazar <[email protected]>
…imeline shutdown (#10445) ## Refs - fixes #10444 ## Problem We're seeing a panic `handles are only shut down once in their lifetime` in our performance testbed. ## Hypothesis Annotated code in #10444 (comment). ``` T1: drop Cache, executes up to (1) => HandleInner is now in state ShutDown T2: Timeline::shutdown => PerTimelineState::shutdown executes shutdown() again => panics ``` Likely this snuck in the final touches of #10386 where I narrowed down the locking rules. ## Summary of changes Make duplicate shutdowns a no-op.
Otherwise we might hit ERRORs in otherwise safe situations (such as user queries), which isn't a great user experience. ## Problem #10376 ## Summary of changes Instead of accepting internal errors as acceptable, we ensure we don't exceed our allocated usage.
## Problem Part of #9516 per RFC at #10412 ## Summary of changes Adding the necessary config items and index_part items for the large relation count work. --------- Signed-off-by: Alex Chi Z <[email protected]>
This simplifies the code in `pageserver_physical_gc` a little bit after the feedback in #10007 that the code is too complicated. Most importantly, we don't pass around `GcSummary` any more in a complicated fashion, and we save on async stream-combinator-inception in one place in favour of `try_stream!{}`. Follow-up of #10007
## Problem When releasing `release-7574`, the Github Release creation failed with "body is too long" (see https://github.com/neondatabase/neon/actions/runs/12834025431/job/35792346745#step:5:77). There's lots of room for improvement of the release notes, but for now we'll disable them instead. ## Summary of changes - Disable automatic generation of release notes for Github releases - Enable creation of Github releases for proxy/compute
) We currently have some flakiness in `test_scrubber_physical_gc_ancestors`, see #10391. The first flakiness kind is about the reconciler not actually becoming idle within the timeout of 30 seconds. We see continuous forward progress so this is likely not a hang. We also see this happen in parallel to a test failure, so is likely due to runners being overloaded. Therefore, we increase the timeout. The second flakiness kind is an assertion failure. This one is a little bit more tricky, but we saw in the successful run that there was some advance of the lsn between the compaction ran (which created layer files) and the gc run. Apparently gc rejects reductions to the single image layer setting if the cutoff lsn is the same as the lsn of the image layer: it will claim that that layer is newer than the space cutoff and therefore skip it, while thinking the old layer (that we want to delete) is the latest one (so it's not deleted). We address the second flakiness kind by inserting a tiny amount of WAL between the compaction and gc. This should hopefully fix things. Related issue: #10391 (not closing it with the merger of the PR as we'll need to validate that these changes had the intended effect). Thanks to Chi for going over this together with me in a call.
We did not have any tests on fast_import binary yet. In this PR I have introduced: - `FastImport` class and tools for testing in python - basic test that runs fast import against vanilla postgres and checks that data is there Should be merged after #10251
## Problem It's sometimes useful to obtain the elapsed duration from a `StorageTimeMetricsTimer` for purposes beyond just recording it in metrics (e.g. to log it). Extracted from #10405. ## Summary of changes Add `StorageTimeMetricsTimer.elapsed()` and return the duration from `stop_and_record()`.
## Problem For compaction backpressure, we need a mechanism to signal when compaction has reduced the L0 delta layer count below the backpressure threshold. Extracted from #10405. ## Summary of changes Add `LayerMap::watch_level0_deltas()` which returns a `tokio::sync::watch::Receiver` signalling the current L0 delta layer count.
## Problem Currently, the layer flush loop will continue flushing layers as long as any are pending, and only notify waiters once there are no further layers to flush. This can cause waiters to wait longer than necessary, and potentially starve them if pending layers keep arriving faster than they can be flushed. The impact of this will increase when we add compaction backpressure and propagate it up into the WAL receiver. Extracted from #10405. ## Summary of changes Break out of the layer flush loop once we've flushed up to the requested LSN. If further flush requests have arrived in the meanwhile, flushing will resume immediately after.
To help facilitate an upgrade to axum 0.8 (#10332 (review)) this massages the tonic dependency features so that tonic does not depend on axum.
…10472) ## Problem PR #9993 was supposed to enable `page_service_pipelining` by default for all `NeonEnv`s, but this was ineffective in our CI environment. Thus, CI Python-based tests and benchmarks, unless explicitly configuring pipelining, were still using serial protocol handling. ## Analysis The root cause was that in our CI environment, `config.compatibility_neon_binpath` is always Truthy. It's not in local environments, which is why this slipped through in local testing. Lesson: always add a log line ot pageserver startup and spot-check tests to ensure the intended default is picked up. ## Summary of changes Fix it. Since enough time has passed, the compatiblity snapshot contains a recent enough software version so we don't need to worry about `compatibility_neon_binpath` anymore. ## Future Work The question how to add a new default except for compatibliity tests, which is what the broken code was supposed to do, is still unsolved. Slack discussion: https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
# Refs - extracted from #9353 # Problem Before this PR, when task_mgr shutdown is signalled, e.g. during pageserver shutdown or Tenant shutdown, initial logical size calculation stops polling and drops the future that represents the calculation. This is against the current policy that we poll all futures to completion. This became apparent during development of concurrent IO which warns if we drop a `Timeline::get_vectored` future that still has in-flight IOs. We may revise the policy in the future, but, right now initial logical size calculation is the only part of the codebase that doesn't adhere to the policy, so let's fix it. ## Code Changes - make sensitive exclusively to `Timeline::cancel` - This should be sufficient for all cases of shutdowns; the sensitivity to task_mgr shutdown is unnecessary. - this broke the various cancel tests in `test_timeline_size.py`, e.g., `test_timeline_initial_logical_size_calculation_cancellation` - the tests would time out because the await point was not sensitive to cancellation - to fix this, refactor `pausable_failpoint` so that it accepts a cancellation token - side note: we _really_ should write our own failpoint library; maybe after we get heap-allocated RequestContext, we can plumb failpoints through there.
## Problem Both these versions are binary compatible, but the way pgvector structures the SQL files forbids installing 0.7.4 if you have a 0.8.0 distribution. Yet, some users may need a previous version for backward compatibility, e.g., restoring the dump. See this thread for discussion https://neondb.slack.com/archives/C04DGM6SMTM/p1735911490242919?thread_ts=1731343604.259169&cid=C04DGM6SMTM ## Summary of changes Put `vector--0.7.4.sql` file into compute image to allow installing this version as well. Tested on staging and it seems to be working as expected: ```sql select * from pg_available_extensions where name = 'vector'; name | default_version | installed_version | comment --------+-----------------+-------------------+------------------------------------------------------ vector | 0.8.0 | (null) | vector data type and ivfflat and hnsw access methods create extension vector version '0.7.4'; select * from pg_available_extensions where name = 'vector'; name | default_version | installed_version | comment --------+-----------------+-------------------+------------------------------------------------------ vector | 0.8.0 | 0.7.4 | vector data type and ivfflat and hnsw access methods alter extension vector update; select * from pg_available_extensions where name = 'vector'; name | default_version | installed_version | comment --------+-----------------+-------------------+------------------------------------------------------ vector | 0.8.0 | 0.8.0 | vector data type and ivfflat and hnsw access methods drop extension vector; create extension vector; select * from pg_available_extensions where name = 'vector'; name | default_version | installed_version | comment --------+-----------------+-------------------+------------------------------------------------------ vector | 0.8.0 | 0.8.0 | vector data type and ivfflat and hnsw access methods ``` If we find out it's a good approach, we can adopt the same for other extensions with a stable ABI -- support both `current` and `current - 1` releases.
## Refs - Epic: #9378 Co-authored-by: Vlad Lazar <[email protected]> Co-authored-by: Christian Schwarz <[email protected]> ## Problem The read path does its IOs sequentially. This means that if N values need to be read to reconstruct a page, we will do N IOs and getpage latency is `O(N*IoLatency)`. ## Solution With this PR we gain the ability to issue IO concurrently within one layer visit **and** to move on to the next layer without waiting for IOs from the previous visit to complete. This is an evolved version of the work done at the Lisbon hackathon, cf #9002. ## Design ### `will_init` now sourced from disk btree index keys On the algorithmic level, the only change is that the `get_values_reconstruct_data` now sources `will_init` from the disk btree index key (which is PS-page_cache'd), instead of from the `Value`, which is only available after the IO completes. ### Concurrent IOs, Submission & Completion To separate IO submission from waiting for its completion, while simultaneously feature-gating the change, we introduce the notion of an `IoConcurrency` struct through which IO futures are "spawned". An IO is an opaque future, and waiting for completions is handled through `tokio::sync::oneshot` channels. The oneshot Receiver's take the place of the `img` and `records` fields inside `VectoredValueReconstructState`. When we're done visiting all the layers and submitting all the IOs along the way we concurrently `collect_pending_ios` for each value, which means for each value there is a future that awaits all the oneshot receivers and then calls into walredo to reconstruct the page image. Walredo is now invoked concurrently for each value instead of sequentially. Walredo itself remains unchanged. The spawned IO futures are driven to completion by a sidecar tokio task that is separate from the task that performs all the layer visiting and spawning of IOs. That tasks receives the IO futures via an unbounded mpsc channel and drives them to completion inside a `FuturedUnordered`. (The behavior from before this PR is available through `IoConcurrency::Sequential`, which awaits the IO futures in place, without "spawning" or "submitting" them anywhere.) #### Alternatives Explored A few words on the rationale behind having a sidecar *task* and what alternatives were considered. One option is to queue up all IO futures in a FuturesUnordered that is polled the first time when we `collect_pending_ios`. Firstly, the IO futures are opaque, compiler-generated futures that need to be polled at least once to submit their IO. "At least once" because tokio-epoll-uring may not be able to submit the IO to the kernel on first poll right away. Second, there are deadlocks if we don't drive the IO futures to completion independently of the spawning task. The reason is that both the IO futures and the spawning task may hold some _and_ try to acquire _more_ shared limited resources. For example, both spawning task and IO future may try to acquire * a VirtualFile file descriptor cache slot async mutex (observed during impl) * a tokio-epoll-uring submission slot (observed during impl) * a PageCache slot (currently this is not the case but we may move more code into the IO futures in the future) Another option is to spawn a short-lived `tokio::task` for each IO future. We implemented and benchmarked it during development, but found little throughput improvement and moderate mean & tail latency degradation. Concerns about pressure on the tokio scheduler made us discard this variant. The sidecar task could be obsoleted if the IOs were not arbitrary code but a well-defined struct. However, 1. the opaque futures approach taken in this PR allows leaving the existing code unchanged, which 2. allows us to implement the `IoConcurrency::Sequential` mode for feature-gating the change. Once the new mode sidecar task implementation is rolled out everywhere, and `::Sequential` removed, we can think about a descriptive submission & completion interface. The problems around deadlocks pointed out earlier will need to be solved then. For example, we could eliminate VirtualFile file descriptor cache and tokio-epoll-uring slots. The latter has been drafted in neondatabase/tokio-epoll-uring#63. See the lengthy doc comment on `spawn_io()` for more details. ### Error handling There are two error classes during reconstruct data retrieval: * traversal errors: index lookup, move to next layer, and the like * value read IO errors A traversal error fails the entire get_vectored request, as before this PR. A value read error only fails that value. In any case, we preserve the existing behavior that once `get_vectored` returns, all IOs are done. Panics and failing to poll `get_vectored` to completion will leave the IOs dangling, which is safe but shouldn't happen, and so, a rate-limited log statement will be emitted at warning level. There is a doc comment on `collect_pending_ios` giving more code-level details and rationale. ### Feature Gating The new behavior is opt-in via pageserver config. The `Sequential` mode is the default. The only significant change in `Sequential` mode compared to before this PR is the buffering of results in the `oneshot`s. ## Code-Level Changes Prep work: * Make `GateGuard` clonable. Core Feature: * Traversal code: track `will_init` in `BlobMeta` and source it from the Delta/Image/InMemory layer index, instead of determining `will_init` after we've read the value. This avoids having to read the value to determine whether traversal can stop. * Introduce `IoConcurrency` & its sidecar task. * `IoConcurrency` is the clonable handle. * It connects to the sidecar task via an `mpsc`. * Plumb through `IoConcurrency` from high level code to the individual layer implementations' `get_values_reconstruct_data`. We piggy-back on the `ValuesReconstructState` for this. * The sidecar task should be long-lived, so, `IoConcurrency` needs to be rooted up "high" in the call stack. * Roots as of this PR: * `page_service`: outside of pagestream loop * `create_image_layers`: when it is called * `basebackup`(only auxfiles + replorigin + SLRU segments) * Code with no roots that uses `IoConcurrency::sequential` * any `Timeline::get` call * `collect_keyspace` is a good example * follow-up: #10460 * `TimelineAdaptor` code used by the compaction simulator, unused in practive * `ingest_xlog_dbase_create` * Transform Delta/Image/InMemoryLayer to * do their values IO in a distinct `async {}` block * extend the residence of the Delta/Image layer until the IO is done * buffer their results in a `oneshot` channel instead of straight in `ValuesReconstructState` * the `oneshot` channel is wrapped in `OnDiskValueIo` / `OnDiskValueIoWaiter` types that aid in expressiveness and are used to keep track of in-flight IOs so we can print warnings if we leave them dangling. * Change `ValuesReconstructState` to hold the receiving end of the `oneshot` channel aka `OnDiskValueIoWaiter`. * Change `get_vectored_impl` to `collect_pending_ios` and issue walredo concurrently, in a `FuturesUnordered`. Testing / Benchmarking: * Support queue-depth in pagebench for manual benchmarkinng. * Add test suite support for setting concurrency mode ps config field via a) an env var and b) via NeonEnvBuilder. * Hacky helper to have sidecar-based IoConcurrency in tests. This will be cleaned up later. More benchmarking will happen post-merge in nightly benchmarks, plus in staging/pre-prod. Some intermediate helpers for manual benchmarking have been preserved in #10466 and will be landed in later PRs. (L0 layer stack generator!) Drive-By: * test suite actually didn't enable batching by default because `config.compatibility_neon_binpath` is always Truthy in our CI environment => https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309 * initial logical size calculation wasn't always polled to completion, which was surfaced through the added WARN logs emitted when dropping a `ValuesReconstructState` that still has inflight IOs. * remove the timing histograms `pageserver_getpage_get_reconstruct_data_seconds` and `pageserver_getpage_reconstruct_seconds` because with planning, value read IO, and walredo happening concurrently, one can no longer attribute latency to any one of them; we'll revisit this when Vlad's work on tracing/sampling through RequestContext lands. * remove code related to `get_cached_lsn()`. The logic around this has been dead at runtime for a long time, ever since the removal of the materialized page cache in #8105. ## Testing Unit tests use the sidecar task by default and run both modes in CI. Python regression tests and benchmarks also use the sidecar task by default. We'll test more in staging and possibly preprod. # Future Work Please refer to the parent epic for the full plan. The next step will be to fold the plumbing of IoConcurrency into RequestContext so that the function signatures get cleaned up. Once `Sequential` isn't used anymore, we can take the next big leap which is replacing the opaque IOs with structs that have well-defined semantics. --------- Co-authored-by: Christian Schwarz <[email protected]>
## Problem https://github.com/neondatabase/neon/actions/runs/12896686483/job/35961290336#step:5:107 showed that `promote-images-prod` was missing another dependency. ## Summary of changes Modify `promote-images-prod` to tag based on docker-hub images, so that `promote-images-prod` does not rely on `promote-images-dev`. The result should be the exact same, but allows the two jobs to run in parallel.
## Problem The trust connection to the compute required for `pg_anon` was removed. However, the PGPASSWORD environment variable was not added to `docker-compose.yml`. This caused connection errors, which were interpreted as success due to errors in the bash script. ## Summary of changes The environment variable was added, and the logic in the bash script was fixed.
## Problem PR #10457 was supposed to fix the flakiness of `test_scrubber_physical_gc_ancestors`, but instead it made it even more flaky. However, the original error causes disappeared, now to be replaced by key not found errors. See this for a longer explanation: #10391 (comment) ## Solution This does one churn rows after all compactions, and before we do any timeline gc's. That way, we remain more accessible at older lsn's.
## Problem Not really a bug fix, but hopefully can reproduce #10482 more. If the layer map does not contain layers that end at exactly the end range of the compaction job, the current split algorithm will produce the last job that ends at the maximum layer key. This patch extends it all the way to the compaction job end key. For example, the user requests a compaction of 0000...FFFF. However, we only have a layer 0000..3000 in the layer map, and the split job will have a range of 0000..3000 instead of 0000..FFFF. This is not a correctness issue but it would be better to fix it so that we can get consistent job splits. ## Summary of changes Compaction job split will always cover the full specified key range. Signed-off-by: Alex Chi Z <[email protected]>
vipvap
requested review from
knizhnik,
ololobus and
arssher
and removed request for
a team
January 23, 2025 06:02
7370 tests run: 6985 passed, 0 failed, 385 skipped (full report)Flaky tests (5)Postgres 17
Postgres 16
Postgres 15
Postgres 14
Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
a38f736 at 2025-01-23T07:02:55.172Z :recycle: |
conradludgate
requested review from
conradludgate
and removed request for
a team,
knizhnik,
ololobus and
arssher
January 23, 2025 10:53
No changes. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Proxy release 2025-01-23
Please merge this Pull Request using 'Create a merge commit' button