rework background task initialization #5962

davepacheco · 2024-06-27T04:55:56Z

The goal here is to initialize enough of the background task subsystem to stuff it into Nexus before actually starting the tasks. That allows the tasks themselves to receive an `Arc<Nexus>`, which in turn allows them to do things like kick off sagas directly and do saga recovery.
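As a rough illustration of the two-phase idea described above, here is a minimal sketch using only the standard library. The `Nexus`, `Driver`, and `Task` types and the function names are hypothetical stand-ins, not the real Nexus types; the only detail borrowed from this PR is that the driver lives in a set-once slot named `background_tasks_driver`, assumed here to be a `std::sync::OnceLock`.

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical task type: a closure that reports what it did.
type Task = Box<dyn Fn() -> String + Send + Sync>;

// Hypothetical stand-in for the background task driver.
struct Driver {
    tasks: Vec<Task>,
}

// Hypothetical stand-in for Nexus. The driver slot starts out empty so that
// Nexus can be constructed before any task exists.
struct Nexus {
    id: &'static str,
    background_tasks_driver: OnceLock<Driver>,
}

// Phase one: build Nexus (and wrap it in an Arc) without starting any tasks.
fn new_nexus(id: &'static str) -> Arc<Nexus> {
    Arc::new(Nexus { id, background_tasks_driver: OnceLock::new() })
}

// Phase two: create the tasks, handing each one its own Arc<Nexus>, then
// install the driver. Returns false if a driver was already installed.
// Note: the task's Arc<Nexus> forms a reference cycle with the driver stored
// in Nexus; this sketch accepts that and never tears Nexus down.
fn start_background_tasks(nexus: &Arc<Nexus>) -> bool {
    let task_nexus = Arc::clone(nexus);
    let task: Task =
        Box::new(move || format!("saga recovery on {}", task_nexus.id));
    nexus.background_tasks_driver.set(Driver { tasks: vec![task] }).is_ok()
}

// Run every installed task once; before phase two there is nothing to run.
fn run_all(nexus: &Nexus) -> Vec<String> {
    match nexus.background_tasks_driver.get() {
        Some(driver) => driver.tasks.iter().map(|task| task()).collect(),
        None => Vec::new(),
    }
}

fn main() {
    let nexus = new_nexus("nexus-0");
    assert!(run_all(&nexus).is_empty());
    assert!(start_background_tasks(&nexus));
    println!("{:?}", run_all(&nexus));
}
```

The point of the split is ordering: the `Arc<Nexus>` has to exist before a task can capture it, so the slot for the driver is created empty and filled in afterwards.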

There are two main motivations:

1. The current "saga recovery" process could be folded into the background task subsystem. This would be nice because, to implement #5136 (nexus expungement: assign an old Nexus instance's sagas to another Nexus instance), we'll want to re-run saga recovery after Nexus may have been up for a while. The background task subsystem would let us just re-activate the task without otherwise having to worry about stacked activations, etc. But in order to do this, the background task itself needs access to the `Arc<Nexus>` to construct a `SagaContext` in order to resume sagas. (This is the same problem as allowing background tasks to run sagas directly. We solved that differently using a channel. We could do something similar here, but the solution in this PR is much more general; see below.)
2. #5136 (nexus expungement: assign an old Nexus instance's sagas to another Nexus instance) will also require that we activate that new saga recovery background task from the blueprint executor, which is currently not easy to do conditionally.

More generally, this approach should make it possible for:

- background tasks to activate other background tasks
- background tasks to directly use other Nexus subsystems (like sagas) that themselves may want to activate background tasks

This design came out of a discussion Friday with several folks about how to better structure Nexus.

I hope to do some follow-on work in subsequent PRs: seeing if we can remove that saga channel (that would prove that this does what we think it should do), seeing if we can remove the special v2p channel, eventually moving saga recovery into a proper background task, and then moving on to #5136.


hawkw

This looks great! I left a lot of suggestions, but they're all quite minor.


hawkw · 2024-06-28T19:09:50Z

nexus/src/app/background/init.rs

+    task_internal_dns_propagation: Activator,
+    task_external_dns_propagation: Activator,
+
+    // Data exposed by various background tasks to the rest of Nexus


looks like a typo:

Suggested change

-    // Data exposed by various background tasks to the rest of Nexus
+    /// Data exposed by various background tasks to the rest of Nexus

davepacheco · 2024-06-28

Heh, it's not: there are three categories of fields in this struct, and non-doc comments describe each section. Right now there's only one thing in this bucket (but I expect we'll have more).


hawkw · 2024-06-28T19:11:16Z

nexus/src/app/background/init.rs

+
+            // The following fields can be safely ignored here because they're
+            // already wired up as needed.
+            external_endpoints: _external_endpoints,


nit, take it or leave it: i think this can just be an _ since we don't need the binding to live until the function returns?

Suggested change

-            external_endpoints: _external_endpoints,
+            external_endpoints: _,
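To make the difference between `_` and `_external_endpoints` concrete, here is a small sketch. The `Tasks` struct and both functions are hypothetical and much smaller than the real struct; the behavior shown (a bare `_` never binds, so it never moves the field) is standard Rust pattern semantics.

```rust
// Hypothetical two-field stand-in for the struct being destructured.
struct Tasks {
    task_name: String,
    external_endpoints: Vec<u32>,
}

// Listing every field (no `..`) means that adding a field to `Tasks` breaks
// this pattern until someone decides how the new field is handled.
fn name_len(tasks: &Tasks) -> usize {
    let Tasks { task_name, external_endpoints: _ } = tasks;
    task_name.len()
}

// `_` is a wildcard, not a variable: it never binds, so it never moves.
// Writing `external_endpoints: _unused` here would move the vector out and
// make the field access on the next line a compile error.
fn split(tasks: Tasks) -> (String, Vec<u32>) {
    let Tasks { task_name, external_endpoints: _ } = tasks;
    let endpoints = tasks.external_endpoints;
    (task_name, endpoints)
}

fn main() {
    let tasks = Tasks {
        task_name: "dns_config".to_string(),
        external_endpoints: vec![1, 2, 3],
    };
    println!("{}", name_len(&tasks));
    let (name, endpoints) = split(tasks);
    println!("{name} {endpoints:?}");
}
```

Either spelling keeps the exhaustiveness check; `_` just avoids creating a binding that nothing reads.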

hawkw · 2024-06-28T19:17:01Z

nexus/src/app/background/init.rs

+    task_config: &Activator,
+    task_servers: &Activator,
+    task_propagation: &Activator,


I'm always going to call out cases where we take multiple positional arguments of the same type, since it's easy to swap them by accident. Since this isn't an API that's called externally to this module, it's not as big a concern: this is only called once, and the arguments are passed in the right order, and we don't really expect new calls to this function will be added. Adding an args struct so that we can name these arguments just for one call seems kinda like overkill.

However, I can think of a pretty simple way to change this so we don't have to pass these as positionals that can be swapped, and also make the clippy::too_many_arguments go away: why not just change this function to a method on &BackgroundTasks (or pass a &BackgroundTasks argument if we don't want to re-indent the entire function in order to make it a method?). That way, this code can refer to the various activators by name, and the caller can't mix them up. I don't think this function also being able to touch the other activators is a problem, since it can just...not do that.

davepacheco · 2024-06-28

The problem is that this function is called twice with two different sets of activators (internal DNS and then external DNS). The first time, the caller uses `BackgroundTasks.task_internal_dns_config`, while the second time it uses `BackgroundTasks.task_external_dns_config`. Same for the other two `Activator` arguments. So I think these have to be parameters.
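For reference, the "args struct" alternative mentioned in this thread (and set aside as overkill for a single internal caller) could look roughly like the sketch below. Everything here is hypothetical: `DnsActivators`, `init_dns`, and the label-carrying `Activator` are illustrative names, and only the three field names mirror the parameters in the diff.

```rust
// Hypothetical stand-in for the real `Activator`; it only carries a label.
struct Activator(&'static str);

// Hypothetical args struct: each activator is passed by field name, so two
// of them cannot be swapped silently the way positional arguments can.
struct DnsActivators<'a> {
    task_config: &'a Activator,
    task_servers: &'a Activator,
    task_propagation: &'a Activator,
}

// Hypothetical init function; it reports which activator landed in which role.
fn init_dns(group: &str, a: DnsActivators<'_>) -> String {
    format!(
        "{group}: config={} servers={} propagation={}",
        a.task_config.0, a.task_servers.0, a.task_propagation.0
    )
}

fn main() {
    let int_config = Activator("int-config");
    let int_servers = Activator("int-servers");
    let int_propagation = Activator("int-propagation");
    let ext_config = Activator("ext-config");
    let ext_servers = Activator("ext-servers");
    let ext_propagation = Activator("ext-propagation");

    // Same function, called once per DNS group with a different set.
    let internal = init_dns(
        "internal",
        DnsActivators {
            task_config: &int_config,
            task_servers: &int_servers,
            task_propagation: &int_propagation,
        },
    );
    // Field order at the call site does not matter.
    let external = init_dns(
        "external",
        DnsActivators {
            task_propagation: &ext_propagation,
            task_servers: &ext_servers,
            task_config: &ext_config,
        },
    );
    println!("{internal}\n{external}");
}
```

This keeps the activators as parameters, which the two call sites require, while still naming them at each call.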

hawkw · 2024-06-28T19:19:34Z

nexus/src/app/mod.rs

+                    if let Err(_) =
+                        task_nexus.background_tasks_driver.set(driver)
+                    {
+                        panic!(
+                            "concurrent initialization of \
+                             background_tasks_driver?!"
+                        )
                    }


i was briefly unsure about why this doesn't just use expect(...) but i realized that it requires the error value to be Debug, which i assume this isn't...
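A small sketch of the constraint guessed at above, assuming `background_tasks_driver` is a `std::sync::OnceLock` and that the driver type does not implement `Debug` (both are assumptions; the `Driver` below is a hypothetical stand-in). `OnceLock::set` hands the rejected value back as the error, and `Result::expect` requires that error type to be `Debug`.

```rust
use std::sync::OnceLock;

// Hypothetical driver type that deliberately does not derive Debug.
struct Driver {
    task_count: usize,
}

// `OnceLock::set` returns `Result<(), T>`: on failure the error is the
// rejected value itself. `Result::expect` needs `E: Debug`, so calling it
// directly on that result would not compile for this `Driver`. Mapping the
// error to a `&'static str` (which is Debug) sidesteps the bound.
fn install(
    slot: &OnceLock<Driver>,
    driver: Driver,
) -> Result<(), &'static str> {
    slot.set(driver).map_err(|_rejected| {
        "concurrent initialization of background_tasks_driver?!"
    })
}

fn main() {
    let slot = OnceLock::new();
    install(&slot, Driver { task_count: 3 }).expect("first install succeeds");
    // A second install is rejected and the original value is kept.
    assert!(install(&slot, Driver { task_count: 9 }).is_err());
    println!("{}", slot.get().map_or(0, |d| d.task_count));
}
```

The `if let Err(_) = ... { panic!(...) }` form in the diff reaches the same result without touching the error value at all.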

hawkw · 2024-06-28T19:23:15Z

nexus/src/app/background/init.rs

-        producer_registry: &ProducerRegistry,
-    ) -> BackgroundTasks {
-        let mut driver = Driver::new();
+        v2p_watcher: (watch::Sender<()>, watch::Receiver<()>),


It seems like the changes in this PR should, hopefully, allow us to get rid of this (yay!). I think there's kind of a tradeoff between doing that in this PR, which gives us a nice worked example of how to use the new interface but also makes this diff touch a bunch more files, and doing it in a follow-up, which results in a much smaller diff. I don't have a particularly strong opinion here, but wanted to ask about your thoughts.

We discussed this one offline as well, and right now, my preference is probably to get rid of it separately in the interest of merging this PR sooner and reducing merge conflict opportunities, but 🤷‍♀️ it's probably not a huge deal either way.

I do plan to look at this in a follow-up.

Currently, a `tokio::sync::watch` channel is explicitly passed around to allow both Nexus and other background tasks to notify the `v2p_manager` background task. This must be passed explicitly through most of Nexus, which is a bit awkward.

The new `background::Activator` API introduced in PR #5962 makes it easier to activate background tasks. Now, we have an `Activator` type which represents the `tokio::sync::Notify` used to wake the background task, and this can be passed into the various functions which must activate it. Because `Activator`s are constructed _before_ background tasks are started, the `Activator` can be passed to the background tasks that must activate the `v2p_manager` task (in this case, the `instance_watcher` task).

This branch removes the `tokio::sync::watch` channel used to activate the `v2p_manager`, and replaces it with a use of the `Activator` API. Because the `Activator` type is not currently `Clone`, I refactored it slightly so that both the wired-up flag _and_ the `Notify` live within an `Arc`, allowing a clone of the `Activator` for the `v2p_manager` task to be passed into the `instance_watcher` task when it's constructed. I don't think this really introduces any new opportunities for accidental `Activator` misuse, as the assertion that an activator is not wired up twice still stands.
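The shape of that refactor (wired-up flag and wakeup primitive sharing one `Arc`, so that clones refer to the same activator) can be sketched as follows. This is not the real `Activator`: to stay self-contained it swaps `tokio::sync::Notify` for a standard-library `Mutex<bool>` plus `Condvar`, returns an error where the real code is described as asserting, and all method names are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Shared state: both the wired-up flag and the wakeup primitive live behind
// one Arc, so every clone sees the same flag and wakes the same waiter.
struct Inner {
    wired_up: AtomicBool,
    pending: Mutex<bool>,
    wakeup: Condvar,
}

#[derive(Clone)]
struct Activator(Arc<Inner>);

impl Activator {
    fn new() -> Self {
        Activator(Arc::new(Inner {
            wired_up: AtomicBool::new(false),
            pending: Mutex::new(false),
            wakeup: Condvar::new(),
        }))
    }

    // Succeeds only for the first caller across all clones.
    fn mark_wired_up(&self) -> Result<(), &'static str> {
        self.0
            .wired_up
            .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
            .map(|_| ())
            .map_err(|_| "activator wired up more than once")
    }

    // Record an activation request and wake the waiting task, if any.
    fn activate(&self) {
        *self.0.pending.lock().unwrap() = true;
        self.0.wakeup.notify_one();
    }

    // Block until an activation is pending, then consume it.
    fn wait_for_activation(&self) {
        let mut pending = self.0.pending.lock().unwrap();
        while !*pending {
            pending = self.0.wakeup.wait(pending).unwrap();
        }
        *pending = false;
    }
}

fn main() {
    let v2p_manager = Activator::new();
    let for_instance_watcher = v2p_manager.clone();
    v2p_manager.mark_wired_up().expect("first wiring succeeds");
    // The clone shares the flag, so wiring it up again is rejected.
    assert!(for_instance_watcher.mark_wired_up().is_err());
    let watcher = thread::spawn(move || for_instance_watcher.activate());
    v2p_manager.wait_for_activation();
    watcher.join().unwrap();
}
```

Because the flag is shared rather than copied, making the type `Clone` does not give each clone its own chance to be wired up.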

andrewjstone · 2024-07-01T21:48:56Z

nexus/src/app/background/init.rs

@@ -2,7 +2,90 @@
 // License, v. 2.0. If a copy of the MPL was not distributed with this
 // file, You can obtain one at https://mozilla.org/MPL/2.0/.

-//! Specific background task initialization
+//! Initialize Nexus background tasks


This comment is excellent @davepacheco! Thanks for writing it up so thoroughly.


davepacheco added 30 commits June 21, 2024 15:19

phase one: move specific task implementations into submodule

fc933eb

phase two: rework imports

79036c4

reorganize the top-level code a bit

7472be7

rustfmt

c259409

wrap too-long strings in background task subsystem

2303eea

task name consistency; avoid using driver.activate() directly

ee8bac5

BackgroundTasks: initial cleanup

87d6d8b

TaskHandle -> TaskName

257f18d

Merge branch 'main' into dap/bgtask-init

df3c5f2

Merge branch 'dap/bgtask-init' into dap/bgtask-cleanup

db53953

Merge branch 'dap/bgtask-cleanup' into dap/bgtask-cleanup-at-large

4b0546b

Merge commit '6e29409c04fb701dd6a9abfafe76c80a4b07f7d2' into dap/bgtask-cleanup

23ad49e

Merge commit '49f6e01001462fe31c09c83ce423dc73cc3cc1ce' into dap/bgtask-cleanup

86bd174

Merge branch 'dap/bgtask-cleanup' into dap/bgtask-cleanup-at-large

56222ab

WIP: reworking initialization (still lots of stuff to fix up)

0449bf4

WIP: flesh it out more

4f058f3

Merge remote-tracking branch 'origin/dap/bgtask-cleanup-at-large' into dap/bgtask-init-rework

e4cd06a

Merge commit 'd9e638a791edfd199fd896df45f73d5edf2e6d87' into dap/bgtask-init-rework

143c5d6

Merge commit 'ff0c914753022a057d6443540ec3641b35654461' into dap/bgtask-init-rework

af4475e

Merge commit 'd52aad08c6b77c7dc9dd2e0f3050e4b73faf20fb' into dap/bgtask-init-rework

9c47188

Merge commit 'b3a1a72951bd92f6fcfb04b21766a4535f0e1ddb' into dap/bgtask-init-rework

bae3fad

Merge commit '931e2d457c7bf9ad40e4475d78db1aa81938ea70' into dap/bgtask-init-rework

e004240

Merge remote-tracking branch 'origin/main' into dap/bgtask-init-rework

3a7f5c5

use OnceLock; defer the arguments to start()

87160da

fix up ExternalEndpoints test code

85eff7a

make Activator non-optional

a8e9629

flesh out some docs

e35fbde

doc comment editing

1664e0d

move Activator; continue editing docs

aefe765

fixups

f991eec

davepacheco added 4 commits June 26, 2024 22:10

doc fixup

92ee2cd

exercise Activator in tests

20f7c41

that should be a 503

0883358

fix omdb test output

249ee59

davepacheco marked this pull request as ready for review June 27, 2024 17:22

davepacheco commented Jun 27, 2024


davepacheco requested review from jgallagher and hawkw June 27, 2024 17:22

davepacheco mentioned this pull request Jun 27, 2024

remove need for saga channel for background tasks #5964

Merged

hawkw approved these changes Jun 28, 2024


hawkw reviewed Jun 28, 2024


review feedback

1dcf10f

davepacheco enabled auto-merge (squash) June 28, 2024 21:02

davepacheco mentioned this pull request Jun 28, 2024

Nexus busted after its initial startup raced with another Nexus populating "system" VpcRouter #5980

Open

start background tasks even if populate fails (see 5980)

514e2b2

davepacheco merged commit f275827 into main Jun 29, 2024
19 checks passed

davepacheco deleted the dap/bgtask-init-rework branch June 29, 2024 01:13

hawkw mentioned this pull request Jul 1, 2024

[nexus] Remove v2p_notification_tx #5983

Merged

davepacheco mentioned this pull request Jul 1, 2024

Driver::register() takes too many arguments #5985

Merged

andrewjstone reviewed Jul 1, 2024


davepacheco mentioned this pull request Jul 12, 2024

move saga recovery to a background task #6063

Merged
