
move saga recovery to a background task #6063

Merged (9 commits) on Jul 16, 2024
Conversation

davepacheco (Collaborator)

tl;dr: This PR moves the execution of saga recovery into a background task and also fixes the implementation so that it can be safely re-run at any time.

#5136 requires that sagas assigned to an expunged Nexus be re-assigned to another Nexus. In principle, this is easy: since we know the first Nexus is dead, we should be able to just update the database for each in-progress saga assigned to that Nexus to say that it's being run by some other Nexus. The regular saga recovery path should take care of the rest.

There are a few problems here:

  • Saga recovery only runs once, during Nexus startup. We'd need a way to re-run it later in Nexus's lifetime.
  • The saga recovery implementation implicitly assumes that this Nexus process will not start any new sagas until recovery completes. But if Nexus is up and running for a while and then we kick off saga recovery again, it's quite possible that the saga recovery process will find sagas that this process has started. Then it will try to recover them, even though they're already running. Not good. (I believe this is already possible, just very unlikely.)
  • Related: if Nexus fails to recover a saga, it will never try again (#5947: saga recovery stops if Nexus fails to recover a saga).

This turned out to be pretty much a rewrite of the existing saga recovery code. (But keep in mind that Steno handles all the details of recovery -- all this stuff was doing was loading state from the database and handing it to Steno.) There's now a bunch more code, but that's because it's factored into smaller pieces that are much easier to observe and test (and there are a lot of tests). With the improved observability, the background task exposes detailed status about sagas recovered, sagas it failed to recover, and other internal state. I went into this with the mindset of: assume this winds up broken in production -- what information will I wish I had to debug it?

The bulk of the changes are in:

  • nexus/saga-recovery is a new Nexus-level crate containing all the bookkeeping, plus lots of tests. This doesn't do anything with the database. It just keeps track of which sagas we've seen, which need to be recovered, etc. and reports on what happened in each saga recovery pass and what's happened since startup.
  • nexus/src/app/background/tasks/saga_recovery.rs is the new background task. It leans on nexus-saga-recovery for the bookkeeping and just adds the bits that read from the database and shove the saga into Steno. There's a smattering of other changes needed for any new background task: tiny additions to config files, omdb support, omdb output, etc.

Fixes #5947.

@davepacheco (Collaborator, Author)

Feel free to ignore this comment. I just wanted to durably record these notes somewhere in case we have to revisit this.

Design notes

After #5962 it's pretty straightforward to move the saga recovery code into a background task. The complexity here is really dealing with the fact that it might run more than once and that it might find sagas that this Nexus process is already running. I wrestled with a bunch of approaches to this problem.

Here's the approach I landed on:

  • The background task owns a set of saga ids that it knows about. Loosely speaking, these are sagas that are running.
    • Sagas are added to this set in one of two ways:
      • In the code path that creates new sagas, before creating the saga, it sends a message over a channel to the saga recovery task. During activation, the saga recovery task reads these messages and adds these newly-created sagas to its set.
      • In the background task, after recovering a saga, the saga's id is added to the set.
    • Loosely speaking, sagas are removed from this set when the database stops reporting them as in-progress. (It's a little more complicated than this because it's possible that a saga just started and hasn't made it to the database yet. But this is handleable.)
  • The background task always lists all in-progress sagas that are assigned to this Nexus and recovers anything that's not in the set. This includes sagas from a previous Nexus lifetime (the traditional "saga recovery" case) as well as any sagas that have been newly assigned to this Nexus, which will start happening after #5136 (Nexus expungement: assign an old Nexus instance's sagas to another Nexus instance).

This is trickier than it sounds because of the race between checking the database and learning about sagas created by Nexus. But with some care, I believe this is correct and well-tested and it involves no changes to the database.
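A minimal sketch of this bookkeeping, with hypothetical names (the real code lives in nexus-saga-recovery and uses steno's SagaId type, not a bare integer): the saga-creation path sends new ids over a channel; each recovery pass drains the channel first, then recovers anything the database reports that isn't in the set.

```rust
use std::collections::HashSet;
use std::sync::mpsc;

// Hypothetical stand-in for steno's SagaId; the real code uses UUIDs.
type SagaId = u64;

/// Tracks the sagas this Nexus process already knows about, so a
/// recovery pass can skip them. Names here are illustrative.
struct RecoveryBookkeeping {
    known: HashSet<SagaId>,
    rx: mpsc::Receiver<SagaId>,
}

impl RecoveryBookkeeping {
    fn new(rx: mpsc::Receiver<SagaId>) -> Self {
        RecoveryBookkeeping { known: HashSet::new(), rx }
    }

    /// One recovery pass: first drain start notifications from the
    /// saga-creation path, then return whatever's in the database
    /// listing that we don't already know about.
    fn plan_pass(&mut self, db_in_progress: &[SagaId]) -> Vec<SagaId> {
        while let Ok(id) = self.rx.try_recv() {
            self.known.insert(id);
        }
        let mut to_recover = Vec::new();
        for &id in db_in_progress {
            // insert() returns true only if the id was not already known;
            // a recovered saga is immediately treated as running here.
            if self.known.insert(id) {
                to_recover.push(id);
            }
        }
        to_recover
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let mut bk = RecoveryBookkeeping::new(rx);

    // Saga 1 was started by this process; saga 2 is left over from a
    // previous lifetime and shows up only in the database listing.
    tx.send(1).unwrap();
    let to_recover = bk.plan_pass(&[1, 2]);
    assert_eq!(to_recover, vec![2]);

    // A second pass finds nothing new to recover.
    assert!(bk.plan_pass(&[1, 2]).is_empty());
    println!("pass 1 recovered: {:?}", to_recover);
}
```

The key ordering property is that a saga's id is sent on the channel before the saga is created in the database, so by the time a database listing can contain it, the recovery task can learn about it by draining the channel.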

Alternative: ignore the problem (let Steno deal with it)

We could just completely ignore this problem and let Steno bomb out if we try to recover a saga with the same id as one that's already running. Right now, I think that would work. But it has a few problems:

  • It relies on Steno keeping track of every saga it has ever run in this process. At some point we'll want to prune "done" sagas. At that point, this mechanism would become unsafe because we could start a saga, see it during recovery, then it finishes, then we ask Steno about it, and it's gone, so we go ahead and recover it again. That's really bad. (This will probably fail anyway because of the unique index on the saga table...until we start implementing pruning for that, too. Anyway, this is clearly several levels of "working by accident".)
  • Frankly, I also don't have as much confidence as I'd like in Steno handling this case well. It wasn't intended that you'd go through 90% of the work of starting a saga and then not start it, and I'm not sure whether correct behavior here relies on cancellation safety or whether that code handles it correctly.
  • Even if it definitely worked, we'd be doing a lot of wasted work: every time the task is activated, we'd load the entire saga log for all in-progress sagas, only to throw most of them away when we realize they're duplicates.

We could improve the first problem by implementing purging in Steno and making the saga recovery task responsible for it. That is: once the saga recovery task confirms that a saga is done, it purges it from Steno. Then maybe it's safe to rely on Steno bombing out if we give it a duplicate id. We still have the other two problems and the second of those means that even in the best case, this solution isn't great.

Alternative: store SEC generation in the database

I went down this path:

  • Every time Nexus starts up, generate a unique SEC generation. It could be a millisecond timestamp or a uuid.
  • Add the SEC generation to each saga. When the saga is created, it gets Nexus's current SEC generation.
  • During recovery, instead of listing all in-progress sagas and then ignoring the ones we already know about, just list the ones from an SEC generation different from ours. Update the SEC generation to ours when we recover the saga.
  • For #5136 (Nexus expungement: assign an old Nexus instance's sagas to another Nexus instance), we'd set the SEC generation of the expunged Nexus's sagas to some invalid value so that they're found by the new Nexus's recovery path.
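A rough sketch of the generation-based filtering (hypothetical types; in practice the filter and re-stamping would be a database query, not an in-memory scan):

```rust
// Hypothetical sketch of the SEC-generation alternative. Each Nexus
// lifetime gets a fresh generation; recovery considers only in-progress
// sagas whose generation differs from ours, re-stamping each one so a
// later pass won't recover it again.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct SecGeneration(u128); // e.g. a UUID or millisecond timestamp

struct SagaRow {
    id: u64,
    generation: SecGeneration,
    in_progress: bool,
}

fn recover_pass(ours: SecGeneration, table: &mut [SagaRow]) -> Vec<u64> {
    let mut recovered = Vec::new();
    for row in table.iter_mut() {
        if row.in_progress && row.generation != ours {
            recovered.push(row.id);
            // Re-stamp with our generation so this saga is never listed
            // by a subsequent recovery pass of ours.
            row.generation = ours;
        }
    }
    recovered
}

fn main() {
    let old = SecGeneration(1);
    let ours = SecGeneration(2);
    let mut table = vec![
        SagaRow { id: 1, generation: old, in_progress: true },  // previous lifetime
        SagaRow { id: 2, generation: ours, in_progress: true }, // started by us
        SagaRow { id: 3, generation: old, in_progress: false }, // already done
    ];
    assert_eq!(recover_pass(ours, &mut table), vec![1]);
    // A repeated pass finds nothing: saga 1 was re-stamped.
    assert!(recover_pass(ours, &mut table).is_empty());
    println!("ok");
}
```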

The big "pro" here is that Nexus only ever lists the sagas it needs to recover, instead of listing all in-progress sagas. This would be a big benefit if we got to the point that the number of in-progress sagas was really large. It's also pretty clean -- it's hard to see this approach accidentally recovering a saga twice. But it's kind of tricky to test.

This has a few cons: storing stuff in the database means we have a schema change, we're storing more data, and the information is replicated 5x. The volume of data is small enough that it probably doesn't matter (but that also undermines the upside). It bothered me about this approach that the problem feels totally internal to Nexus -- the database shouldn't need to know about this. Also, the fact that we're not listing all in-progress sagas means that if we make a mistake somehow (but still update the SEC generation), we'll never figure that out.

Looking back, it feels like kind of a wash.

Alternative: Steno owns more of the recovery process

This was kind of brought up last year under oxidecomputer/steno#156. You could imagine:

  • The Steno SecStore trait grows methods for reading data: list in-progress sagas (including those that might be running now) and read a particular saga's log.
  • The Steno SEC grows a command to carry out a recovery pass. This would use the SecStore methods to do basically what I did in this PR: list in-progress sagas, compare them to what we know about, and for each one that we don't know about, load the log and resume the saga.
  • We'd still use a background task in Nexus to carry out recovery passes but all the work would be in Steno.
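To make this concrete, here's one way the read-side SecStore methods and an SEC-driven recovery pass might look. This is purely illustrative: these method names don't exist in steno today, and the real methods would be async and fallible.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical read-side extension to steno's SecStore, per the sketch
// above. Signatures are simplified (synchronous, infallible).
trait SecStoreRead {
    /// List sagas the store believes are in progress for this SEC,
    /// including ones that may be running right now.
    fn list_in_progress(&self) -> Vec<u64>;
    /// Load the persisted log for one saga so the SEC can resume it.
    fn load_log(&self, id: u64) -> Option<Vec<String>>;
}

struct InMemoryStore {
    logs: HashMap<u64, Vec<String>>,
}

impl SecStoreRead for InMemoryStore {
    fn list_in_progress(&self) -> Vec<u64> {
        let mut ids: Vec<u64> = self.logs.keys().copied().collect();
        ids.sort();
        ids
    }
    fn load_log(&self, id: u64) -> Option<Vec<String>> {
        self.logs.get(&id).cloned()
    }
}

/// What an SEC-driven recovery pass might do: compare the store's
/// listing against what's already running, load logs for the rest.
fn recovery_pass<S: SecStoreRead>(store: &S, running: &mut HashSet<u64>) -> usize {
    let mut recovered = 0;
    for id in store.list_in_progress() {
        if running.contains(&id) {
            continue;
        }
        if let Some(_log) = store.load_log(id) {
            // (resuming the saga from its log is elided)
            running.insert(id);
            recovered += 1;
        }
    }
    recovered
}

fn main() {
    let mut logs = HashMap::new();
    logs.insert(1, vec!["started".to_string()]);
    logs.insert(2, vec!["started".to_string()]);
    let store = InMemoryStore { logs };
    // Saga 1 is already running in this SEC; only saga 2 needs recovery.
    let mut running: HashSet<u64> = [1].into_iter().collect();
    assert_eq!(recovery_pass(&store, &mut running), 1);
    assert_eq!(recovery_pass(&store, &mut running), 0);
    println!("ok");
}
```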

There are more details to work out here, like where the SagaContext for these sagas would come from.

Theoretically, this could be cleaner because some of the complexity in this PR comes from the fact that we're dispatching and recovering sagas in separate contexts and have to synchronize, whereas in principle the Steno SEC is already a single context and the source of truth for what's running. But it may not be easier in practice because the SEC is still concurrent: it can be in the middle of storing the record for a new saga at the same time it's driving a recovery pass -- might it still have the same possibility for ambiguity that we have in Nexus?

There's also more overhead and boilerplate in changing the Steno SEC API.


I think a few of these others are workable and have some advantages. But all of them are a lot of work, including the one I implemented, so my strong inclination is to go with what's in this PR unless there's some major problem I've missed.

@davepacheco (Collaborator, Author)

Here's some example omdb output for the background task. Right after Nexus starts up, having recovered one saga:

$ OMDB_NEXUS_URL=http://[::1]:12223 cargo run --bin=omdb -- nexus background-tasks show
...
task: "saga_recovery"
  configured period: every 10m
  currently executing: no
  last completed activation: iter 1, triggered by a periodic timer firing
    started at 2024-07-12T03:25:40.271Z (103s ago) and ran for 85ms
    since Nexus started:
        sagas recovered:           1
        sagas recovery errors:     0
        sagas observed started:    0
        sagas inferred finished:   0
        missing from SEC:          0
        bad state in SEC:          0
    last pass:
        found sagas:   1 (in-progress, assigned to this Nexus)
        recovered:     1 (successfully)
        failed:        0
        skipped:       0 (already running)
        removed:       0 (newly finished)
    recently recovered sagas (1):
        TIME                 SAGA_ID                              
        2024-07-12T03:25:40Z 9ebb7ce7-fd16-446c-90eb-0b95a699ed0d 
    no saga recovery failures
...

I activated the task manually with omdb and checked status again:

task: "saga_recovery"
  configured period: every 10m
  currently executing: no
  last completed activation: iter 2, triggered by an explicit signal
    started at 2024-07-12T03:28:14.706Z (5s ago) and ran for 23ms
    since Nexus started:
        sagas recovered:           1
        sagas recovery errors:     0
        sagas observed started:    0
        sagas inferred finished:   0
        missing from SEC:          0
        bad state in SEC:          0
    last pass:
        found sagas:   0 (in-progress, assigned to this Nexus)
        recovered:     0 (successfully)
        failed:        0
        skipped:       0 (already running)
        removed:       0 (newly finished)
    recently recovered sagas (1):
        TIME                 SAGA_ID                              
        2024-07-12T03:25:40Z 9ebb7ce7-fd16-446c-90eb-0b95a699ed0d 
    no saga recovery failures

Here's after seeing one saga start and another finish:

task: "saga_recovery"
  configured period: every 10m
  currently executing: no
  last completed activation: iter 3, triggered by an explicit signal
    started at 2024-07-12T03:29:33.299Z (2s ago) and ran for 24ms
    since Nexus started:
        sagas recovered:           1
        sagas recovery errors:     0
        sagas observed started:    1
        sagas inferred finished:   1
        missing from SEC:          0
        bad state in SEC:          1
    last pass:
        found sagas:   0 (in-progress, assigned to this Nexus)
        recovered:     0 (successfully)
        failed:        0
        skipped:       0 (already running)
        removed:       0 (newly finished)
    recently recovered sagas (1):
        TIME                 SAGA_ID                              
        2024-07-12T03:28:55Z 8f982cf3-dbbb-4dd3-bd63-4af853e178c7 
    no saga recovery failures

(the "bad state in SEC" error is a bug in the counter that I have since fixed)

@davepacheco davepacheco requested a review from hawkw July 12, 2024 04:55
@davepacheco davepacheco marked this pull request as ready for review July 12, 2024 04:55
@davepacheco (Collaborator, Author)

Manual testing

I wanted to try some manual stress testing that seemed hard to do in the existing test suite. For both of these I modified omdb to add a project-blast command that creates projects using the external API in a tight loop with a specified level of concurrency. Diffs:

$ git diff
diff --git a/Cargo.lock b/Cargo.lock
index 17b46a244..15770521f 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -5650,6 +5650,7 @@ dependencies = [
  "gateway-client",
  "gateway-messages",
  "gateway-test-utils",
+ "http 0.2.12",
  "humantime",
  "indicatif",
  "internal-dns",
@@ -5670,13 +5671,17 @@ dependencies = [
  "omicron-test-utils",
  "omicron-uuid-kinds",
  "omicron-workspace-hack",
+ "oxide-client",
  "oximeter-client",
  "oximeter-db",
  "pq-sys",
  "ratatui",
  "reedline",
+ "reqwest",
  "serde",
  "serde_json",
+ "signal-hook",
+ "signal-hook-tokio",
  "sled-agent-client",
  "slog",
  "slog-error-chain",
diff --git a/dev-tools/omdb/Cargo.toml b/dev-tools/omdb/Cargo.toml
index 0990fdb11..996215273 100644
--- a/dev-tools/omdb/Cargo.toml
+++ b/dev-tools/omdb/Cargo.toml
@@ -60,6 +60,13 @@ omicron-workspace-hack.workspace = true
 multimap.workspace = true
 indicatif.workspace = true
 
+# XXX-dap
+http.workspace = true
+oxide-client.workspace = true
+reqwest.workspace = true
+signal-hook.workspace = true
+signal-hook-tokio.workspace = true
+
 [dev-dependencies]
 expectorate.workspace = true
 nexus-test-utils.workspace = true
diff --git a/dev-tools/omdb/src/bin/omdb/main.rs b/dev-tools/omdb/src/bin/omdb/main.rs
index 8fc48f502..5dd85ce62 100644
--- a/dev-tools/omdb/src/bin/omdb/main.rs
+++ b/dev-tools/omdb/src/bin/omdb/main.rs
@@ -36,12 +36,20 @@
 use anyhow::anyhow;
 use anyhow::ensure;
 use anyhow::Context;
+use clap::Args;
 use clap::Parser;
 use clap::Subcommand;
 use futures::StreamExt;
+use futures::TryStreamExt;
 use omicron_common::address::Ipv6Subnet;
+use oxide_client::ClientProjectsExt;
+use signal_hook::consts::signal::SIGINT;
+use signal_hook_tokio::Signals;
 use std::net::SocketAddr;
 use std::net::SocketAddrV6;
+use std::sync::atomic::AtomicU64;
+use std::sync::atomic::Ordering;
+use std::sync::Arc;
 use tokio::net::TcpSocket;
 
 mod crucible_agent;
@@ -71,6 +79,11 @@ async fn main() -> Result<(), anyhow::Error> {
         OmdbCommands::Oxql(oxql) => oxql.run_cmd(&args, &log).await,
         OmdbCommands::SledAgent(sled) => sled.run_cmd(&args, &log).await,
         OmdbCommands::CrucibleAgent(crucible) => crucible.run_cmd(&args).await,
+
+        OmdbCommands::ProjectBlast(args) => run_project_blast(&args).await,
+        OmdbCommands::ProjectBlastDelete(args) => {
+            run_project_blast_delete(&args).await
+        }
     }
 }
 
@@ -275,6 +288,10 @@ enum OmdbCommands {
     Oxql(oxql::OxqlArgs),
     /// Debug a specific Sled
     SledAgent(sled_agent::SledAgentArgs),
+
+    // XXX-dap dap stuff
+    ProjectBlast(ProjectBlastArgs),
+    ProjectBlastDelete(ProjectBlastDeleteArgs),
 }
 
 fn parse_dropshot_log_level(
@@ -282,3 +299,197 @@ fn parse_dropshot_log_level(
 ) -> Result<dropshot::ConfigLoggingLevel, anyhow::Error> {
     serde_json::from_str(&format!("{:?}", s)).context("parsing log level")
 }
+
+#[derive(Debug, Args)]
+struct ProjectBlastArgs {
+    external_url: String,
+    silo_name: String,
+    username: String,
+    password: String,
+    concurrency: usize,
+    #[clap(default_value_t = 1000)]
+    pause_ms: u64,
+    #[clap(long, action)]
+    ignore_errors: bool,
+}
+
+async fn run_project_blast(
+    args: &ProjectBlastArgs,
+) -> Result<(), anyhow::Error> {
+    let login_url =
+        format!("{}/v1/login/{}/local", args.external_url, args.silo_name);
+    let reqwest_builder = reqwest::ClientBuilder::new();
+    let session_token = oxide_client::login(
+        reqwest_builder,
+        &login_url,
+        args.username.parse().with_context(|| {
+            format!("parsing username {:?}", &args.username)
+        })?,
+        args.password.parse().context("parsing password")?,
+    )
+    .await
+    .with_context(|| format!("logging in at {:?}", &login_url))?;
+
+    let signals = Signals::new(&[SIGINT]).expect("failed to wait for SIGINT");
+    let mut signal_stream = signals.fuse();
+
+    let nstarted = Arc::new(AtomicU64::new(0));
+    let nfailed = Arc::new(AtomicU64::new(0));
+    let ndone = Arc::new(AtomicU64::new(0));
+
+    println!("creating projects until ^C");
+    let prefix = "dummy-";
+    let mut tasks = Vec::with_capacity(args.concurrency);
+    for which in 0..args.concurrency {
+        let mywhich = which;
+        let session_token = session_token.clone();
+        let nstarted = nstarted.clone();
+        let nfailed = nfailed.clone();
+        let ndone = ndone.clone();
+        let url = args.external_url.clone();
+        let ignore_errors = args.ignore_errors;
+        let pause_ms = args.pause_ms;
+        let task = tokio::spawn(async move {
+            let mut headers = http::header::HeaderMap::new();
+            let header_name = http::header::COOKIE;
+            let header_value = http::HeaderValue::from_str(&format!(
+                "session={}",
+                session_token
+            ))
+            .unwrap();
+            headers.append(header_name, header_value);
+            let reqwest_client = reqwest::ClientBuilder::new()
+                .default_headers(headers)
+                .build()
+                .unwrap();
+            let client =
+                oxide_client::Client::new_with_client(&url, reqwest_client);
+
+            let mut i = 0;
+            loop {
+                i += 1;
+                let project_name = format!("{}{}-{}", prefix, mywhich, i);
+                nstarted.fetch_add(1, Ordering::SeqCst);
+                let result = client
+                    .project_create()
+                    .body_map(|p| {
+                        p.name(&project_name).description("who cares")
+                    })
+                    .send()
+                    .await;
+                match result {
+                    Ok(_) => {
+                        ndone.fetch_add(1, Ordering::SeqCst);
+                        eprint!(".");
+                    }
+                    Err(error) => {
+                        nfailed.fetch_add(1, Ordering::SeqCst);
+                        if ignore_errors {
+                            eprint!("x");
+                        } else {
+                            panic!(
+                                "unexpected error creating {:?}: {:#}",
+                                project_name, error
+                            );
+                        }
+                    }
+                }
+
+                tokio::time::sleep(std::time::Duration::from_millis(pause_ms))
+                    .await;
+            }
+        });
+        tasks.push(task);
+    }
+
+    let caught_signal = signal_stream.next().await;
+    assert_eq!(caught_signal.unwrap(), SIGINT);
+    eprintln!("caught signal, shutting down");
+    println!(
+        "at signal: nstarted = {}, ndone = {}, nfailed = {}",
+        nstarted.load(Ordering::SeqCst),
+        ndone.load(Ordering::SeqCst),
+        nfailed.load(Ordering::SeqCst)
+    );
+    for task in &tasks {
+        task.abort();
+    }
+    for task in tasks {
+        let _ = task.await;
+    }
+    println!(
+        "final: nstarted = {}, ndone = {}, nfailed = {}",
+        nstarted.load(Ordering::SeqCst),
+        ndone.load(Ordering::SeqCst),
+        nfailed.load(Ordering::SeqCst)
+    );
+
+    Ok(())
+}
+
+#[derive(Debug, Args)]
+struct ProjectBlastDeleteArgs {
+    external_url: String,
+    silo_name: String,
+    username: String,
+    password: String,
+}
+
+async fn run_project_blast_delete(
+    args: &ProjectBlastDeleteArgs,
+) -> Result<(), anyhow::Error> {
+    let login_url =
+        format!("{}/v1/login/{}/local", args.external_url, args.silo_name);
+    let reqwest_builder = reqwest::ClientBuilder::new();
+    let session_token = oxide_client::login(
+        reqwest_builder,
+        &login_url,
+        args.username.parse().with_context(|| {
+            format!("parsing username {:?}", &args.username)
+        })?,
+        args.password.parse().context("parsing password")?,
+    )
+    .await
+    .with_context(|| format!("logging in at {:?}", &login_url))?;
+
+    let mut headers = http::header::HeaderMap::new();
+    let header_name = http::header::COOKIE;
+    let header_value =
+        http::HeaderValue::from_str(&format!("session={}", session_token))
+            .unwrap();
+    headers.append(header_name, header_value);
+    let reqwest_client =
+        reqwest::ClientBuilder::new().default_headers(headers).build().unwrap();
+    let client = oxide_client::Client::new_with_client(
+        &args.external_url,
+        reqwest_client,
+    );
+
+    println!("deleting projects");
+    let prefix = "dummy-";
+    let ndeleted = Arc::new(AtomicU64::new(0));
+    let mut stream = client.project_list().stream();
+    while let Some(project) =
+        stream.try_next().await.context("listing projects")?
+    {
+        if !project.name.starts_with(prefix) {
+            println!("skipping {}", project.name.to_string());
+            continue;
+        }
+
+        println!("deleting {}", project.name.to_string());
+        client
+            .project_delete()
+            .project(project.name.clone())
+            .send()
+            .await
+            .with_context(|| {
+                format!("delete project {:?}", project.name.to_string())
+            })?;
+        ndeleted.fetch_add(1, Ordering::SeqCst);
+    }
+
+    println!("done: {} projects deleted", ndeleted.load(Ordering::SeqCst));
+
+    Ok(())
+}

(note: project-blast-delete does not work because it does not delete the VPCs in the newly-created projects)

Stress test: furious recovery while launching sagas

Procedure:

  1. Start a stack with cargo run --bin=omicron-dev -- run-all
  2. In another terminal, adjust the example Nexus config file based on the various ports, etc. reported by omicron-dev run-all.
  3. Start a second Nexus using that config file with cargo run --bin=nexus config.toml
  4. In another terminal: cargo run --bin=omdb -- project-blast http://127.0.0.1:12222 test-suite-silo test-privileged oxide 16 200 --ignore-errors (you'd have to change the port to match how you tweaked the second Nexus config.toml). 16 here is the concurrency and 200ms is the pause time between requests in each task. (I also ran this with concurrency 4 and no pause time and got similar results, but the results below are from this run with concurrency=16.)
  5. In another terminal: while :; do OMDB_NEXUS_URL=http://[::1]:12223 cargo run --bin=omdb -- -w nexus background-tasks activate saga_recovery; done

This constantly activates the background task while constantly creating new sagas. I wanted to stress-test the path that keeps track of in-progress sagas. I let this run for a while.

Here's how I had to tweak the Nexus config:

dap@ivanova omicron-work $ diff nexus/examples/config.toml config.toml 
4a5,7
> mgd.switch0.address = "[::1]:42745"
> mgd.switch1.address = "[::1]:44345"
> 
17c20
< level = "info"
---
> level = "debug"
20c23
< mode = "stderr-terminal"
---
> #mode = "stderr-terminal"
23,25c26,28
< #mode = "file"
< #path = "logs/server.log"
< #if_exists = "append"
---
> mode = "file"
> path = "nexus.log"
> if_exists = "append"
33c36
< id = "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"
---
> id = "333518da-5601-43e8-b9bd-f675571c5797"
34a38
> techport_external_server_port = 0
42c46
< bind_address = "127.0.0.1:12220"
---
> bind_address = "127.0.0.1:12222"
59c63
< bind_address = "[::1]:12221"
---
> bind_address = "[::1]:12223"
67c71
< address = "[::1]:3535"
---
> address = "[::1]:63080"
72c76
< url = "postgresql://root@[::1]:32221/omicron?sslmode=disable"
---
> url = "postgresql://root@[::1]:36897/omicron?sslmode=disable"

(Some of these seem like bugs. I'll file some more issues.)

project-blast reported:

creating projects until ^C
...................................................................................................................
...
^Ccaught signal, shutting down
at signal: nstarted = 3142, ndone = 3133, nfailed = 0
final: nstarted = 3142, ndone = 3133, nfailed = 0

(I elided a bunch of dots there. Each dot is a project created. There'd be an 'x' if one failed.)

I also ran omdb nexus background-tasks show and saw:

task: "saga_recovery"
  configured period: every 10m
  currently executing: no
  last completed activation: iter 94, triggered by an explicit signal
    started at 2024-07-12T18:07:58.263Z (17s ago) and ran for 4ms
    since Nexus started:
        sagas recovered:           0
        sagas recovery errors:     0
        sagas observed started:  3142
        sagas inferred finished: 3142
        missing from SEC:          0
        bad state in SEC:          0
    last pass:
        found sagas:   0 (in-progress, assigned to this Nexus)
        recovered:     0 (successfully)
        failed:        0
        skipped:       0 (already running)
        removed:       0 (newly finished)
    no recovered sagas
    no saga recovery failures

Here's what's in the database:

root@[::1]:44431/omicron> select count(*) from project where name like 'dummy-%';
  count
---------
   3142
(1 row)

...

root@[::1]:44431/omicron> select name,saga_state,count(*) from saga group by name, saga_state;
       name      | saga_state | count
-----------------+------------+--------
  project-create | done       |  3142
(1 row)

This is all consistent:

  • The tool reports creating 3142 projects, which should have created 3142 sagas
  • The background task reports having seen 3142 sagas start and 3142 sagas finish
  • The task did not attempt to recover any of these (good!)
  • There are indeed 3142 projects with a name matching what the tool creates
  • There are indeed 3142 project-create sagas and they're all done

Let's look at the Nexus log:

$ grep -i saga_recovery nexus-repeated-activations.log | json -ga msg | sort | uniq -c | sort -n
  94 activating
  94 activation complete
  94 listed in-progress sagas
 636 found saga that can be ignored (already running)
3142 found saga that appears to be done (missing from two database listings)
3142 found saga that may be done (will be sure on the next pass)
3142 observed saga start

That's all expected. It's nice to see that 636 times we ignored a saga because it was running -- that means the task caught a bunch of sagas in the middle of things.
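The "may be done" / "appears to be done" pair of log messages reflects the two-listing inference: a known saga is only inferred finished after it's absent from two consecutive database listings, which tolerates a saga that was just created and hasn't reached the database yet. A minimal sketch of that logic, with hypothetical names:

```rust
use std::collections::HashSet;

/// Hypothetical sketch of the "inferred finished" bookkeeping.
struct DoneTracker {
    // Sagas that were missing from exactly one listing so far.
    maybe_done: HashSet<u64>,
}

impl DoneTracker {
    fn new() -> Self {
        DoneTracker { maybe_done: HashSet::new() }
    }

    /// Given the set of known sagas and one database listing of
    /// in-progress sagas, return the sagas inferred finished: those
    /// missing from both this listing and the previous one.
    fn observe_listing(
        &mut self,
        known: &HashSet<u64>,
        db_listing: &HashSet<u64>,
    ) -> Vec<u64> {
        let missing: HashSet<u64> =
            known.difference(db_listing).copied().collect();
        let mut finished: Vec<u64> =
            self.maybe_done.intersection(&missing).copied().collect();
        finished.sort();
        // Anything missing only once is merely "maybe done" until the
        // next listing confirms it.
        self.maybe_done = missing;
        finished
    }
}

fn main() {
    let mut tracker = DoneTracker::new();
    let known: HashSet<u64> = [1, 2].into_iter().collect();
    let listing: HashSet<u64> = [1].into_iter().collect();
    // Pass 1: saga 2 is missing once, so it's only "maybe done".
    assert!(tracker.observe_listing(&known, &listing).is_empty());
    // Pass 2: missing again, so it's now inferred finished.
    assert_eq!(tracker.observe_listing(&known, &listing), vec![2]);
    println!("ok");
}
```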

There are no warning-or-worse log messages from this task:

$ grep -i saga_recovery nexus-repeated-activations.log | looker -lwarn
$

Other tasks are spewing lots of warnings (will file some bugs). But if we ignore those, the only warnings we have are for cancelled requests:

$ grep -v background_task nexus-repeated-activations.log | looker -lwarn
18:07:53.172Z WARN 333518da-5601-43e8-b9bd-f675571c5797 (dropshot_external): request handling cancelled (client disconnected)
    file = /home/dap/.cargo/git/checkouts/dropshot-a4a923d29dccc492/6a3f84c/dropshot/src/server.rs:815
    latency_us = 286170
    local_addr = 127.0.0.1:12222
    method = POST
    remote_addr = 127.0.0.1:49492
    req_id = 6dae2526-4fca-4939-ac6d-b96dfcd502c2
    uri = /v1/projects
...
$ grep -v background_task nexus-repeated-activations.log | grep -v 'request completed after handler was already cancelled' | grep -v 'request handling cancelled' | looker -lwarn
$

Kill Nexus with lots of sagas ongoing

Procedure:

  1. As above: start a stack with cargo run --bin=omicron-dev -- run-all
  2. As above: tweak the Nexus config and start a second Nexus.
  3. As above: in another terminal start omdb project-blast (concurrency = 64)
  4. After a while, ^C Nexus
  5. Save the log file up to this point
  6. Start Nexus again

Here's what omdb shows after starting Nexus again:

task: "saga_recovery"
  configured period: every 10m
  currently executing: no
  last completed activation: iter 1, triggered by a periodic timer firing
    started at 2024-07-12T18:32:32.393Z (19s ago) and ran for 654ms
    since Nexus started:
        sagas recovered:          31
        sagas recovery errors:     0
        sagas observed started:    0
        sagas inferred finished:   0
        missing from SEC:          0
        bad state in SEC:          0
    last pass:
        found sagas:  31 (in-progress, assigned to this Nexus)
        recovered:    31 (successfully)
        failed:        0
        skipped:       0 (already running)
        removed:       0 (newly finished)
    recently recovered sagas (31):
        TIME                 SAGA_ID
        2024-07-12T18:32:32Z 052247e0-853d-4d16-80e7-7c618d97edd8
        2024-07-12T18:32:32Z 1b4580da-7bb8-4e6a-b5aa-cbb7d4e91b55
        2024-07-12T18:32:32Z 28412c32-46c1-4070-bb25-123638e4b267
        2024-07-12T18:32:32Z 3427faa2-658f-4666-9ad1-7e7c65d9e959
        2024-07-12T18:32:32Z 45b9aa82-9c28-4b1b-907b-b44a546620de
        2024-07-12T18:32:32Z 5063dfb9-2ce8-435d-8e1a-698f41bd620a
        2024-07-12T18:32:32Z 52bc1f16-0035-498f-9331-2a78e47a6639
        2024-07-12T18:32:32Z 5601a83f-08c2-4a3c-a84e-70680d44cd09
        2024-07-12T18:32:32Z 60b864aa-afac-4500-b4b5-a7c6994059d7
        2024-07-12T18:32:32Z 7116795c-bef4-4870-bcf5-158ec3505fd3
        2024-07-12T18:32:32Z 7d1c78bd-9dcc-4614-9855-e70fed4b37b5
        2024-07-12T18:32:32Z 84cf464b-99e3-4343-8c81-c26b7e11e643
        2024-07-12T18:32:32Z 8c9d579c-ae12-4f83-b39a-27c2050c72e3
        2024-07-12T18:32:32Z 919ea676-a9fb-451d-9775-b3613447d874
        2024-07-12T18:32:32Z 9e415ba8-74ca-4cf1-b143-7bde501395f2
        2024-07-12T18:32:32Z a48168c8-ee69-4890-a1a1-3d25d96d9b94
        2024-07-12T18:32:32Z a4adfcb8-0f17-4154-bd52-1ad83b593fc7
        2024-07-12T18:32:32Z a706e0dd-a693-43d3-b7fa-a269ed3f5abf
        2024-07-12T18:32:32Z b3ec589c-22b9-4e8e-a358-d3b1fea32d4d
        2024-07-12T18:32:32Z b7b85c17-fdb2-4d9e-acab-c6f751308274
        2024-07-12T18:32:32Z bef78424-9f59-421e-9c91-9b2138db29b4
        2024-07-12T18:32:32Z c0685095-d70f-4849-a2ad-4ce2b5f6daea
        2024-07-12T18:32:32Z c4162dd2-c1b6-4ce8-b736-52d1133ff946
        2024-07-12T18:32:32Z c9d63185-1817-44a7-b75e-1d18f11d319c
        2024-07-12T18:32:32Z d455e845-232d-4b0e-8bde-26d35c7531d2
        2024-07-12T18:32:32Z dbce4a0c-377e-43f2-a2ad-f41028b1f019
        2024-07-12T18:32:32Z dc6f4312-86e9-47d5-a6da-5c98e3e7067f
        2024-07-12T18:32:32Z decd1a1f-2ad1-4817-96e0-e2a25f6eba42
        2024-07-12T18:32:32Z ef360a79-f4a8-4e64-9ecd-043ed68bbd0c
        2024-07-12T18:32:32Z f5b4d34f-b9c5-4c49-ad53-94d754433a3f
        2024-07-12T18:32:33Z faa59e0f-1222-417d-aac2-ca7c6569a231
    no saga recovery failures

I activated the task again and checked the status and saw:

task: "saga_recovery"
  configured period: every 10m
  currently executing: no
  last completed activation: iter 2, triggered by an explicit signal
    started at 2024-07-12T18:32:59.198Z (14s ago) and ran for 4ms
    since Nexus started:
        sagas recovered:          31
        sagas recovery errors:     0
        sagas observed started:    0
        sagas inferred finished:   0
        missing from SEC:          0
        bad state in SEC:          0
    last pass:
        found sagas:   0 (in-progress, assigned to this Nexus)
        recovered:     0 (successfully)
        failed:        0
        skipped:       0 (already running)
        removed:       0 (newly finished)
    recently recovered sagas (31):
        TIME                 SAGA_ID
        [same 31 sagas as in the first listing above]
    no saga recovery failures

Once more:

task: "saga_recovery"
  configured period: every 10m
  currently executing: no
  last completed activation: iter 3, triggered by an explicit signal
    started at 2024-07-12T18:33:34.620Z (4s ago) and ran for 8ms
    since Nexus started:
        sagas recovered:          31
        sagas recovery errors:     0
        sagas observed started:    0
        sagas inferred finished:  31
        missing from SEC:          0
        bad state in SEC:          0
    last pass:
        found sagas:   0 (in-progress, assigned to this Nexus)
        recovered:     0 (successfully)
        failed:        0
        skipped:       0 (already running)
        removed:       0 (newly finished)
    recently recovered sagas (31):
        TIME                 SAGA_ID
        [same 31 sagas as in the first listing above]
    no saga recovery failures

We'd have expected anywhere from 0 to 64 sagas in progress when Nexus crashed. It appears we had 31.

The tool spit out:

$ cargo run --bin=omdb -- project-blast http://127.0.0.1:12222 test-suite-silo test-privileged oxide 64 --ignore-errors
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.92s
     Running `target/debug/omdb project-blast 'http://127.0.0.1:12222' test-suite-silo test-privileged oxide 64 --ignore-errors`
creating projects until ^C
.......................................................................................................................................................................................................................................................................................................................................................................................................................xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx^Ccaught signal, shutting down
at signal: nstarted = 567, ndone = 407, nfailed = 132
final: nstarted = 567, ndone = 407, nfailed = 132

From this we can infer that at least 407 sagas were created and no more than 567. We don't know from the tool output how many were in progress when Nexus crashed, but it shouldn't have been more than 64 (the concurrency limit). I'm not sure how much else we can infer from this. In particular, of those 132 failures, some may have created sagas and some may not have. This is complicated further because in principle every saga that ran to completion should have succeeded, but some didn't due to bugs in the sagas themselves (for which I will file separate issues).

Anyway, there are in fact 438 projects from our tool and 438 project-create sagas:

root@[::1]:36897/omicron> select name,saga_state,count(*) from saga group by name, saga_state;
       name      | saga_state | count
-----------------+------------+--------
  project-create | done       |   438
(1 row)

...

root@[::1]:36897/omicron> select count(*) from project where name like 'dummy-%';
  count
---------
    438
(1 row)

That matches the 407 known to have completed plus the 31 recovered. That implies that some of the requests that project-blast reported as failed did complete, which makes sense: the project-create HTTP requests are synchronous, so if Nexus crashes during a request, the request fails, but the saga may already have been created, then recovered, and then completed successfully.

Let's look at what happened with the sagas:

root@[::1]:36897/omicron> select event_type,count(*) from saga_node_event GROUP BY event_type;
   event_type   | count
----------------+--------
  started       |  7866
  succeeded     |  7862
  undo_failed   |     3
  undo_started  |    23
  undo_finished |    20
  failed        |     4
(6 rows)

We had four action failures and then three undo action failures! I dug into these to be sure this wasn't somehow the fault of saga recovery (e.g., replaying an action or something? but that's always supposed to be legal anyway...)

If we ignore the messages from background tasks as before, we see some of these errors:

18:32:33.466Z WARN 333518da-5601-43e8-b9bd-f675571c5797 (ServerContext): unexpectedly failed to send response to SEC client
    file = /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:871
    saga_id = bef78424-9f59-421e-9c91-9b2138db29b4
    saga_name = project-create
    sec_id = 333518da-5601-43e8-b9bd-f675571c5797

I thought that wasn't a problem, though now I'm not sure why not. I'll come back to this. Anyway, ignoring that one:

dap@ivanova omicron-work $ grep -v background_task nexus-crash-after.log | grep -v 'request completed after handler was already cancelled' | grep -v 'request handling cancelled' | grep -v 'unexpectedly failed to send response to SEC client' | looker -lwarn
18:32:33.259Z ERRO 333518da-5601-43e8-b9bd-f675571c5797 (ServerContext): saga finished
    action_error_node_name = "subnet"
    action_error_source = ActionFailed { source_error: Object {"InternalError": Object {"internal_message": String("Failed to create default VPC Subnet, found overlapping IP address ranges")}} }
    file = /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:1039
    result = failure
    saga_id = 919ea676-a9fb-451d-9775-b3613447d874
    saga_name = project-create
    sec_id = 333518da-5601-43e8-b9bd-f675571c5797
    undo_error = PermanentFailure { source_error: Object {"message": String("undo action attempt 1: Invalid Request: VPC cannot be deleted while VPC Subnets exist")} }
    undo_error_node_name = "vpc"
18:32:33.289Z ERRO 333518da-5601-43e8-b9bd-f675571c5797 (ServerContext): saga finished
    action_error_node_name = "subnet"
    action_error_source = ActionFailed { source_error: Object {"InternalError": Object {"internal_message": String("Failed to create default VPC Subnet, found overlapping IP address ranges")}} }
    file = /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:1039
    result = failure
    saga_id = decd1a1f-2ad1-4817-96e0-e2a25f6eba42
    saga_name = project-create
    sec_id = 333518da-5601-43e8-b9bd-f675571c5797
    undo_error = PermanentFailure { source_error: Object {"message": String("undo action attempt 1: Invalid Request: VPC cannot be deleted while VPC Subnets exist")} }
    undo_error_node_name = "vpc"
18:32:33.323Z ERRO 333518da-5601-43e8-b9bd-f675571c5797 (ServerContext): saga finished
    action_error_node_name = "subnet"
    action_error_source = ActionFailed { source_error: Object {"InternalError": Object {"internal_message": String("Failed to create default VPC Subnet, found overlapping IP address ranges")}} }
    file = /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:1039
    result = failure
    saga_id = dbce4a0c-377e-43f2-a2ad-f41028b1f019
    saga_name = project-create
    sec_id = 333518da-5601-43e8-b9bd-f675571c5797
    undo_error = PermanentFailure { source_error: Object {"message": String("undo action attempt 1: Invalid Request: VPC cannot be deleted while VPC Subnets exist")} }
    undo_error_node_name = "vpc"
18:32:33.350Z WARN 333518da-5601-43e8-b9bd-f675571c5797 (ServerContext): saga finished
    action_error_node_name = "route4"
    action_error_source = ActionFailed { source_error: Object {"ObjectAlreadyExists": Object {"object_name": String("default-v4"), "type_name": String("router-route")}} }
    file = /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:1044
    result = failure
    saga_id = c4162dd2-c1b6-4ce8-b736-52d1133ff946
    saga_name = project-create
    sec_id = 333518da-5601-43e8-b9bd-f675571c5797
    undo_result = success

So we have three failures from node "subnet" that look the same. Looking at the implementation of svc_create_subnet() (the action I suspect is failing), I think it's not idempotent: it attempts to insert a record and then fails with a conflict if it can't, so a replayed action can fail where the first attempt succeeded. I'll file a bug for this. In all of these, the undo action also fails trying to delete the VPC because it already has a subnet. I'll write this up in a new ticket, but I think these are basically one bug, which is that the operation is not atomic.
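
To illustrate the difference, here's a minimal sketch in plain Rust, with a `HashMap` standing in for the database. The names and types (`Subnet`, `create_subnet_naive`, `create_subnet_idempotent`) are hypothetical, not the actual omicron code; the point is just how a replayed saga action behaves under each pattern:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a subnet record; not the real omicron type.
#[derive(Debug, Clone, PartialEq)]
struct Subnet {
    ipv4_block: String,
}

// Naive pattern: insert, and fail with a conflict if a row already
// exists.  If Nexus crashes after the insert and the action is replayed
// during saga recovery, the replay fails even though the first attempt
// succeeded.
fn create_subnet_naive(
    db: &mut HashMap<String, Subnet>,
    name: &str,
    subnet: Subnet,
) -> Result<Subnet, String> {
    if db.contains_key(name) {
        return Err(String::from("overlapping IP address ranges"));
    }
    db.insert(name.to_string(), subnet.clone());
    Ok(subnet)
}

// Idempotent pattern: if an identical record already exists, treat the
// insert as having succeeded.  A replayed action then succeeds too.
fn create_subnet_idempotent(
    db: &mut HashMap<String, Subnet>,
    name: &str,
    subnet: Subnet,
) -> Result<Subnet, String> {
    if let Some(existing) = db.get(name) {
        if *existing == subnet {
            return Ok(existing.clone());
        }
        return Err(String::from("overlapping IP address ranges"));
    }
    db.insert(name.to_string(), subnet.clone());
    Ok(subnet)
}

fn main() {
    let mut db = HashMap::new();
    let subnet = Subnet { ipv4_block: String::from("172.30.0.0/22") };

    // First execution succeeds.
    assert!(create_subnet_idempotent(&mut db, "default", subnet.clone()).is_ok());
    // A replay (as after saga recovery) also succeeds...
    assert!(create_subnet_idempotent(&mut db, "default", subnet.clone()).is_ok());
    // ...whereas the naive version fails with a conflict.
    assert!(create_subnet_naive(&mut db, "default", subnet).is_err());
}
```

(In the real code the check-and-insert would also need to be atomic with respect to the database, which is the other half of the bug described above.)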

We also have a similar non-idempotent-looking failure from node "route4".

There are also these log messages from these sagas' execution:

18:32:32.664Z ERRO 333518da-5601-43e8-b9bd-f675571c5797 (ServerContext): failed to create default VPC Subnet, IP address range '172.30.0.0/22' overlaps with existing
    background_task = saga_recovery
    file = nexus/src/app/sagas/vpc_create.rs:374
    ipv4_block = Ipv4Net { addr: 172.30.0.0, width: 22 }
    ipv6_block = Ipv6Net { addr: fd25:a89f:414b::, width: 64 }
    saga_id = 919ea676-a9fb-451d-9775-b3613447d874
    saga_name = project-create
    subnet_id = 5f4c24ed-2995-42b3-a220-27cbf1293d27
    vpc_id = e884754e-223a-4540-a674-9aa44fa2669e
18:32:32.890Z ERRO 333518da-5601-43e8-b9bd-f675571c5797 (ServerContext): failed to create default VPC Subnet, IP address range '172.30.0.0/22' overlaps with existing
    background_task = saga_recovery
    file = nexus/src/app/sagas/vpc_create.rs:374
    ipv4_block = Ipv4Net { addr: 172.30.0.0, width: 22 }
    ipv6_block = Ipv6Net { addr: fd2f:760f:9dae::, width: 64 }
    saga_id = dbce4a0c-377e-43f2-a2ad-f41028b1f019
    saga_name = project-create
    subnet_id = 2a387f83-2481-4032-9d12-9993ce19b0b1
    vpc_id = f943f6a5-30dd-4542-8da4-5b296b6bcada
18:32:33.052Z ERRO 333518da-5601-43e8-b9bd-f675571c5797 (ServerContext): failed to create default VPC Subnet, IP address range '172.30.0.0/22' overlaps with existing
    background_task = saga_recovery
    file = nexus/src/app/sagas/vpc_create.rs:374
    ipv4_block = Ipv4Net { addr: 172.30.0.0, width: 22 }
    ipv6_block = Ipv6Net { addr: fdf2:8e62:f32a::, width: 64 }
    saga_id = decd1a1f-2ad1-4817-96e0-e2a25f6eba42
    saga_name = project-create
    subnet_id = 2b29a54b-56b6-4135-836b-1fa8f8d6a5f8
    vpc_id = 60af793c-06be-4f74-93a9-66fee377f1db

I don't see anything here that makes me worried that saga recovery induced any of these failures. As far as I can tell, it recovered an expected number of sagas, exactly once each, and every saga has run to completion one way or the other. I'd welcome other things to look for!

@davepacheco
Collaborator Author

Regarding those "unexpectedly failed to send response to SEC client" messages, I found that they're generally preceded by messages like these:

18:32:33.466Z INFO 333518da-5601-43e8-b9bd-f675571c5797 (ServerContext): saga finished
    file = /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:1025
    result = success
    saga_id = bef78424-9f59-421e-9c91-9b2138db29b4
    saga_name = project-create
    sec_id = 333518da-5601-43e8-b9bd-f675571c5797

Looking at the Steno code: it looks like Steno is trying to notify the consumer that the saga has finished. The consumer is allowed to ignore this (dropping their side of the channel) -- and indeed the saga recovery process does that. So I think these just correspond to sagas that completed after having been recovered. There are 28 of them:

$ grep -c 'unexpectedly failed to send response to SEC client' nexus-crash-after.log 
28
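
For what it's worth, the mechanism is easy to reproduce with a plain std channel. This is just an illustrative sketch of the failure mode described above, not Steno's actual code:

```rust
use std::sync::mpsc;

// Sketch of the benign failure: the sender (playing the role of the SEC
// notifying its client of saga completion) can only deliver the message
// if the consumer still holds the receiving end of the channel.
fn notify_completion(tx: mpsc::Sender<&'static str>) -> bool {
    tx.send("saga finished").is_ok()
}

fn main() {
    // Consumer keeps the receiver: the notification is delivered.
    let (tx, rx) = mpsc::channel();
    assert!(notify_completion(tx));
    assert_eq!(rx.recv().unwrap(), "saga finished");

    // Consumer has dropped the receiver (as the recovery task does with
    // the completion future once its pass is over): the send fails,
    // which corresponds to the "unexpectedly failed to send response to
    // SEC client" warning.
    let (tx, rx) = mpsc::channel::<&'static str>();
    drop(rx);
    assert!(!notify_completion(tx));
}
```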

The last question might be: why not 31?

@davepacheco
Collaborator Author

davepacheco commented Jul 12, 2024

The three sagas that were recovered but didn't produce this message were:

1b4580da-7bb8-4e6a-b5aa-cbb7d4e91b55
5063dfb9-2ce8-435d-8e1a-698f41bd620a
84cf464b-99e3-4343-8c81-c26b7e11e643

These completed normally so I'm not sure what's special about them. I dug into 84cf464b-99e3-4343-8c81-c26b7e11e643 and found that the saga finished before the recovery pass was over, so the completion future would not have been dropped yet, which probably explains why we didn't get that error.

@hawkw
Member

hawkw commented Jul 12, 2024

@davepacheco regarding your notes in #6063 (comment), under "Alternative: store SEC generation in the database", you wrote that:

This has a few cons: storing stuff in the database means we have a schema change, we're storing more data, and the information is replicated 5x

Can you elaborate a bit on why there's a 5x replication of data here? That wasn't obvious to me from the description you've written --- I assume this is because of some detail of how Steno works internally that I'm not familiar with? Or am I just overlooking something from your comment?

Thanks! And, by the way, thanks so much for the detailed design discussion here, it's very helpful!

Member

@hawkw hawkw left a comment

Thus far, I've only reviewed the nexus-saga-recovery crate, and not the background task. The implementation looks remarkably straightforward despite the complex behavior, and I appreciate the documentation! Overall, really nice work --- I'll try to get to the rest of the background task soon.

nexus/saga-recovery/src/lib.rs
nexus/src/app/background/tasks/saga_recovery.rs
nexus/saga-recovery/src/lib.rs
nexus/saga-recovery/src/lib.rs
nexus/saga-recovery/src/recovery.rs
nexus/saga-recovery/src/recovery.rs
Comment on lines +511 to +521
// Send some numbers and make sure we get them back.
let expected_numbers = vec![1, 7, 0, 1];
for n in &expected_numbers {
    tx.send(*n).unwrap();
}
let (numbers, disconnected) = read_all_from_channel(&mut rx);
assert_eq!(expected_numbers, numbers);
assert!(!disconnected);

// Send some more numbers and make sure we get them back.
let expected_numbers = vec![9, 7, 2, 0, 0, 6];
Member

this kinda feels like it's within sigma of a property test...not that the behavior is so complex or input-dependent that it needs to be, though...

nexus/saga-recovery/src/status.rs
/// load state from the database)
Failed { message: String },
/// This pass was at least partially successful
Success(LastPassSuccess),
Member

Should we really be calling this a "success" if it's really only sort of a success? "Planned" might be more descriptive?

Collaborator Author

I see what you're saying, though "Planned" to me suggests "not yet executed". "Completed"?

Member

🤷‍♀️ Success is fine with me, honestly, but Completed is also good. Up to you!

@davepacheco
Collaborator Author

@davepacheco regarding your notes in #6063 (comment), under "Alternative: store SEC generation in the database", you wrote that:

This has a few cons: storing stuff in the database means we have a schema change, we're storing more data, and the information is replicated 5x

Can you elaborate a bit on why there's a 5x replication of data here? That wasn't obvious to me from the description you've written --- I assume this is because of some detail of how Steno works internally that I'm not familiar with? Or am I just overlooking something from your comment?

It's just that if we store it in memory as we do right now, then we have one copy of the information, in memory in Nexus. If we instead put an SEC generation into each row of the saga table, we're essentially putting the same information into CockroachDB. CockroachDB is currently configured to replicate the omicron database 5x, so the information winds up on disk at least 5x, probably in memory as well, and probably more times because of indexes.

It's really not that big a deal because it's so little information to begin with. But when I was evaluating the two options, it seemed much heavier weight. Like I said, looking back, it seems like kind of a wash.

@hawkw
Member

hawkw commented Jul 13, 2024

Oh, I see, I wasn't sure whether you were just referring to the CRDB replication factor or something else. That makes sense, thanks!

@hawkw
Member

hawkw commented Jul 13, 2024

And, for the record, I think the approach you've gone with makes sense — I like that state which is only relevant to the current Unix process lifetime of a Nexus instance just lives in that process.

Member

@hawkw hawkw left a comment

Okay, I've finished my review of the Nexus parts of the implementation. Overall, this looks great, and I have no major concerns about the overall design --- very nice work! I left some style feedback, but no hard blockers.

/// Creates a new `OpContext` just like the given one, but with a different
/// identity.
///
/// This is only intended for tests.
Member

i wonder if this ought to be hidden behind a test_utils feature or something that the integration-test crate can enable? not a big deal though.

Collaborator Author

Fair question. I don't think we do this a lot and I'm not sure it's worth the extra compilation I'd expect it to trigger. I'm not opposed but I think it makes sense to decide this more holistically so I'm going to defer for now.

Member

That's reasonable. We can think harder about it later.

nexus/db-queries/src/db/datastore/saga.rs
nexus/db-queries/src/db/datastore/saga.rs
nexus/db-queries/src/db/datastore/saga.rs
let nexus_saga_recovery::LastPass::Success(last_pass_success) =
    status.last_pass
else {
    panic!("wrong last pass variant");
Member

teeny nitpick: should we format what we actually did get in the panic message, here?

Collaborator Author

Is there an easy way to do that with let ... else?

Member

no, i think you would need to use a match instead. it's not that important...
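
A sketch of what that `match` could look like, with a simplified stand-in enum rather than the real `nexus_saga_recovery` types: binding the non-matching value lets the panic message show what we actually got, which is awkward with `let ... else` because the matched value isn't available inside the `else` block.

```rust
// Simplified stand-ins for the real types; the actual enum lives in
// nexus_saga_recovery.
#[derive(Debug)]
enum LastPass {
    NeverStarted,
    Failed { message: String },
    Success(usize),
}

// A `match` can bind the unexpected variant and format it into the
// panic message, unlike `let ... else`.
fn expect_success(last_pass: LastPass) -> usize {
    match last_pass {
        LastPass::Success(n) => n,
        other => panic!("wrong last pass variant: {:?}", other),
    }
}

fn main() {
    assert_eq!(expect_success(LastPass::Success(31)), 31);
    // Construct the other variants so the sketch compiles warning-free.
    let _ = (LastPass::NeverStarted, LastPass::Failed { message: String::new() });
}
```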

nexus/src/app/mod.rs
Contributor

@andrewjstone andrewjstone left a comment

@davepacheco I really appreciate the comments you put in the code and PR to describe the mechanism of saga recovery and the complexity of allowing it to occur at arbitrary times. I think all your logic is correct, although I haven't reviewed most of the test code and probably didn't look as thoroughly as @hawkw.

I think given the level of testing you have done, and Eliza's review, this is probably good to go! My only real qualm about this PR is whether it's possible to do more to guarantee the ordering checks around "ignored" sagas. I'm not sure if explicitly looking for finished sagas in the DB would help in terms of correctness, but it may add a second level of asserts around "we are no longer running these sagas" -> "they show up as finished in the db". Again, this is likely overkill, and up to you to ignore the comment :)

@@ -246,22 +252,20 @@ impl BackgroundTasksInitializer {
///
/// This function will wire up the `Activator`s in `background_tasks` to the
/// corresponding tasks once they've been started.
#[allow(clippy::too_many_arguments)]
Contributor

Nice!

Comment on lines +220 to +222
// Correctness here requires that the caller has already updated the set
// of sagas that we're ignoring to include any that may have been
// created up to the beginning of the database query. (They do that by
Contributor

You can mostly ignore this comment. I'm just writing it down for lack of a better place and to see if maybe it will pique your interest / spark an idea.

I've read your comments in the saga recovery background task and most of that logic, but I'm still a bit concerned that correctness in this data structure requires such specific ordering in the background task. However, it all appears correct, and the excellent testing shows so. I'm wondering if there's a better way to enforce this ordering requirement in code so that it doesn't rely on comments to remain correct. I think one way to do that would be to use some sort of typestate pattern. But given that if the db query fails, we proceed anyway in the background task, this may add too much complexity in terms of states to be worth it.
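
As a thought experiment, here's roughly what that typestate idea could look like. This is a sketch with hypothetical names, not the actual omicron types: the planning step is only reachable from a type whose existence proves the ignore set was updated first.

```rust
use std::collections::BTreeSet;

// Hypothetical saga ID type for the sketch.
type SagaId = u64;

// State before the ordering obligation has been discharged.
struct RestState {
    sagas_started: BTreeSet<SagaId>,
}

// Holding this type is compile-time proof that the set of sagas started
// by this process was brought up to date before the recovery pass.
struct IgnoreSetUpdated {
    sagas_started: BTreeSet<SagaId>,
}

impl RestState {
    // Step 1: fold in sagas this process started since the last pass.
    // This consumes `RestState`, so a caller can't reach step 2 without
    // doing this first.
    fn update_started_sagas(mut self, newly_started: Vec<SagaId>) -> IgnoreSetUpdated {
        self.sagas_started.extend(newly_started);
        IgnoreSetUpdated { sagas_started: self.sagas_started }
    }
}

impl IgnoreSetUpdated {
    // Step 2: given the in-progress sagas listed from the database,
    // plan to recover only the ones this process didn't itself start.
    fn plan_recovery(&self, db_sagas: &[SagaId]) -> Vec<SagaId> {
        db_sagas
            .iter()
            .copied()
            .filter(|id| !self.sagas_started.contains(id))
            .collect()
    }
}

fn main() {
    let rest = RestState { sagas_started: BTreeSet::new() };
    // `plan_recovery` doesn't exist on `RestState`, so the ordering
    // "update ignore set, then plan" is enforced by the types.
    let updated = rest.update_started_sagas(vec![1, 2]);
    assert_eq!(updated.plan_recovery(&[1, 2, 3]), vec![3]);
}
```

As the comment notes, a fallible DB query in the middle would multiply the states, which may make this more trouble than it's worth.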

@davepacheco
Collaborator Author

@andrewjstone and I discussed a bit offline. In retrospect, it may well be worth having the background task do an extra pass listing sagas from the database to resolve the ambiguity about sagas that weren't in the first listing (did they finish or did they just start?). Since this is not an issue of correctness, I'm prioritizing urgency on this for now, but I'm happy to revisit it.

@davepacheco davepacheco enabled auto-merge (squash) July 15, 2024 21:20
@davepacheco davepacheco merged commit b214e28 into main Jul 16, 2024
21 checks passed
@davepacheco davepacheco deleted the dap/saga-recovery-as-task branch July 16, 2024 01:36
Successfully merging this pull request may close these issues.

saga recovery stops if Nexus fails to recover a saga