test failure: `integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates` #6727

hawkw · 2024-09-30T17:50:06Z

On branch #6726 commit b3346d9, I saw this test failure:

running 1 test
test integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates has been running for over 60 seconds
test integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates ... FAILED

failures:

failures:
    integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 365 filtered out; finished in 164.25s


--- STDERR:              omicron-nexus::test_all integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates ---
log file: /tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.0.log
note: configured to log to "/tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.0.log"
DB URL: postgresql://root@[::1]:38623/omicron?sslmode=disable
DB address: [::1]:38623
log file: /tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.2.log
note: configured to log to "/tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.2.log"
log file: /tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.3.log
note: configured to log to "/tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.3.log"
[nexus/tests/integration_tests/instances.rs:1368:23] make_forgotten_instance(&cptestctx, "resurgam",
        InstanceAutoRestartPolicy::BestEffort).await = 26edfe76-56d4-4c9e-9420-9218e91ea268 (instance)
[nexus/tests/integration_tests/instances.rs:1377:5] nexus_test_utils::background::activate_background_task(&cptestctx.internal_client,
        "instance_watcher").await = BackgroundTask {
    current: Idle,
    description: "periodically checks instance states",
    last: Completed(
        LastResultCompleted {
            details: Object {
                "failed_checks": Object {
                    "no_such_instance": Number(1),
                },
                "incomplete_checks": Object {},
                "instance_states": Object {},
                "pruned_instances": Number(0),
                "total_instances": Number(1),
                "update_sagas_queued": Number(1),
            },
            elapsed: Duration {
                nanos: 375909468,
                secs: 1,
            },
            iteration: 2,
            reason: Signaled,
            start_time: 2024-09-30T17:30:10.979761748Z,
        },
    ),
    name: "instance_watcher",
    period: Duration {
        nanos: 0,
        secs: 30,
    },
}
[nexus/tests/integration_tests/instances.rs:1396:5] instance_wait_for_state(client, instance_id, InstanceState::Starting).await = Instance {
    identity: IdentityMetadata {
        id: 26edfe76-56d4-4c9e-9420-9218e91ea268,
        name: Name(
            "resurgam",
        ),
        description: "instance \"resurgam\"",
        time_created: 2024-09-30T17:30:06.993226Z,
        time_modified: 2024-09-30T17:30:06.993226Z,
    },
    project_id: cfb19f9d-8b7f-49d9-a6e5-7fc799bc6be5,
    ncpus: InstanceCpuCount(
        4,
    ),
    memory: ByteCount(
        1073741824,
    ),
    hostname: "the-host",
    runtime: InstanceRuntimeState {
        run_state: Starting,
        time_run_state_updated: 2024-09-30T17:30:12.371281Z,
        time_last_auto_restarted: Some(
            2024-09-30T17:30:12.419981Z,
        ),
    },
    auto_restart_status: InstanceAutoRestartStatus {
        enabled: true,
        cooldown_expiration: Some(
            2024-09-30T18:30:12.419981Z,
        ),
    },
}
[nexus/tests/integration_tests/instances.rs:1402:5] instance_wait_for_vmm_registration(cptestctx, &instance_id).await = ()
[nexus/tests/integration_tests/instances.rs:1404:5] instance_simulate(nexus, &instance_id).await = ()
thread 'integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates' panicked at nexus/tests/integration_tests/instances.rs:5382:19:
instance 26edfe76-56d4-4c9e-9420-9218e91ea268 did not transition to Running after 120s: timed out after 120.235074552s
stack backtrace:
   0: rust_begin_unwind
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
   2: {async_fn#0}
             at ./tests/integration_tests/instances.rs:5382:19
   3: {async_fn#0}
             at ./tests/integration_tests/instances.rs:5333:6
   4: {async_fn#0}
             at ./tests/integration_tests/instances.rs:1407:14
   5: {async_block#0}
             at ./tests/integration_tests/instances.rs:1362:1
   6: poll<&mut dyn core::future::future::Future<Output=()>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   7: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   8: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:57
   9: with_budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:107:5
  10: budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:73:5
  11: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:25
  12: tokio::runtime::scheduler::current_thread::Context::enter
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:412:19
  13: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:672:36
  14: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:68
  15: tokio::runtime::context::scoped::Scoped<T>::set
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/scoped.rs:40:9
  16: tokio::runtime::context::set_scheduler::{{closure}}
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:26
  17: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>, core::option::Option<()>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>)>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:283:12
  18: std::thread::local::LocalKey<T>::with
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:260:9
  19: tokio::runtime::context::set_scheduler
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:9
  20: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:27
  21: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:660:19
  22: {closure#0}<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:180:28
  23: tokio::runtime::context::runtime::enter_runtime
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/runtime.rs:65:16
  24: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:168:9
  25: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:361:47
  26: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:335:13
  27: test_instance_failed_by_instance_watcher_automatically_reincarnates
             at ./tests/integration_tests/instances.rs:1362:1
  28: test_all::integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates::{{closure}}
             at ./tests/integration_tests/instances.rs:1365:2
  29: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
  30: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: "/tmp/.tmpeagOTP"
        If you would like to access the database for debugging, run the following:

        # Run the database
        cargo xtask db-dev run --no-populate --store-dir "/tmp/.tmpeagOTP/data"
        # Access the database. Note the port may change if you run multiple databases.
        cockroach sql --host=localhost:32221 --insecure
WARN: dropped ClickHouse process without cleaning it up first (there may still be a child process running (PID 2149884) and a temporary directory leaked, /tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.1-clickhouse-u4U4l4)
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /tmp/.tmpCeuyEZ
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /tmp/.tmpLrWu8z
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /tmp/.tmpAEq29n
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /tmp/.tmpbVjJR5

   Canceling due to test failure: 23 tests still running

Reran the test and it appears to pass. I wonder if it just took slightly longer than 2 minutes for the instance to transition back from Starting -> Running? Feels like a pretty long time though...

Log file:
test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.0.log

NOTE: Consider attaching any log files produced by the test.

The text was updated successfully, but these errors were encountered:

hawkw · 2024-09-30T17:50:43Z

This is my test, so I'm gonna try to figure out what's gone wrong here once once I finish more urgent work.

jgallagher · 2024-10-04T14:05:54Z

Saw this on https://github.com/oxidecomputer/omicron/runs/31050364405 / https://buildomat.eng.oxide.computer/wg/0/details/01J9A281VZ5AWSENGSBPS2SZBM/YgsiBVpZeZs8xKwEs3NG8vJWRHaj6HRg4P2sOa8fSLBWOjXN/01J9A28XVZQ35MEGH8KPZZ9Q26. At a glance it looks basically the same (other than a new policy field in the logged structs?), so not sure there's any new info there, but thought I'd note it in case that's wrong.

david-crespo · 2024-10-14T18:24:48Z

Just saw this on #6824.

https://github.com/oxidecomputer/omicron/runs/31513295998

https://buildomat.eng.oxide.computer/wg/0/details/01JA5ZXRB1CWA2BQ4G6XCHX07Q/nEHLkJyxA8Ap0rs1D8ue8tqis5q1jxc2XuUv8R28HeLuJ4dI/01JA5ZY9CP0CVDARCH7TZJAXSV

jgallagher · 2024-12-20T20:46:29Z

Seen again on #7276

https://github.com/oxidecomputer/omicron/pull/7276/checks?check_run_id=34727599196

https://buildomat.eng.oxide.computer/wg/0/details/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV

hawkw · 2024-12-20T22:08:13Z

Looking at @jgallagher's logs:

Hmm, it looks like we are correctly creating an instance_start saga for the instance: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L6313

And, the start saga has successfully told the simulated sled-agent to start that instance: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L6938

The start saga completes successfully: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L7170

This log line feels kinda weird, given that it says the background task completed "with errors", but it looks like there were no errors: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L7206

And then, the entire rest of the test seems to be us just sitting there waiting for the instance to transition to Running and it never does --- the simulated sled-agent keeps telling nexus that it's Starting: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L8491

I wonder if this is just some kinda race where the simulated sled-agent gets poked to move the instance from starting to running too early...

hawkw · 2024-12-20T22:21:14Z

Ah, it looks like here is where the start saga attempts to transition the instance to Running: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L7069

And then we poke the sim sled-agent: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L7075

And AFAICT this happens before the request to transition to Running is actually sent to the simulated sled-agent.

I guess we could fix this by poking the simulated instance every time we check for the desired state, or something, so that we don't potentially lose a state transition because we forgot to tell the simulated sled-agent to actually do it.

A possible test flake exists in the tests for instance reincarnation. This is caused by a race that occurs when the request to simulate an instance's state transition is sent to the simulated sled-agent *before* an instance-start saga sends the request to start the instance to that sled-agent. When this occurs, the simulated state transition is lost, and the test keeps waiting for it forever. See [this comment][1] for details. This commit resolves this by adding a new `instance_wait_for_simulated_transition` helper, which is identical to `instance_wait_for_state`, but with the addition of `instance_poke` requests every time the instance is not observed to be in the desired state. This is a bit of a blunt instrument, but it ensures that the simulated sled-agent will always be told to simulate a state transition after it's requested by the control plane. Hopefully fixes #6727 [1]: #6727 (comment)

hawkw · 2024-12-20T22:52:53Z

#7295 should fix this.

hawkw added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label Sep 30, 2024

hawkw self-assigned this Sep 30, 2024

hawkw changed the title ~~test failure: TEST_NAME~~ test failure: integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates Sep 30, 2024

hawkw mentioned this issue Dec 20, 2024

[tests] make instance reincarnation tests less racy #7295

Merged

hawkw closed this as completed in #7295 Dec 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test failure: `integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates` #6727

test failure: `integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates` #6727

hawkw commented Sep 30, 2024

hawkw commented Sep 30, 2024

jgallagher commented Oct 4, 2024

david-crespo commented Oct 14, 2024

jgallagher commented Dec 20, 2024

hawkw commented Dec 20, 2024

hawkw commented Dec 20, 2024

hawkw commented Dec 20, 2024

test failure: integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates #6727

test failure: integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates #6727

Comments

hawkw commented Sep 30, 2024

hawkw commented Sep 30, 2024

jgallagher commented Oct 4, 2024

david-crespo commented Oct 14, 2024

jgallagher commented Dec 20, 2024

hawkw commented Dec 20, 2024

hawkw commented Dec 20, 2024

hawkw commented Dec 20, 2024

test failure: `integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates` #6727

test failure: `integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates` #6727