Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test failure: integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates #6727

Closed
hawkw opened this issue Sep 30, 2024 · 7 comments · Fixed by #7295
Assignees
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@hawkw
Copy link
Member

hawkw commented Sep 30, 2024

On branch #6726 commit b3346d9, I saw this test failure:

running 1 test
test integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates has been running for over 60 seconds
test integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates ... FAILED

failures:

failures:
    integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 365 filtered out; finished in 164.25s


--- STDERR:              omicron-nexus::test_all integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates ---
log file: /tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.0.log
note: configured to log to "/tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.0.log"
DB URL: postgresql://root@[::1]:38623/omicron?sslmode=disable
DB address: [::1]:38623
log file: /tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.2.log
note: configured to log to "/tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.2.log"
log file: /tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.3.log
note: configured to log to "/tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.3.log"
[nexus/tests/integration_tests/instances.rs:1368:23] make_forgotten_instance(&cptestctx, "resurgam",
        InstanceAutoRestartPolicy::BestEffort).await = 26edfe76-56d4-4c9e-9420-9218e91ea268 (instance)
[nexus/tests/integration_tests/instances.rs:1377:5] nexus_test_utils::background::activate_background_task(&cptestctx.internal_client,
        "instance_watcher").await = BackgroundTask {
    current: Idle,
    description: "periodically checks instance states",
    last: Completed(
        LastResultCompleted {
            details: Object {
                "failed_checks": Object {
                    "no_such_instance": Number(1),
                },
                "incomplete_checks": Object {},
                "instance_states": Object {},
                "pruned_instances": Number(0),
                "total_instances": Number(1),
                "update_sagas_queued": Number(1),
            },
            elapsed: Duration {
                nanos: 375909468,
                secs: 1,
            },
            iteration: 2,
            reason: Signaled,
            start_time: 2024-09-30T17:30:10.979761748Z,
        },
    ),
    name: "instance_watcher",
    period: Duration {
        nanos: 0,
        secs: 30,
    },
}
[nexus/tests/integration_tests/instances.rs:1396:5] instance_wait_for_state(client, instance_id, InstanceState::Starting).await = Instance {
    identity: IdentityMetadata {
        id: 26edfe76-56d4-4c9e-9420-9218e91ea268,
        name: Name(
            "resurgam",
        ),
        description: "instance \"resurgam\"",
        time_created: 2024-09-30T17:30:06.993226Z,
        time_modified: 2024-09-30T17:30:06.993226Z,
    },
    project_id: cfb19f9d-8b7f-49d9-a6e5-7fc799bc6be5,
    ncpus: InstanceCpuCount(
        4,
    ),
    memory: ByteCount(
        1073741824,
    ),
    hostname: "the-host",
    runtime: InstanceRuntimeState {
        run_state: Starting,
        time_run_state_updated: 2024-09-30T17:30:12.371281Z,
        time_last_auto_restarted: Some(
            2024-09-30T17:30:12.419981Z,
        ),
    },
    auto_restart_status: InstanceAutoRestartStatus {
        enabled: true,
        cooldown_expiration: Some(
            2024-09-30T18:30:12.419981Z,
        ),
    },
}
[nexus/tests/integration_tests/instances.rs:1402:5] instance_wait_for_vmm_registration(cptestctx, &instance_id).await = ()
[nexus/tests/integration_tests/instances.rs:1404:5] instance_simulate(nexus, &instance_id).await = ()
thread 'integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates' panicked at nexus/tests/integration_tests/instances.rs:5382:19:
instance 26edfe76-56d4-4c9e-9420-9218e91ea268 did not transition to Running after 120s: timed out after 120.235074552s
stack backtrace:
   0: rust_begin_unwind
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
   2: {async_fn#0}
             at ./tests/integration_tests/instances.rs:5382:19
   3: {async_fn#0}
             at ./tests/integration_tests/instances.rs:5333:6
   4: {async_fn#0}
             at ./tests/integration_tests/instances.rs:1407:14
   5: {async_block#0}
             at ./tests/integration_tests/instances.rs:1362:1
   6: poll<&mut dyn core::future::future::Future<Output=()>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   7: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   8: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:57
   9: with_budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:107:5
  10: budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:73:5
  11: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:25
  12: tokio::runtime::scheduler::current_thread::Context::enter
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:412:19
  13: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:672:36
  14: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:68
  15: tokio::runtime::context::scoped::Scoped<T>::set
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/scoped.rs:40:9
  16: tokio::runtime::context::set_scheduler::{{closure}}
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:26
  17: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>, core::option::Option<()>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>)>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:283:12
  18: std::thread::local::LocalKey<T>::with
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:260:9
  19: tokio::runtime::context::set_scheduler
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:9
  20: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:27
  21: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:660:19
  22: {closure#0}<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:180:28
  23: tokio::runtime::context::runtime::enter_runtime
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/runtime.rs:65:16
  24: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:168:9
  25: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:361:47
  26: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:335:13
  27: test_instance_failed_by_instance_watcher_automatically_reincarnates
             at ./tests/integration_tests/instances.rs:1362:1
  28: test_all::integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates::{{closure}}
             at ./tests/integration_tests/instances.rs:1365:2
  29: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
  30: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: "/tmp/.tmpeagOTP"
        If you would like to access the database for debugging, run the following:

        # Run the database
        cargo xtask db-dev run --no-populate --store-dir "/tmp/.tmpeagOTP/data"
        # Access the database. Note the port may change if you run multiple databases.
        cockroach sql --host=localhost:32221 --insecure
WARN: dropped ClickHouse process without cleaning it up first (there may still be a child process running (PID 2149884) and a temporary directory leaked, /tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.1-clickhouse-u4U4l4)
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /tmp/.tmpCeuyEZ
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /tmp/.tmpLrWu8z
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /tmp/.tmpAEq29n
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /tmp/.tmpbVjJR5

   Canceling due to test failure: 23 tests still running

Reran the test and it appears to pass. I wonder if it just took slightly longer than 2 minutes for the instance to transition back from Starting -> Running? Feels like a pretty long time though...

Log file:
test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.0.log

NOTE: Consider attaching any log files produced by the test.

@hawkw hawkw added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label Sep 30, 2024
@hawkw hawkw self-assigned this Sep 30, 2024
@hawkw
Copy link
Member Author

hawkw commented Sep 30, 2024

This is my test, so I'm gonna try to figure out what's gone wrong here once once I finish more urgent work.

@hawkw hawkw changed the title test failure: TEST_NAME test failure: integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates Sep 30, 2024
@jgallagher
Copy link
Contributor

Saw this on https://github.com/oxidecomputer/omicron/runs/31050364405 / https://buildomat.eng.oxide.computer/wg/0/details/01J9A281VZ5AWSENGSBPS2SZBM/YgsiBVpZeZs8xKwEs3NG8vJWRHaj6HRg4P2sOa8fSLBWOjXN/01J9A28XVZQ35MEGH8KPZZ9Q26. At a glance it looks basically the same (other than a new policy field in the logged structs?), so not sure there's any new info there, but thought I'd note it in case that's wrong.

@hawkw
Copy link
Member Author

hawkw commented Dec 20, 2024

Looking at @jgallagher's logs:

Hmm, it looks like we are correctly creating an instance_start saga for the instance: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L6313

And, the start saga has successfully told the simulated sled-agent to start that instance: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L6938

The start saga completes successfully: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L7170

This log line feels kinda weird, given that it says the background task completed "with errors", but it looks like there were no errors: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L7206

And then, the entire rest of the test seems to be us just sitting there waiting for the instance to transition to Running and it never does --- the simulated sled-agent keeps telling nexus that it's Starting: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L8491

I wonder if this is just some kinda race where the simulated sled-agent gets poked to move the instance from starting to running too early...

@hawkw
Copy link
Member Author

hawkw commented Dec 20, 2024

Ah, it looks like here is where the start saga attempts to transition the instance to Running: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L7069

And then we poke the sim sled-agent: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFJSP3KB6YHPRY4TCMP4VZ1A/oDh29FyjEH2gj9cSYdiP8KUpeUTrQAn8Exx9V96JPAQZDGN7/01JFJSQ30EJMFJHE7SG600K3WV/01JFJWZDNJZZNZYG1HDTNFZ5J6/test_all-75d38c4cef91c289-test_instance_failed_by_instance_watcher_automatically_reincarnates.143763.0.log?format=x-bunyan#L7075

And AFAICT this happens before the request to transition to Running is actually sent to the simulated sled-agent.

I guess we could fix this by poking the simulated instance every time we check for the desired state, or something, so that we don't potentially lose a state transition because we forgot to tell the simulated sled-agent to actually do it.

hawkw added a commit that referenced this issue Dec 20, 2024
A possible test flake exists in the tests for instance reincarnation.
This is caused by a race that occurs when the request to simulate an
instance's state transition is sent to the simulated sled-agent *before*
an instance-start saga sends the request to start the instance to that
sled-agent. When this occurs, the simulated state transition is lost,
and the test keeps waiting for it forever. See [this comment][1] for
details.

This commit resolves this by adding a new
`instance_wait_for_simulated_transition` helper, which is identical to
`instance_wait_for_state`, but with the addition of `instance_poke`
requests every time the instance is not observed to be in the desired
state. This is a bit of a blunt instrument, but it ensures that the
simulated sled-agent will always be told to simulate a state transition
after it's requested by the control plane.

Hopefully fixes #6727

[1]: #6727 (comment)
hawkw added a commit that referenced this issue Dec 20, 2024
A possible test flake exists in the tests for instance reincarnation.
This is caused by a race that occurs when the request to simulate an
instance's state transition is sent to the simulated sled-agent *before*
an instance-start saga sends the request to start the instance to that
sled-agent. When this occurs, the simulated state transition is lost,
and the test keeps waiting for it forever. See [this comment][1] for
details.

This commit resolves this by adding a new
`instance_wait_for_simulated_transition` helper, which is identical to
`instance_wait_for_state`, but with the addition of `instance_poke`
requests every time the instance is not observed to be in the desired
state. This is a bit of a blunt instrument, but it ensures that the
simulated sled-agent will always be told to simulate a state transition
after it's requested by the control plane.

Hopefully fixes #6727

[1]: #6727 (comment)
@hawkw
Copy link
Member Author

hawkw commented Dec 20, 2024

#7295 should fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants