sim-sled-agent removes destroyed VMMs from its table before reporting to Nexus #7282

gjcolombo opened this issue Dec 19, 2024 · 0 comments

A CI test run for #7211 biffed it in integration_tests::disks::test_disk_move_between_instances, specifically while trying to create and stop an instance (to which it was going to try to attach the test disk). Logs from the failed run are here: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFDZKBR1TVFTDN2CQC9RQRJX/LqJImvHS2nqTNJ8xklaVdZRzeVBxPnvzKkpfU6G10PeGtRSO/01JFE0RWTP7DP7FRV6S6M8CNVT/01JFE41AJ8FS8Y7XDQQ7SEXAEC/test_all-e39a2a7c7389546b-test_disk_move_between_instances.109403.0.log?format=x-bunyan

I think this is a bug in the simulated sled agent. First, we see the new (stopping) instance's VMM go to the Destroyed state:

[10409] 2024-12-18T23:42:01.825Z	INFO	test_disk_move_between_instances (dropshot_internal): received new VMM runtime state from sled agent
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = 127.0.0.1:39975
    method = PUT
    migration_state = Migrations { migration_in: None, migration_out: None }
    propolis_id = 1d0ff3f1-e9c1-4443-bd36-efcc4200ab15
    remote_addr = 127.0.0.1:37810
    req_id = 031a3f84-0295-4c40-9c05-ad193f0293b2
    uri = /vmms/1d0ff3f1-e9c1-4443-bd36-efcc4200ab15
    vmm_state = VmmRuntimeState { state: Destroyed, gen: Generation(7), time_updated: 2024-12-18T23:42:01.723653214Z }

This HTTP request completes here:

[10584] 2024-12-18T23:42:02.596Z	INFO	test_disk_move_between_instances (dropshot_internal): request completed
    latency_us = 774448
    local_addr = 127.0.0.1:39975
    method = PUT
    remote_addr = 127.0.0.1:37810
    req_id = 031a3f84-0295-4c40-9c05-ad193f0293b2
    response_code = 204
    uri = /vmms/1d0ff3f1-e9c1-4443-bd36-efcc4200ab15

Between these two points the instance watcher task swoops in and asks the simulated sled agent about the VMM. The agent replies that the VMM is gone; the watcher interprets this as a VMM that has unexpectedly gone missing and should be moved to the Failed state:

[10417] 2024-12-18T23:42:01.828Z	INFO	test_disk_move_between_instances (dropshot): request completed
    error_message_external = Not Found
    error_message_internal = VMM with ID 1d0ff3f1-e9c1-4443-bd36-efcc4200ab15 not found
    latency_us = 1441
    local_addr = [::1]:38575
    method = GET
    remote_addr = [::1]:38190
    req_id = 1a1447d9-9e89-41df-8845-a41fd655fd75
    response_code = 404
    sled_id = b6d65341-167c-41df-9b5c-41cded99c229
    uri = /vmms/1d0ff3f1-e9c1-4443-bd36-efcc4200ab15/state
[10419] 2024-12-18T23:42:01.829Z	INFO	test_disk_move_between_instances (BackgroundTasks): sled-agent error indicates that this instance's VMM has failed!
    background_task = instance_watcher
    error = Error Response: status: 404 Not Found; headers: {"content-type": "application/json", "x-request-id": "1a1447d9-9e89-41df-8845-a41fd655fd75", "content-length": "120", "date": "Wed, 18 Dec 2024 23:42:01 GMT"}; value: Error { error_code: Some("NO_SUCH_INSTANCE"), message: "Not Found", request_id: "1a1447d9-9e89-41df-8845-a41fd655fd75" }
    instance_id = 40470f10-206e-4a26-be87-96e862729da8
    sled_id = b6d65341-167c-41df-9b5c-41cded99c229
    vmm_id = 1d0ff3f1-e9c1-4443-bd36-efcc4200ab15
[10421] 2024-12-18T23:42:01.829Z	INFO	test_disk_move_between_instances (BackgroundTasks): received new VMM runtime state from sled agent
    background_task = instance_watcher
    instance_id = 40470f10-206e-4a26-be87-96e862729da8
    migration_state = Migrations { migration_in: None, migration_out: None }
    propolis_id = 1d0ff3f1-e9c1-4443-bd36-efcc4200ab15
    sled_id = b6d65341-167c-41df-9b5c-41cded99c229
    vmm_id = 1d0ff3f1-e9c1-4443-bd36-efcc4200ab15
    vmm_state = VmmRuntimeState { state: Failed, gen: Generation(7), time_updated: 2024-12-18T23:42:01.829357055Z }

Later we see an instance updater saga dutifully move the instance to Failed:

[10833] 2024-12-18T23:42:03.157Z	INFO	test_disk_move_between_instances (SagaExecutor): instance update: committed update new runtime state!
    did_unlock = true
    instance_id = 40470f10-206e-4a26-be87-96e862729da8
    new_runtime = InstanceRuntimeState { time_updated: 2024-12-18T23:42:02.806634751Z, gen: Generation(Generation(4)), propolis_id: None, dst_propolis_id: None, migration_id: None, nexus_state: Failed, time_last_auto_restarted: None }
    saga_id = 6cf498e3-bf25-40a8-bc9a-51a26a00c4ae
    saga_name = instance-update

Then the auto-restart task sends a new start request for the instance. Since the instance never actually comes to rest in the Stopped state, the part of the test that waits for it to stop times out and the test fails.

RFD 486 section 6.1 describes how sled agents are supposed to avoid this problem:

Before it can talk to a sled about a VMM, Nexus needs to read from CRDB to find the VMM and sled of interest. Nexus and sled agent must therefore be careful to avoid the following kind of time-of-check-to-time-of-use problem:

1. A user-initiated Nexus operation (e.g. instance stop) reads that the target instance is running in a VMM on some sled
2. Propolis reaches the Destroyed state and notifies sled agent
3. Sled agent observes the Destroyed state and removes the VMM from its tracking table
4. The Nexus task from step 1 calls sled agent’s "ensure instance state" API
5. Sled agent returns "instance not present"
6. Nexus concludes that the VMM has failed when it stopped normally

This can be fixed by having sled agent remove instances from its tracking table only after publishing the final Destroyed state transition to Nexus. This guarantees that if a caller observed a not-Destroyed VMM, and then calls sled agent after the VMM became Destroyed, then the caller’s prior observation has an out-of-date generation number:

1. Nexus reads that a VMM with generation 20 is running on some sled
2. Propolis reaches the Destroyed state and notifies sled agent
3. Sled agent sends updated VMM state to Nexus
4. Nexus advances the VMM record to generation 21 and records that it’s Destroyed
5. Sled agent removes the VMM from its table
6. The Nexus task from step 1 gets "instance not present" back from the sled agent
7. The same Nexus task tries to write that the VMM has Failed at generation 21; this write fails because the VMM was already advanced to that generation

If the rules had been followed here, the "state = destroyed, gen = 7" update from the sled agent would have been committed to CRDB before the agent started reporting that the VMM was gone. The watcher task's "state = failed, gen = 7" update would then have been rejected as stale, and the next instance-update saga would have moved the instance to Stopped.
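
To make the generation check concrete, here is a minimal, self-contained sketch of the idea. The type and function names are invented for illustration; this is not Nexus's actual update query, which lives in its database layer. It just shows how a write guarded by the generation number discards the watcher's stale Failed update once the Destroyed update has landed:

// Hypothetical illustration of a generation-guarded VMM state write; names
// and types here are made up for the example.
#[derive(Clone, Debug, PartialEq)]
enum VmmState {
    Stopping,
    Destroyed,
    Failed,
}

struct VmmRecord {
    state: VmmState,
    gen: u64,
}

/// Apply `new_state` only if `new_gen` advances the record's generation;
/// otherwise the update is stale and is dropped.
fn try_write_state(record: &mut VmmRecord, new_state: VmmState, new_gen: u64) -> bool {
    if new_gen > record.gen {
        record.state = new_state;
        record.gen = new_gen;
        true
    } else {
        false
    }
}

fn main() {
    // The record is at generation 6 while the VMM is stopping.
    let mut vmm = VmmRecord { state: VmmState::Stopping, gen: 6 };

    // Sled agent's "state = destroyed, gen = 7" notification is committed first...
    assert!(try_write_state(&mut vmm, VmmState::Destroyed, 7));

    // ...so the watcher's "state = failed, gen = 7" write is rejected as stale,
    // and the instance can come to rest in Stopped instead of Failed.
    assert!(!try_write_state(&mut vmm, VmmState::Failed, 7));
    assert_eq!(vmm.state, VmmState::Destroyed);
}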

The problem is that the simulated sled agent doesn't follow these rules. Instead, when it advances an object's state, it takes its collection lock, removes the object from the collection, applies the transition, and reinserts the object only if it isn't yet ready to destroy. It then drops the lock and sends the state change notification:

let (new_state, to_destroy) = {
    // The object must be present in `objects` because it only gets
    // removed when it comes to rest in the "Destroyed" state, but
    // we can only get here if there's an asynchronous state
    // transition desired.
    //
    // We do as little as possible with the lock held. In
    // particular, we want to finish this work before calling out to
    // notify the nexus.
    let mut objects = self.objects.lock().await;
    let mut object = objects.remove(&id).unwrap();
    object.transition_finish();
    let after = object.object.current().clone();
    if matches!(mode, PokeMode::SingleStep)
        || object.object.desired().is_none()
    {
        should_step = false;
    }
    if object.object.desired().is_none()
        && object.object.ready_to_destroy()
    {
        info!(&self.log, "object is ready to destroy";
            "object_id" => %id);
        (after, Some(object))
    } else {
        objects.insert(id, object);
        (after, None)
    }
};

// Notify Nexus that the object's state has changed.
// TODO-robustness: If this fails, we need to put it on some list of
// updates to retry later.
S::notify(&self.nexus_client, &id, new_state).await.unwrap();

Not holding the collection lock while calling another service is right and just, but if sim-sled-agent is going to do this for VMMs, it needs to wait to remove them from the collection until after the relevant notification is sent.
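
One way the poke path could be restructured along those lines (just a sketch to show the ordering; it ignores the existing retry TODO and whatever else the real collection code needs to preserve, and assumes the collection supports `get_mut`) is to decide on destruction while holding the lock, but defer the actual removal until after the notification has been sent:

// Sketch of a possible reordering: decide whether the object is ready to
// destroy under the lock, but leave it in the collection so that concurrent
// "get state" calls still see the Destroyed VMM.
let (new_state, ready_to_destroy) = {
    let mut objects = self.objects.lock().await;
    let object = objects.get_mut(&id).unwrap();
    object.transition_finish();
    let after = object.object.current().clone();
    if matches!(mode, PokeMode::SingleStep)
        || object.object.desired().is_none()
    {
        should_step = false;
    }
    let ready = object.object.desired().is_none()
        && object.object.ready_to_destroy();
    (after, ready)
};

// Notify Nexus first, so the final Destroyed state is published before the
// VMM can disappear from the simulated sled agent's table...
S::notify(&self.nexus_client, &id, new_state).await.unwrap();

// ...and only then drop the object from the collection.
if ready_to_destroy {
    info!(&self.log, "object is ready to destroy"; "object_id" => %id);
    self.objects.lock().await.remove(&id);
}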
