sim-sled-agent removes destroyed VMMs from its table before reporting to Nexus #7282
Labels: Sled Agent, Test Flake, Testing & Analysis
A CI test run for #7211 biffed it in `integration_tests::disks::test_disk_move_between_instances`, specifically while trying to create and stop an instance (to which it was going to try to attach the test disk). Logs from the failed run are here: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFDZKBR1TVFTDN2CQC9RQRJX/LqJImvHS2nqTNJ8xklaVdZRzeVBxPnvzKkpfU6G10PeGtRSO/01JFE0RWTP7DP7FRV6S6M8CNVT/01JFE41AJ8FS8Y7XDQQ7SEXAEC/test_all-e39a2a7c7389546b-test_disk_move_between_instances.109403.0.log?format=x-bunyan

I think this is a bug in the simulated sled agent. First, we see the new (stopping) instance's VMM go to the Destroyed state:
This HTTP request completes here:
Between these two points the instance watcher task swoops in and asks the simulated sled agent about the VMM. The agent replies that the VMM is gone; the watcher interprets this as a gone-missing VMM that should be moved to the Failed state:
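For illustration, here's a minimal sketch of the decision the watcher makes when a sled agent claims to have no record of a VMM. This is not the actual Nexus instance-watcher code; the types and names below are made up.

```rust
// Hypothetical sketch: how a health-check task might map a sled agent's
// "VMM not found" reply onto a Failed state proposal.
#[derive(Debug)]
enum VmmState {
    Running,
    Stopping,
    Destroyed,
    Failed,
}

#[derive(Debug)]
struct ProposedUpdate {
    state: VmmState,
    generation: u64,
}

/// `poll_result` is what the sled agent returned for this VMM:
/// `Some(state)` if it still knows about it, `None` if it said "not found".
fn interpret_poll(poll_result: Option<VmmState>, current_gen: u64) -> ProposedUpdate {
    match poll_result {
        Some(state) => ProposedUpdate { state, generation: current_gen },
        // "Not found" is ambiguous: the VMM may have been torn down cleanly
        // (Destroyed) or may have genuinely vanished. The watcher assumes the
        // worst and proposes Failed at the VMM's current generation.
        None => ProposedUpdate { state: VmmState::Failed, generation: current_gen },
    }
}

fn main() {
    // In the failed run the VMM was at generation 7 when it disappeared.
    let update = interpret_poll(None, 7);
    println!("{update:?}"); // ProposedUpdate { state: Failed, generation: 7 }
}
```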
Later we see an instance updater saga dutifully move the instance to Failed:
Then the auto-restart task sends a new start request for the instance. Since the instance never actually comes to rest in the Stopped state, the part of the test that waits for this times out and the test fails.
RFD 486 section 6.1 describes how sled agents are supposed to avoid this problem:
If the rules had been followed here, the "state = destroyed, gen = 7" update from the sled agent would have been committed to CRDB before the agent began to return that the VMM was gone. That would have caused the watcher task's "state = failed, gen = 7" update not to be written, and the next updater task would then have moved the instance to Stopped.
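As a minimal sketch of that ordering rule, assuming a record that stores just a state string and a generation number (the real check is a guarded UPDATE in CRDB, not this in-memory version):

```rust
// An incoming update only wins if its generation is strictly greater than
// what is already recorded; a same-generation update is dropped as stale.
#[derive(Debug, Clone, PartialEq)]
struct VmmRecord {
    state: String,
    generation: u64,
}

fn apply_update(record: &mut VmmRecord, new_state: &str, new_gen: u64) -> bool {
    if new_gen > record.generation {
        record.state = new_state.to_string();
        record.generation = new_gen;
        true
    } else {
        // Stale or concurrent update at the same generation: reject it.
        false
    }
}

fn main() {
    let mut vmm = VmmRecord { state: "stopping".into(), generation: 6 };
    // The sled agent's final report lands first, as RFD 486 requires...
    assert!(apply_update(&mut vmm, "destroyed", 7));
    // ...so the watcher's "failed, gen = 7" proposal is not written.
    assert!(!apply_update(&mut vmm, "failed", 7));
    assert_eq!(vmm.state, "destroyed");
}
```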
The problem is that the simulated sled agent doesn't follow these rules. Instead, when it updates an object's state, it takes its collection lock, removes the object from the collection, and then decides whether the object is ready to destroy. If it is, it drops the lock and then sends a state change notification:

omicron/sled-agent/src/sim/collection.rs, lines 238 to 274 at d337333
Not holding the collection lock while calling another service is right and just, but if sim-sled-agent is going to do this for VMMs, it needs to wait to remove them from the collection until after the relevant notification is sent.
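Something like the following reordering would address it. This is a rough sketch under assumed types, not a patch against `sim::collection`:

```rust
use std::collections::HashMap;
use std::future::Future;
use std::sync::Mutex;

#[derive(Clone, Debug)]
struct SimObject {
    state: String,
    generation: u64,
}

struct Collection {
    objects: Mutex<HashMap<u64, SimObject>>,
}

impl Collection {
    /// Destroy `id`: update its state, tell Nexus, and only then forget it.
    async fn destroy<F: Future<Output = ()>>(&self, id: u64, notify_nexus: F) {
        // Step 1: mark the object destroyed in place. Pollers that race with
        // this call still find it and see Destroyed rather than "not found".
        {
            let mut objects = self.objects.lock().unwrap();
            if let Some(obj) = objects.get_mut(&id) {
                obj.state = "destroyed".to_string();
                obj.generation += 1;
            }
        }

        // Step 2: publish the state change without holding the lock.
        notify_nexus.await;

        // Step 3: now it is safe to remove the object. Later polls may get
        // "not found", but Nexus has already committed Destroyed at this
        // generation, so a stale Failed proposal loses the generation race.
        self.objects.lock().unwrap().remove(&id);
    }
}
```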