Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nexus] add RPW for proactively pulling instance state from sled-agents #5611

Merged
merged 73 commits into from
May 10, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
73 commits
Select commit Hold shift + click to select a range
b4b7d7a
initial plumbing for looking up VMMs by sled agent
hawkw Apr 12, 2024
d011d6a
stub sled agent plumbing
hawkw Apr 12, 2024
8e5f758
more wip
hawkw Apr 17, 2024
c4f9654
wip
hawkw Apr 22, 2024
a36b877
wip
hawkw Apr 23, 2024
7422676
wip
hawkw Apr 24, 2024
464c4c0
tear apart the entire `instance_network` module
hawkw Apr 24, 2024
4823392
plumbing
hawkw Apr 24, 2024
ff64c08
oh that's how you're supposed to make child opctxs
hawkw Apr 24, 2024
78fd6fa
add config
hawkw Apr 24, 2024
bedf76e
mv instance_state.rs instance_watcher.rs
hawkw Apr 24, 2024
79d2af2
mv instance_state.rs instance_watcher.rs
hawkw Apr 24, 2024
1c6372b
sketch out retry stuff
hawkw Apr 24, 2024
dfde7c3
rm unused import
hawkw Apr 24, 2024
de3303b
use the `sled` table instead
hawkw Apr 25, 2024
4c68fce
whoopsie bad rebase
hawkw Apr 25, 2024
aa354e7
actually use sleds table correctly
hawkw Apr 25, 2024
eacb233
wip
hawkw Apr 25, 2024
0df165d
new thing
hawkw Apr 25, 2024
06950fe
remove unused import
hawkw Apr 25, 2024
c49780d
rm max retries
hawkw Apr 29, 2024
5ed818f
fix config tests
hawkw Apr 29, 2024
fa5fd02
wip metrics
hawkw Apr 30, 2024
9263bcb
more metric plumbing
hawkw Apr 30, 2024
ddb5792
redo metrics
hawkw Apr 30, 2024
5d18a69
keep lock alive
hawkw Apr 30, 2024
bbc4fd0
whoops
hawkw Apr 30, 2024
dcc9fe9
docs etc
hawkw Apr 30, 2024
30fd552
omdb test update
hawkw Apr 30, 2024
4407df9
additional renamening
hawkw May 1, 2024
3efa0ad
add nicer omdb support
hawkw May 1, 2024
d840e30
Merge branch 'main' into eliza/instance-watch-rpw
hawkw May 1, 2024
224c66a
redo metric structure per @bnaecker's feedback
hawkw May 2, 2024
03de700
update omdb
hawkw May 2, 2024
3f85ebe
most of @bnaecker and @smklein's style suggestions
hawkw May 2, 2024
610acea
back out unneeded changes
hawkw May 2, 2024
6ff3d97
Merge branch 'main' into eliza/instance-watch-rpw
hawkw May 2, 2024
0e6ed21
include project and silo IDs; metrics tweaks
hawkw May 3, 2024
fc7899d
minor style embetterments
hawkw May 3, 2024
49976d2
use fleet authz in instance/vmm query
hawkw May 6, 2024
70b8485
docs
hawkw May 6, 2024
6f0216d
separate instance state and reason in metrics
hawkw May 6, 2024
c1661e9
rm unused code
hawkw May 6, 2024
6caf120
Merge branch 'main' into eliza/instance-watch-rpw
hawkw May 6, 2024
70f2b3f
make instance-watcher query errors more obvious
hawkw May 6, 2024
59e1bc3
add indices so that the vmms-by-sled query works
hawkw May 6, 2024
44ef386
serde keys must be strings :/
hawkw May 6, 2024
23eb079
actually populate sled-agent-sim methods
hawkw May 6, 2024
6b7832a
actually populate sled-agent-sim methods
hawkw May 6, 2024
a207c62
fix wrong counts in omdb output
hawkw May 6, 2024
1aa3cc1
prettify OMDB output
hawkw May 6, 2024
e831a05
put back eaten instance_watcher period configs
hawkw May 8, 2024
10bc937
oh, so *that's* how you add to the schema
hawkw May 8, 2024
065b149
add a test for instance watcher timeseries
hawkw May 8, 2024
75dac39
whoops i meant to fix that one too
hawkw May 8, 2024
8392b12
unbreak expected metric fields
hawkw May 8, 2024
c3694fc
nicer code for finding instance timeserieses
hawkw May 8, 2024
9f83322
Merge branch 'main' into eliza/instance-watch-rpw
hawkw May 8, 2024
40d553b
WELL IF IT ISN'T THE CONSEQUENCES OF MY OWN PRS
hawkw May 8, 2024
f355b75
fix schema diff
hawkw May 8, 2024
d4fdfff
also update omdb tests
hawkw May 8, 2024
96d7b71
allocate way fewer strings with `Cow`
hawkw May 8, 2024
bd37330
fixup stray comment
hawkw May 8, 2024
23c01d3
blegh
hawkw May 8, 2024
c96c7d7
clean up target construction slightly
hawkw May 9, 2024
d703885
prune instances only *after* checks have completed
hawkw May 9, 2024
d1d3fe2
clone slightly less stuff
hawkw May 9, 2024
8be9a8d
less flaky way of activating instance watcher in test
hawkw May 9, 2024
406e69d
simulate more instance state transitions in test
hawkw May 9, 2024
08e01ad
add nexus UUID to check metrics
hawkw May 10, 2024
c909fc7
emit `state: "unknown"` for incomplete checks
hawkw May 10, 2024
d6fe22a
may as well throw in the rack ID too...
hawkw May 10, 2024
6e9155e
whoops i forgot to touch the instances
hawkw May 10, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

93 changes: 88 additions & 5 deletions dev-tools/omdb/src/bin/omdb/nexus.rs
Original file line number Diff line number Diff line change
Expand Up @@ -889,8 +889,95 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
);
}
};
} else if name == "instance_watcher" {
#[derive(Deserialize)]
struct TaskSuccess {
/// total number of instances checked
total_instances: usize,

/// number of stale instance metrics that were deleted
pruned_instances: usize,

/// instance states from completed checks.
///
/// this is a mapping of stringified instance states to the number
/// of instances in that state. these stringified states correspond
/// to the `state` field recorded by the instance watcher's
/// `virtual_machine:check` timeseries with the `healthy` field set
/// to `true`. any changes to the instance state type which cause it
/// to print differently will be counted as a distinct state.
instance_states: BTreeMap<String, usize>,

/// instance check failures.
///
/// this is a mapping of stringified instance check failure reasons
/// to the number of instances with checks that failed for that
/// reason. these stringified failure reasons correspond to the
/// `state` field recorded by the instance watcher's
/// `virtual_machine:check` timeseries with the `healthy` field set
/// to `false`. any changes to the instance state type which cause
/// it to print differently will be counted as a distinct failure
/// reason.
failed_checks: BTreeMap<String, usize>,

/// instance checks that could not be completed successfully.
///
/// this is a mapping of stringified instance check errors
/// to the number of instance checks that were not completed due to
/// that error. these stringified errors correspond to the `reason `
/// field recorded by the instance watcher's
/// `virtual_machine:incomplete_check` timeseries. any changes to
/// the check error type which cause it to print
/// differently will be counted as a distinct check error.
incomplete_checks: BTreeMap<String, usize>,
}

match serde_json::from_value::<TaskSuccess>(details.clone()) {
Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details
),
Ok(TaskSuccess {
total_instances,
pruned_instances,
instance_states,
failed_checks,
incomplete_checks,
}) => {
let total_successes: usize = instance_states.values().sum();
let total_failures: usize = failed_checks.values().sum();
let total_incomplete: usize = incomplete_checks.values().sum();
println!(" total instances checked: {total_instances}",);
println!(
" checks completed: {}",
total_successes + total_failures
);
println!(" successful checks: {total_successes}",);
for (state, count) in &instance_states {
println!(" -> {count} instances {state}")
}

println!(" failed checks: {total_failures}");
for (failure, count) in &failed_checks {
println!(" -> {count} {failure}")
}
println!(
" checks that could not be completed: {total_incomplete}",
);
for (error, count) in &incomplete_checks {
println!(" -> {count} {error} errors")
}
println!(
" stale instance metrics pruned: {pruned_instances}"
);
}
};
} else if name == "service_firewall_rule_propagation" {
match serde_json::from_value::<serde_json::Value>(details.clone()) {
Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details
),
Ok(serde_json::Value::Object(map)) => {
if !map.is_empty() {
eprintln!(
Expand All @@ -902,11 +989,7 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
Ok(val) => {
eprintln!(" unexpected return value from task: {:?}", val)
}
Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details
),
}
};
} else {
println!(
"warning: unknown background task: {:?} \
Expand Down
12 changes: 12 additions & 0 deletions dev-tools/omdb/tests/env.out
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,10 @@ task: "external_endpoints"
on each one


task: "instance_watcher"
periodically checks instance states


task: "inventory_collection"
collects hardware and software inventory data from the whole system

Expand Down Expand Up @@ -179,6 +183,10 @@ task: "external_endpoints"
on each one


task: "instance_watcher"
periodically checks instance states


task: "inventory_collection"
collects hardware and software inventory data from the whole system

Expand Down Expand Up @@ -273,6 +281,10 @@ task: "external_endpoints"
on each one


task: "instance_watcher"
periodically checks instance states


task: "inventory_collection"
collects hardware and software inventory data from the whole system

Expand Down
16 changes: 16 additions & 0 deletions dev-tools/omdb/tests/successes.out
Original file line number Diff line number Diff line change
Expand Up @@ -249,6 +249,10 @@ task: "external_endpoints"
on each one


task: "instance_watcher"
periodically checks instance states


task: "inventory_collection"
collects hardware and software inventory data from the whole system

Expand Down Expand Up @@ -396,6 +400,18 @@ task: "external_endpoints"

TLS certificates: 0

task: "instance_watcher"
configured period: every 30s
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by an explicit signal
started at <REDACTED TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
total instances checked: 0
checks completed: 0
successful checks: 0
failed checks: 0
checks that could not be completed: 0
stale instance metrics pruned: 0

task: "inventory_collection"
configured period: every 10m
currently executing: no
Expand Down
15 changes: 15 additions & 0 deletions nexus-config/src/nexus_config.rs
Original file line number Diff line number Diff line change
Expand Up @@ -375,6 +375,8 @@ pub struct BackgroundTaskConfig {
pub switch_port_settings_manager: SwitchPortSettingsManagerConfig,
/// configuration for region replacement task
pub region_replacement: RegionReplacementConfig,
/// configuration for instance watcher task
pub instance_watcher: InstanceWatcherConfig,
/// configuration for service VPC firewall propagation task
pub service_firewall_propagation: ServiceFirewallPropagationConfig,
}
Expand Down Expand Up @@ -521,6 +523,14 @@ pub struct RegionReplacementConfig {
pub period_secs: Duration,
}

#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct InstanceWatcherConfig {
/// period (in seconds) for periodic activations of this background task
#[serde_as(as = "DurationSeconds<u64>")]
pub period_secs: Duration,
}

#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct ServiceFirewallPropagationConfig {
Expand Down Expand Up @@ -765,6 +775,7 @@ mod test {
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
instance_watcher.period_secs = 30
service_firewall_propagation.period_secs = 300
[default_region_allocation_strategy]
type = "random"
Expand Down Expand Up @@ -894,6 +905,9 @@ mod test {
region_replacement: RegionReplacementConfig {
period_secs: Duration::from_secs(30),
},
instance_watcher: InstanceWatcherConfig {
period_secs: Duration::from_secs(30),
},
service_firewall_propagation:
ServiceFirewallPropagationConfig {
period_secs: Duration::from_secs(300),
Expand Down Expand Up @@ -964,6 +978,7 @@ mod test {
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
instance_watcher.period_secs = 30
service_firewall_propagation.period_secs = 300
[default_region_allocation_strategy]
type = "random"
Expand Down
1 change: 1 addition & 0 deletions nexus/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ itertools.workspace = true
macaddr.workspace = true
# Not under "dev-dependencies"; these also need to be implemented for
# integration tests.
nexus-client.workspace = true
nexus-config.workspace = true
nexus-networking.workspace = true
nexus-test-interface.workspace = true
Expand Down
2 changes: 2 additions & 0 deletions nexus/db-model/src/schema.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1770,3 +1770,5 @@ allow_tables_to_appear_in_same_query!(volume, virtual_provisioning_resource);
allow_tables_to_appear_in_same_query!(ssh_key, instance_ssh_key, instance);
joinable!(instance_ssh_key -> ssh_key (ssh_key_id));
joinable!(instance_ssh_key -> instance (instance_id));

allow_tables_to_appear_in_same_query!(sled, sled_instance);
3 changes: 2 additions & 1 deletion nexus/db-model/src/schema_versions.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ use std::collections::BTreeMap;
///
/// This must be updated when you change the database schema. Refer to
/// schema/crdb/README.adoc in the root of this repository for details.
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(59, 0, 0);
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(60, 0, 0);

/// List of all past database schema versions, in *reverse* order
///
Expand All @@ -29,6 +29,7 @@ static KNOWN_VERSIONS: Lazy<Vec<KnownVersion>> = Lazy::new(|| {
// | leaving the first copy as an example for the next person.
// v
// KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),
KnownVersion::new(60, "add-lookup-vmm-by-sled-id-index"),
KnownVersion::new(59, "enforce-first-as-default"),
KnownVersion::new(58, "insert-default-allowlist"),
KnownVersion::new(57, "add-allowed-source-ips"),
Expand Down
6 changes: 6 additions & 0 deletions nexus/db-model/src/sled_instance.rs
Original file line number Diff line number Diff line change
Expand Up @@ -41,3 +41,9 @@ impl From<SledInstance> for views::SledInstance {
}
}
}

impl SledInstance {
pub fn instance_id(&self) -> Uuid {
self.identity.id
}
}
56 changes: 56 additions & 0 deletions nexus/db-queries/src/db/datastore/instance.rs
Original file line number Diff line number Diff line change
Expand Up @@ -22,18 +22,22 @@ use crate::db::model::Instance;
use crate::db::model::InstanceRuntimeState;
use crate::db::model::Name;
use crate::db::model::Project;
use crate::db::model::Sled;
use crate::db::model::Vmm;
use crate::db::pagination::paginated;
use crate::db::update_and_check::UpdateAndCheck;
use crate::db::update_and_check::UpdateStatus;
use async_bb8_diesel::AsyncRunQueryDsl;
use chrono::Utc;
use diesel::prelude::*;
use nexus_db_model::ApplySledFilterExt;
use nexus_db_model::Disk;
use nexus_db_model::VmmRuntimeState;
use nexus_types::deployment::SledFilter;
use omicron_common::api;
use omicron_common::api::external::http_pagination::PaginatedBy;
use omicron_common::api::external::CreateResult;
use omicron_common::api::external::DataPageParams;
use omicron_common::api::external::DeleteResult;
use omicron_common::api::external::Error;
use omicron_common::api::external::ListResultVec;
Expand Down Expand Up @@ -385,6 +389,58 @@ impl DataStore {
Ok((instance_updated, vmm_updated))
}

/// Lists all instances on in-service sleds with active Propolis VMM
/// processes, returning the instance along with the VMM on which it's
/// running, the sled on which the VMM is running, and the project that owns
/// the instance.
///
/// The query performed by this function is paginated by the sled's UUID.
pub async fn instance_and_vmm_list_by_sled_agent(
&self,
opctx: &OpContext,
pagparams: &DataPageParams<'_, Uuid>,
) -> ListResultVec<(Sled, Instance, Vmm, Project)> {
use crate::db::schema::{
instance::dsl as instance_dsl, project::dsl as project_dsl,
sled::dsl as sled_dsl, vmm::dsl as vmm_dsl,
};
opctx.authorize(authz::Action::Read, &authz::FLEET).await?;
let conn = self.pool_connection_authorized(opctx).await?;

let result = paginated(sled_dsl::sled, sled_dsl::id, pagparams)
.filter(sled_dsl::time_deleted.is_null())
.sled_filter(SledFilter::InService)
.inner_join(
vmm_dsl::vmm
.on(vmm_dsl::sled_id
.eq(sled_dsl::id)
.and(vmm_dsl::time_deleted.is_null()))
.inner_join(
instance_dsl::instance
.on(instance_dsl::id
.eq(vmm_dsl::instance_id)
.and(instance_dsl::time_deleted.is_null()))
.inner_join(
project_dsl::project.on(project_dsl::id
.eq(instance_dsl::project_id)
.and(project_dsl::time_deleted.is_null())),
),
),
)
.sled_filter(SledFilter::InService)
.select((
Sled::as_select(),
Instance::as_select(),
Vmm::as_select(),
Project::as_select(),
))
.load_async::<(Sled, Instance, Vmm, Project)>(&*conn)
.await
.map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?;

Ok(result)
}

pub async fn project_delete_instance(
&self,
opctx: &OpContext,
Expand Down
2 changes: 2 additions & 0 deletions nexus/examples/config.toml
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,8 @@ blueprints.period_secs_execute = 60
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
# How frequently to query the status of active instances.
instance_watcher.period_secs = 30
service_firewall_propagation.period_secs = 300

[default_region_allocation_strategy]
Expand Down
Loading
Loading