Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configuring the VMM reservoir for dev setups could be easier. #4137

Merged
merged 4 commits into from
Nov 10, 2023

Conversation

papertigers
Copy link
Contributor

A potential fix for #3874
This also removes some of the asserts and panics in favor or returning errors directly.

I tested this on a physical box running helios by:

  • Setting vmm_reservoir_percentage and vmm_reservoir_size_mb and observing the error in the SMF logs
  • Setting vmm_reservoir_percentage and confirming with rsrvrctl -q that things looked correct
  • Setting vmm_reservoir_size_mb to 2048 and confirming with rsrvrctl -q that the reservoir was set to 2 GiB.
    (Note I ran rsrvrctl -s 0 between tests.

Would appreciate if you could take a look @jordanhendricks to see if it's what you had in mind.

Copy link
Contributor

@jordanhendricks jordanhendricks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! This is definitely what I was thinking for this configuration. It sounds like you were able to get some testing laps in, but let me know if you need a hand at all there.

Comment on lines 359 to 360
"cannot specify vmm_reservoir_percentage and \
vmm_reservoir_size_mb at the same time"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would probably phrase this as: "only one of vmm_reservoir_percentage and vmm_reservoir_size_mb is allowed" or something similar (instead of referring to the "same time")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 23a25d6

Comment on lines -362 to -404
Some(sz) => {
panic!("invalid requested VMM reservoir percentage: {}", sz);
Copy link
Contributor

@jordanhendricks jordanhendricks Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check was here to verify that the agent would panic if it was configured with a percentage > 100. It looks like this got moved into an error into the callee, which I think is fine, but it would be good to put testing bad values in the config in some testing notes for the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked to see if there was a test that mocked some of this so that I could pass some values to the function but it doesn't look like such a setup exists. Maybe we can use this as an excuse to jump on a bench gimlet and test this together?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, there's no unit test for that behavior. When I merged this PR initially, I just tested it ad hoc (since it's a pretty straightforward thing to test, and only matters for startup of the agent).

but yes, let's do that!

@@ -51,6 +51,9 @@ pub struct Config {
pub sidecar_revision: SidecarRevision,
/// Optional percentage of DRAM to reserve for guest memory
pub vmm_reservoir_percentage: Option<u8>,
/// Optional DRAM to reserve for guest memory in MiB (cannot be used with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe "mutually exclusive option with vmm_reservoir_percentage"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 23a25d6

ReservoirMode::Percentage(percent) => {
if !matches!(percent, 1..=99) {
return Err(Error::ReservoirConfig(format!(
"reservoir percentage of {} must be between 0 and 100",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "VMM reservoir" to be extra clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 23a25d6

_ => {
instances
.set_reservoir_size(&hardware, reservoir_mode)
.map_err(|e| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to use the tap crate for situations like this: tap_err()

@papertigers
Copy link
Contributor Author

At some point it would be nice to mock sled_hardware::HardwareManager and add some of these as unit tests.

Testing this change.

I made sure sled-agent came up fully on the first start:

gimlet-sn06 # grep -i handoff `svcs -L sled-agent` | tail -1 | looker
20:10:50.515Z INFO SledAgent (RSS): Handoff to Nexus is complete
    file = sled-agent/src/rack_setup/service.rs:664

Reservoir using vmm_reservoir_percentage config:

gimlet-sn06 # rg "^vmm_reservoir_percentage|^vmm_reservoir_size_mb" /opt/oxide/sled-agent/pkg/config.toml
46:vmm_reservoir_percentage = 50
gimlet-sn06 # /usr/lib/rsrvrctl -q
Free KiB:       530569216
Allocated KiB:  0
Transient Allocated KiB:        0
Size limit KiB: 1013850461
Reset the reservoir:
gimlet-sn06 # svcadm disable sled-agent
gimlet-sn06 # svcs sled-agent
STATE          STIME    FMRI
disabled       20:21:56 svc:/oxide/sled-agent:default
gimlet-sn06 # /usr/lib/rsrvrctl -q
Free KiB:       530569216
Allocated KiB:  0
Transient Allocated KiB:        0
Size limit KiB: 1013850461
gimlet-sn06 # /usr/lib/rsrvrctl -s 0
gimlet-sn06 # /usr/lib/rsrvrctl -q
Free KiB:       0
Allocated KiB:  0
Transient Allocated KiB:        0
Size limit KiB: 1013850461

Reservoir using vmm_reservoir_size_mb config:

gimlet-sn06 # rg "^vmm_reservoir_percentage|^vmm_reservoir_size_mb" /opt/oxide/sled-agent/pkg/config.toml
51:vmm_reservoir_size_mb = 2048
gimlet-sn06 # /usr/lib/rsrvrctl -q
Free KiB:       2097152
Allocated KiB:  0
Transient Allocated KiB:        0
Size limit KiB: 1013850461
Reset the reservoir:
gimlet-sn06 # svcadm disable sled-agent
gimlet-sn06 # svcs sled-agent
STATE          STIME    FMRI
disabled       20:31:09 svc:/oxide/sled-agent:default
gimlet-sn06 # /usr/lib/rsrvrctl -s 0
gimlet-sn06 # /usr/lib/rsrvrctl -q
Free KiB:       0
Allocated KiB:  0
Transient Allocated KiB:        0
Size limit KiB: 1013850461

Ensure vmm_reservoir_size_mb and vmm_reservoir_percentage are mutually exclusive:

gimlet-sn06 # rg "^vmm_reservoir_percentage|^vmm_reservoir_size_mb" /opt/oxide/sled-agent/pkg/config.toml
46:vmm_reservoir_percentage = 50
51:vmm_reservoir_size_mb = 2048
gimlet-sn06 # svcadm enable sled-agent

Looking at the logs tail -F $(svcs -L sled-agent):

thread 'main' panicked at /home/mike/src/omicron/sled-agent/src/sled_agent.rs:399:35:
only one of vmm_reservoir_percentage and vmm_reservoir_size_mb is allowed
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Ensure vmm_reservoir_percentage is between 0-100:

gimlet-sn06 # rg "^vmm_reservoir_percentage|^vmm_reservoir_size_mb" /opt/oxide/sled-agent/pkg/config.toml
46:vmm_reservoir_percentage = 110

Logs:

20:38:43.343Z ERRO SledAgent: Failed to setup VMM reservoir: Invalid reservoir configuration: VMM reservoir percentage of 110 must be between 0 and 100
    file = sled-agent/src/sled_agent.rs:414
    sled_id = 25f55d14-51a1-4698-9bd7-ac817dc56d88
20:38:43.343Z INFO SledAgent (dropshot (BootstrapAgent)): received request to begin graceful shutdown
    file = /home/mike/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:266
    local_addr = [fdb0:e8ea:6a09:6c32::1]:80
sled-agent: Failed to start sled-agent server: Error managing instances: Invalid reservoir configuration: VMM reservoir percentage of 110 must be between 0 and 100

Ensure passing 0 for a percentage still works:

gimlet-sn06 # rg "^vmm_reservoir_percentage|^vmm_reservoir_size_mb" /opt/oxide/sled-agent/pkg/config.toml
46:vmm_reservoir_percentage = 0
20:42:09.886Z WARN SledAgent: Not using VMM reservoir (size 0 bytes requested)
    file = sled-agent/src/sled_agent.rs:408
    sled_id = 25f55d14-51a1-4698-9bd7-ac817dc56d88

Ensure passing 0 for a mb works:

gimlet-sn06 # rg "^vmm_reservoir_percentage|^vmm_reservoir_size_mb" /opt/oxide/sled-agent/pkg/config.toml
51:vmm_reservoir_size_mb = 0
20:45:37.137Z WARN SledAgent: Not using VMM reservoir (size 0 bytes requested)
    file = sled-agent/src/sled_agent.rs:408
    sled_id = 25f55d14-51a1-4698-9bd7-ac817dc56d88

@papertigers
Copy link
Contributor Author

Re-running the failed test because we hit:

712 | 2023-11-09T18:58:57.243Z | Caused by:
-- | -- | --
713 | 2023-11-09T18:58:57.268Z | API rate limit exceeded for 66.117.152.2. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)
714 | 2023-11-09T18:58:57.292Z | Documentation URL: https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting

Copy link
Contributor

@jordanhendricks jordanhendricks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ship it!

@papertigers papertigers merged commit 1593c98 into oxidecomputer:main Nov 10, 2023
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants