Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sled-agent] Replace du(1) with a Rust function #6482

Merged
merged 8 commits into from
Sep 3, 2024
Merged

[sled-agent] Replace du(1) with a Rust function #6482

merged 8 commits into from
Sep 3, 2024

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Aug 29, 2024

Currently, the zone_bundle::disk_usage function in sled-agent shells
out to du(1) and parses its output. This works fine, but it's a bit
flaky, especially when running the corresponding tests on "weird"
platforms such as NixOS, where the du(1) binary doesn't live in
/usr/bin/du (see #6479).

This commit replaces the code that shells out to /usr/bin/du with a
pure-Rust rewrite. There's a unit test that asserts that the disk usage
calculated by the new implementation matches the value calculated by
du(1), with code for handling the potential inaccuracy due to du(1)
operating in terms of 512B blocks on macOS.

Closes #6479

hawkw added 2 commits August 29, 2024 11:10
Currently, the `zone_bundle::disk_usage` function in sled-agent shells
out to `du(1)` and parses its output. This works fine, but it's a bit
flaky, especially when running the corresponding tests on "weird"
platforms such as NixOS, where the `du(1)` binary doesn't live in
`/usr/bin/du` (see #6479).

This commit replaces the code that shells out to `/usr/bin/du` with a
pure-Rust rewrite. There's a unit test that asserts that the disk usage
calculated by the new implementation matches the value calculated by
`du(1)`, with code for handling the potential inaccuracy due to `du(1)`
operating in terms of 512B blocks on macOS.

Closes #6479
Copy link
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thanks! A few nits, but looks good to me.

sled-agent/src/zone_bundle.rs Outdated Show resolved Hide resolved
sled-agent/src/zone_bundle.rs Outdated Show resolved Hide resolved
@hawkw
Copy link
Member Author

hawkw commented Aug 29, 2024

@bnaecker ab6aaf0 addresses both of your suggestions, thank you!

} else {
compile_error!("unsupported target OS");
let path = path.clone();
// We could, alternatively, just implement this using `tokio::fs` to read
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not looked at this much yet but I wonder if we should just use the walkdir crate that we already have? I feel like there are so many edge cases in doing traversal that it's worth using a crate (but I don't feel strongly).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, i didn't realize that was in our dependency tree, I'll take a look at it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b4e8731 changes this to use walkdir. I will add that, on my machine, the walkdir implementation takes ~100 ms to calculate the size of sled-agent/src, while my previous implementation did it in ~8 ms, and du(1) only takes about 6 ms. I'm not sure if we care about performance here that much, though, so that may not matter a whole lot --- making sure we're handling edge cases is definitely important, but we're also not using all that much of walkdir's functionality (e.g. traversing symlinks).

Copy link
Contributor

@sunshowers sunshowers Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at https://github.com/uutils/coreutils/blob/8fe93ded741f8257e9d0a5bce12f904e09227a59/src/uu/du/src/du.rs#L319, at least du looks pretty simple. There are definitely issues around traversing symlinks and bind mounts (loops in general) but I guess once we're not traversing symlinks, the only other check one could do is to ensure we don't cross filesystems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also that code's recursive, huh. I'd definitely write this using a heap-allocated stack personally.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also that code's recursive, huh. I'd definitely write this using a heap-allocated stack personally.

That was what my original implementation did! :)

@hawkw hawkw requested a review from davepacheco August 29, 2024 21:17
// - Parsing stdout fails
// - Parsing the actual size as a u64 fails
// This returns an error if reading any file or directory within the provided
// directory fails.
async fn disk_usage(path: &Utf8PathBuf) -> Result<u64, BundleError> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's probably worth considering a different name here, given the function appears to be assessing the apparent size of a tree of files rather than the actual disk usage. I definitely would have guessed the other way based on the name!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, that's a good point --- du(1) was actually reporting physical disk usage. We should probably change this to be more faithful to du rather than reporting logical size...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, that's a good point --- du(1) was actually reporting physical disk usage.

I don't believe it was, because we were using the -A flag. To be clear I think the name was already confusing haha

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, okay, I can just rename the function, then. Unless we want the other behavior --- I'm not totally sure. @bnaecker, this is your code, right?

Copy link
Collaborator

@bnaecker bnaecker Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, yeah. I have no preference on the name, that was chosen since it was calling out to du. Whatever you want to call it is fine by me! Josh is correct that the calling code is using that to find the size of a tree of files, during a check against the zone bundle dataset quota.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

772cd16 renames it to dir_size, but I'm happy to change the name again if there's something y'all think would be clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can that be &Utf8Path btw? I know you didn't touch that specific line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can, yeah — I just didn't want to mess with the surrounding code as part of a quick fix. Happy to go and change it as well.

@hawkw
Copy link
Member Author

hawkw commented Aug 30, 2024

Hmm, it looks like the disk_usage_matches_du test has been pretty consistently failing on CI because the Rust function reports a few fewer bytes than du: https://buildomat.eng.oxide.computer/wg/0/details/01J6J0P8D5V1YZR7AVT2R5F6DV/XrNrdk6kN2AdU90PUSVGK8gFhIwk9AmyWUCN1apAyz3vMdVU/01J6J0PPKMWXBWZ7SK1QQV9QVE#S5290

Interestingly, on my Linux machine, I've been pretty consistently seeing the exact same sizes. I wonder what the difference in behavior is caused by...

@hawkw
Copy link
Member Author

hawkw commented Aug 30, 2024

Maybe the problem is that I'm skipping directories, and du isn't? Commit 8fe21ff changes it to include the size of dirs in the sum...on my machine, this causes the test to fail:

--- STDERR:              omicron-sled-agent zone_bundle::tests::test_dir_size_matches_du ---
du --version:
du (GNU coreutils) 9.5
Packaged by https://nixos.org
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Torbjörn Granlund, David MacKenzie, Paul Eggert,
and Jim Meyering.


[sled-agent/src/zone_bundle.rs:1725:13] dir_size(&path).await = Ok(
    1340678,
)
dir_size(/home/eliza/Code/oxide/omicron/sled-agent/src) took 865.155µs
[sled-agent/src/zone_bundle.rs:1730:13] dir_size_du(&path).await = Ok(
    1340591,
)
du -s /home/eliza/Code/oxide/omicron/sled-agent/src took 1.866788ms
thread 'zone_bundle::tests::test_dir_size_matches_du' panicked at sled-agent/src/zone_bundle.rs:1742:9:
assertion `left == right` failed: expected `dir_size(/home/eliza/Code/oxide/omicron/sled-agent/src)` to == `du -s /home/eliza/Code/oxide/omicron/sled-agent/src`
1340678B (`dir_size`) != 1340591B (`du`)
  left: 1340678
 right: 1340591
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

But perhaps it'll start passing on CI? And then we can, once again, return to where we started: this test always fails on my computer specifically. :)

(as you can see above, I also made the test run du --version and print the output, so we can see what's running on the CI machines and how it might differ from my du)

@hawkw
Copy link
Member Author

hawkw commented Aug 30, 2024

Welp. It looks like including directories in the size calculation makes the behavior match du on the Ubuntu test runners: https://github.com/oxidecomputer/omicron/pull/6482/checks?check_run_id=29492777413

But, it makes it no longer match the version of du on my computer. That's so cool! I hate everything and want to die!!!

@jclulow
Copy link
Collaborator

jclulow commented Aug 31, 2024

FWIW, I would just make the test pass with illumos and turn it off on other platforms. We only really run the sled agent there anyway.

@hawkw
Copy link
Member Author

hawkw commented Aug 31, 2024

FWIW, I would just make the test pass with illumos and turn it off on other platforms. We only really run the sled agent there anyway.

Yeah...I had really hoped rewriting the function in Rust would mean that we could run the tests everywhere, but, well, the original goal of this change was to make the test stop failing on my Linux box, so...

@hawkw
Copy link
Member Author

hawkw commented Sep 3, 2024

@jclulow @davepacheco are either of you interested in weighing in on this again before I merge it? Hopefully at this point it's fairly uncontroversial...

@davepacheco
Copy link
Collaborator

Go for it!

@hawkw hawkw merged commit b6a098d into main Sep 3, 2024
23 checks passed
@hawkw hawkw deleted the eliza/du-rust branch September 3, 2024 18:05
Copy link

@aapoalas aapoalas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Peanut gallery checking in with one question that popped to mind reading this.

Definitely uncontroversial :) I would press "Approve" but I expect that's not supposed to be available to me. It currently is, though. Only hubris repo's settings were previously changed to not allow approval and request-changes from outsiders.

EDIT: Oop, missed by seconds :D

const DU_ARG: &str = "-k";
} else {
compile_error!("unsupported target OS");
// This returns an error if reading any file or directory within the provided
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Is there a reason that these are not doc comments?

@@ -582,6 +582,11 @@ pub enum BundleError {

#[error("Instance is terminating")]
InstanceTerminating,

/// The `walkdir` crate's errors already include the path which could not be
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: Good explanation <3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants