-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sled-agent] Replace du(1)
with a Rust function
#6482
Conversation
Currently, the `zone_bundle::disk_usage` function in sled-agent shells out to `du(1)` and parses its output. This works fine, but it's a bit flaky, especially when running the corresponding tests on "weird" platforms such as NixOS, where the `du(1)` binary doesn't live in `/usr/bin/du` (see #6479). This commit replaces the code that shells out to `/usr/bin/du` with a pure-Rust rewrite. There's a unit test that asserts that the disk usage calculated by the new implementation matches the value calculated by `du(1)`, with code for handling the potential inaccuracy due to `du(1)` operating in terms of 512B blocks on macOS. Closes #6479
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice thanks! A few nits, but looks good to me.
} else { | ||
compile_error!("unsupported target OS"); | ||
let path = path.clone(); | ||
// We could, alternatively, just implement this using `tokio::fs` to read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not looked at this much yet but I wonder if we should just use the walkdir
crate that we already have? I feel like there are so many edge cases in doing traversal that it's worth using a crate (but I don't feel strongly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, i didn't realize that was in our dependency tree, I'll take a look at it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b4e8731 changes this to use walkdir
. I will add that, on my machine, the walkdir
implementation takes ~100 ms to calculate the size of sled-agent/src
, while my previous implementation did it in ~8 ms, and du(1)
only takes about 6 ms. I'm not sure if we care about performance here that much, though, so that may not matter a whole lot --- making sure we're handling edge cases is definitely important, but we're also not using all that much of walkdir
's functionality (e.g. traversing symlinks).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at https://github.com/uutils/coreutils/blob/8fe93ded741f8257e9d0a5bce12f904e09227a59/src/uu/du/src/du.rs#L319, at least du
looks pretty simple. There are definitely issues around traversing symlinks and bind mounts (loops in general) but I guess once we're not traversing symlinks, the only other check one could do is to ensure we don't cross filesystems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also that code's recursive, huh. I'd definitely write this using a heap-allocated stack personally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also that code's recursive, huh. I'd definitely write this using a heap-allocated stack personally.
That was what my original implementation did! :)
sled-agent/src/zone_bundle.rs
Outdated
// - Parsing stdout fails | ||
// - Parsing the actual size as a u64 fails | ||
// This returns an error if reading any file or directory within the provided | ||
// directory fails. | ||
async fn disk_usage(path: &Utf8PathBuf) -> Result<u64, BundleError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's probably worth considering a different name here, given the function appears to be assessing the apparent size of a tree of files rather than the actual disk usage. I definitely would have guessed the other way based on the name!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, that's a good point --- du(1)
was actually reporting physical disk usage. We should probably change this to be more faithful to du
rather than reporting logical size...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want std::fs::unix::MetadataExt::blocks
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, that's a good point ---
du(1)
was actually reporting physical disk usage.
I don't believe it was, because we were using the -A
flag. To be clear I think the name was already confusing haha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, okay, I can just rename the function, then. Unless we want the other behavior --- I'm not totally sure. @bnaecker, this is your code, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, yeah. I have no preference on the name, that was chosen since it was calling out to du
. Whatever you want to call it is fine by me! Josh is correct that the calling code is using that to find the size of a tree of files, during a check against the zone bundle dataset quota.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
772cd16 renames it to dir_size
, but I'm happy to change the name again if there's something y'all think would be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can that be &Utf8Path btw? I know you didn't touch that specific line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can, yeah — I just didn't want to mess with the surrounding code as part of a quick fix. Happy to go and change it as well.
Hmm, it looks like the Interestingly, on my Linux machine, I've been pretty consistently seeing the exact same sizes. I wonder what the difference in behavior is caused by... |
Maybe the problem is that I'm skipping directories, and
But perhaps it'll start passing on CI? And then we can, once again, return to where we started: this test always fails on my computer specifically. :) (as you can see above, I also made the test run |
Welp. It looks like including directories in the size calculation makes the behavior match But, it makes it no longer match the version of |
FWIW, I would just make the test pass with illumos and turn it off on other platforms. We only really run the sled agent there anyway. |
Yeah...I had really hoped rewriting the function in Rust would mean that we could run the tests everywhere, but, well, the original goal of this change was to make the test stop failing on my Linux box, so... |
@jclulow @davepacheco are either of you interested in weighing in on this again before I merge it? Hopefully at this point it's fairly uncontroversial... |
Go for it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Peanut gallery checking in with one question that popped to mind reading this.
Definitely uncontroversial :) I would press "Approve" but I expect that's not supposed to be available to me. It currently is, though. Only hubris repo's settings were previously changed to not allow approval and request-changes from outsiders.
EDIT: Oop, missed by seconds :D
const DU_ARG: &str = "-k"; | ||
} else { | ||
compile_error!("unsupported target OS"); | ||
// This returns an error if reading any file or directory within the provided |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Is there a reason that these are not doc comments?
@@ -582,6 +582,11 @@ pub enum BundleError { | |||
|
|||
#[error("Instance is terminating")] | |||
InstanceTerminating, | |||
|
|||
/// The `walkdir` crate's errors already include the path which could not be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: Good explanation <3
Currently, the
zone_bundle::disk_usage
function in sled-agent shellsout to
du(1)
and parses its output. This works fine, but it's a bitflaky, especially when running the corresponding tests on "weird"
platforms such as NixOS, where the
du(1)
binary doesn't live in/usr/bin/du
(see #6479).This commit replaces the code that shells out to
/usr/bin/du
with apure-Rust rewrite. There's a unit test that asserts that the disk usage
calculated by the new implementation matches the value calculated by
du(1)
, with code for handling the potential inaccuracy due todu(1)
operating in terms of 512B blocks on macOS.
Closes #6479