Don't unwrap when we can't create a dataset #992
Conversation
If we can't create a dataset for a region, don't panic, just fail.
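To illustrate the general idea, here is a minimal sketch of the pattern, assuming hypothetical create_dataset and mark_failed helpers rather than the agent's actual API:

// Hypothetical sketch: instead of unwrapping the dataset-creation result
// (which panics the worker when, say, ZFS is out of space), log the error
// and record the region as failed so callers can observe the failure.
fn create_region_dataset(
    create_dataset: impl Fn() -> Result<(), String>,
    mark_failed: impl Fn(&str),
    region_id: &str,
) {
    // Before: create_dataset().unwrap();
    if let Err(e) = create_dataset() {
        eprintln!("dataset create for region {} failed: {}", region_id, e);
        mark_failed(region_id);
    }
}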
What would you think about also making a local variable for this match arm, and doing the same elsewhere?
More unwraps removed, or, less unwraps, or .. fewer.
Attached is a nexus log from the disk create that fails when there is not enough space. This change, while not 100% of what we want, will at least prevent us from going off the edge.
I agree with these changes; however, we may want to think about retrying until success here. Nexus will issue a request and then poll the agent, waiting for the state transition to take place. If there's an error here, Nexus will poll indefinitely.
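For context, a rough sketch of the polling concern (illustrative only; the names and states below are assumptions, not Nexus's actual code):

use std::{thread, time::Duration};

enum RegionState {
    Requested,
    Created,
    Failed,
}

// If the agent hits an error but leaves the region in Requested forever,
// a loop like this never exits; surfacing a Failed state gives the caller
// a way to stop polling and unwind.
fn wait_for_region(get_state: impl Fn() -> RegionState) -> Result<(), String> {
    loop {
        match get_state() {
            RegionState::Created => return Ok(()),
            RegionState::Failed => return Err("region creation failed".to_string()),
            RegionState::Requested => thread::sleep(Duration::from_secs(1)),
        }
    }
}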
Actually, scratch what I said before! If the agent attempts to retry and we're out of space, that's not desirable.
agent/src/main.rs (Outdated)
        &r.id.0,
        e,
    );
    if let Err(e) = df.destroyed(&r.id) {
Why not df.fail here?
I had df.fail() there originally, but then the disk create saga unwinds and tries to delete the disk, and we can't delete a failed disk. I also tried making it Tombstoned, and that made the saga unhappy as well.
The current stuff here does not require any changes on the Omicron side. I think eventually we may want to pass a specific message back to Omicron that we lack the space for this request. Also in the work-to-do column is having Omicron do better accounting and not even try to allocate a disk if we don't have space for it.
The way it works now, we just fail the request and the saga unwinds and reports a 500 back to the user. Not the best, but better than the agent panicking :)
Now Fail is a valid state to then destroy.
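A rough sketch of what that could look like as a transition check (the State variants and the can_destroy helper are assumptions drawn from the hunks in this thread, not the actual datafile code):

#[derive(Debug)]
enum State {
    Requested,
    Created,
    Tombstoned,
    Destroyed,
    Failed,
}

// Hypothetical: states from which a destroy request is accepted.
// Failed is now included, so the unwinding disk create saga can delete
// a region whose dataset creation failed.
fn can_destroy(state: &State) -> bool {
    matches!(
        state,
        State::Requested | State::Created | State::Tombstoned | State::Failed
    )
}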
@@ -773,7 +762,10 @@ impl DataFile {
        let region = region.unwrap();

        match region.state {
            State::Requested | State::Destroyed | State::Tombstoned => {
I don't think Nexus is doing the right thing if it's asking for snapshots from a failed region - where did the call come from?
As part of the region delete, it first checks for snapshots; see agent/src/server.rs:
#[endpoint {
    method = DELETE,
    path = "/crucible/0/regions/{id}",
}]
async fn region_delete(
    rc: RequestContext<Arc<DataFile>>,
    path: TypedPath<RegionPath>,
) -> SResult<HttpResponseDeleted, HttpError> {
    let p = path.into_inner();

    // Cannot delete a region that's backed by a ZFS dataset if there are
    // snapshots.
    let snapshots = match rc.context().get_snapshots_for_region(&p.id) {
        Ok(results) => results,
        Err(e) => {
            return Err(HttpError::for_internal_error(e.to_string()));
        }
    };
Without that change, the delete is refused because we get an error back from the get_snapshots call.
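From the caller's side, the effect can be sketched like this (a hypothetical helper, not the agent's actual code): if the snapshot lookup errors for a failed region, the delete is refused; if it returns an empty list, the delete can go ahead.

// Hypothetical caller-side view of region_delete's snapshot check.
fn try_delete(snapshots: Result<Vec<String>, String>) -> Result<(), String> {
    match snapshots {
        // Before the change: a Failed region produced an error here,
        // so the delete was refused.
        Err(e) => Err(format!("delete refused: {}", e)),
        // Snapshots still exist: the ZFS dataset cannot be destroyed.
        Ok(list) if !list.is_empty() => Err("region has snapshots".to_string()),
        // No snapshots: safe to destroy the region.
        Ok(_) => Ok(()),
    }
}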
If we can't create a dataset for a region, don't panic, just fail. This removes all the unwrap() calls from the worker function. Here is the log from the agent when a region creation fails: