
Don't unwrap when we can't create a dataset #992

Merged: leftwo merged 6 commits into main from alan/unwrapless on Oct 13, 2023

Conversation

@leftwo (Contributor) commented on Oct 6, 2023

If we can't create a dataset for a region, don't panic, just fail.

This removes all the unwrap() calls from the worker function.

Here is the log from the agent when a region creation fails:

21:58:17.488Z INFO crucible-agent (worker): Region size:53687091200 reservation:67108864000 quota:161061273600
21:58:17.505Z INFO crucible-agent (worker): zfs set reservation of 67108864000 for oxp_31bd71cd-4736-4a12-a387-9b74b050396f/crucible/regions/b3eec9cd-1152-4a50-aa41-8a87cd8c916a
21:58:17.505Z INFO crucible-agent (worker): zfs set quota of 161061273600 for oxp_31bd71cd-4736-4a12-a387-9b74b050396f/crucible/regions/b3eec9cd-1152-4a50-aa41-8a87cd8c916a
21:58:17.557Z ERRO crucible-agent (worker): Dataset b3eec9cd-1152-4a50-aa41-8a87cd8c916a creation failed: zfs create failed! out: err:cannot create 'oxp_31bd71cd-4736-4a12-a387-9b74b050396f/crucible/regions/b3eec9cd-1152-4a50-aa41-8a87cd8c916a': out of space
21:58:17.557Z INFO crucible-agent (datafile): region b3eec9cd-1152-4a50-aa41-8a87cd8c916a state: Requested -> Destroyed
21:58:17.792Z INFO crucible-agent (dropshot): request completed
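
To sketch the shape of the fix (a minimal, self-contained stand-in; State, try_create_dataset, and worker_step below are illustrative names, not the actual crucible-agent code), the worker now matches on the creation result and fails the region instead of unwrapping and panicking the whole agent:

#[derive(Debug, PartialEq)]
enum State {
    Requested,
    Failed,
}

// Stand-in for the dataset creation step that can fail, for example when the
// pool is out of space.
fn try_create_dataset(available_bytes: u64, requested_bytes: u64) -> Result<(), String> {
    if requested_bytes > available_bytes {
        return Err("zfs create failed: out of space".to_string());
    }
    Ok(())
}

fn worker_step(state: &mut State, available: u64, requested: u64) {
    // Previously this was effectively try_create_dataset(..).unwrap(), which
    // took down the whole agent; now the error just fails the one region.
    match try_create_dataset(available, requested) {
        Ok(()) => { /* continue provisioning the region */ }
        Err(e) => {
            eprintln!("dataset creation failed: {e}");
            *state = State::Failed;
        }
    }
}

fn main() {
    let mut state = State::Requested;
    worker_step(&mut state, 50, 100);
    assert_eq!(state, State::Failed);
}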

leftwo requested a review from jmpesp on October 6, 2023 20:01
@leftwo (Contributor, Author) commented on Oct 6, 2023

What would you think if we also make:
&region_dataset.path().unwrap(),

a local variable in this match arm, and do the same break as requested if we can't get the name?
I see there are a few places with that same unwrap, but I don't know if we should forge ahead when
that unwrap fails, or also df.fail(&r.id) and break.
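
For illustration, the suggestion above might look roughly like this; RegionDataset and path() here are self-contained stand-ins, not the agent's real types:

struct RegionDataset {
    path: Option<String>,
}

impl RegionDataset {
    // Stand-in for the real region_dataset.path(), which can fail.
    fn path(&self) -> Result<String, String> {
        self.path.clone().ok_or_else(|| "dataset has no path".to_string())
    }
}

fn main() {
    let datasets = vec![
        RegionDataset { path: Some("/data/regions/r1".to_string()) },
        RegionDataset { path: None },
    ];

    for region_dataset in &datasets {
        // Instead of region_dataset.path().unwrap(), bind the path to a local
        // and fail the region (df.fail(&r.id) in the agent) before bailing out
        // of the match arm.
        let dataset_path = match region_dataset.path() {
            Ok(p) => p,
            Err(e) => {
                eprintln!("cannot get dataset path, failing region: {e}");
                continue;
            }
        };
        println!("using dataset path {dataset_path}");
    }
}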

leftwo marked this pull request as ready for review on October 9, 2023 17:50
@leftwo (Contributor, Author) commented on Oct 9, 2023

More unwraps removed, or, less unwraps, or .. fewer.

leftwo added this to the 3 milestone on Oct 9, 2023
@leftwo (Contributor, Author) commented on Oct 11, 2023

Attached is a nexus log from the disk create that fails when there is not enough space:

saga-log.txt

This change, while not 100% of what we want, will at least prevent us from going off the edge
and panicking the agent.

@jmpesp (Contributor) left a comment

I agree with these changes, however we may want to think about retrying until success here. Nexus will issue a request and then poll the agent waiting for the state transition to take place. If there's an error here, Nexus will poll indefinitely.
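
To sketch the concern (hypothetical types; this is not Nexus's actual polling code), a poller that only recognizes a successful transition would spin forever unless the agent eventually reports some terminal state for the failed request:

#[derive(PartialEq)]
enum RegionState {
    Requested,
    Created,
    Failed,
}

// Stand-in for Nexus asking the agent for the current region state.
fn poll_agent(attempt: u32, enough_space: bool) -> RegionState {
    if attempt < 3 {
        RegionState::Requested
    } else if enough_space {
        RegionState::Created
    } else {
        RegionState::Failed
    }
}

fn main() {
    let mut attempt = 0;
    loop {
        attempt += 1;
        let state = poll_agent(attempt, false);
        if state == RegionState::Created {
            println!("region ready");
            break;
        }
        // If the agent left a failed request sitting in Requested forever,
        // this loop would never terminate; reporting a terminal state lets
        // the caller stop polling and unwind.
        if state == RegionState::Failed {
            println!("region creation failed; stop polling and unwind");
            break;
        }
    }
}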

@jmpesp (Contributor) left a comment

Actually, scratch what I said before! If the agent attempts to retry and we're out of space, that's not desirable.

&r.id.0,
e,
);
if let Err(e) = df.destroyed(&r.id) {
Contributor:

Why not df.fail here?

Contributor Author:

I had df.fail() there originally, but then the disk create saga unwinds and tries to delete the
disk, and we can't delete a failed disk. I also had tried making it Tombstoned, and that made
the saga unhappy as well.

The current change here does not require any changes on the Omicron side. I think eventually we
may want to pass a specific message back to Omicron saying that we lack the space for this request.
Also in the to-do column is having Omicron do better accounting and not even try to allocate
a disk if we don't have space for it.

@leftwo (Contributor, Author) commented on Oct 12, 2023

Actually, scratch what I said before! If the agent attempts to retry and we're out of space, that's not desirable.

The way it works now, we just fail the request and the saga unwinds and reports a 500 back to the user:

error; status code: 500 Internal Server Error
{
  "error_code": "Internal",
  "message": "Internal Server Error",
  "request_id": "998d13ff-29d8-482e-841d-2d305928bbfd"
}

Not the best, but better than the agent panicking :)
I think a longer-term solution can come with Omicron changes to better calculate disk resources to begin with,
as well as to take more return codes from the agent and pass them back to the user.

leftwo requested a review from jmpesp on October 13, 2023 00:19
@leftwo (Contributor, Author) commented on Oct 13, 2023

Now Fail is a valid state to then destroy.
No Omicron side changes needed.
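
As a toy illustration of that change (not the agent's real datafile logic; the State enum and can_destroy below are stand-ins), destroying a region is now permitted from the Failed state too, so the unwinding disk-create saga can clean up a region whose creation ran out of space:

enum State {
    Requested,
    Tombstoned,
    Failed,
    Destroyed,
}

fn can_destroy(state: State) -> bool {
    match state {
        // Before this change Failed was not accepted here, so a region whose
        // dataset creation failed could never be cleaned up by the unwinding
        // disk-create saga.
        State::Requested | State::Tombstoned | State::Failed => true,
        // Nothing left to do for a region that is already gone.
        State::Destroyed => false,
    }
}

fn main() {
    assert!(can_destroy(State::Failed));
    assert!(can_destroy(State::Requested) && can_destroy(State::Tombstoned));
    assert!(!can_destroy(State::Destroyed));
    println!("Failed -> Destroyed is now a valid transition");
}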

@@ -773,7 +762,10 @@ impl DataFile {
let region = region.unwrap();

match region.state {
State::Requested | State::Destroyed | State::Tombstoned => {
Contributor:

I don't think Nexus is doing the right thing if it's asking for snapshots from a failed region - where did the call come from?

Contributor Author:

As part of the region delete, it first checks for snapshots (agent/src/server.rs):

#[endpoint {
    method = DELETE,
    path = "/crucible/0/regions/{id}",
}]
async fn region_delete(
    rc: RequestContext<Arc<DataFile>>,
    path: TypedPath<RegionPath>,
) -> SResult<HttpResponseDeleted, HttpError> {
    let p = path.into_inner();

    // Cannot delete a region that's backed by a ZFS dataset if there are
    // snapshots.
    let snapshots = match rc.context().get_snapshots_for_region(&p.id) {
        Ok(results) => results,
        Err(e) => {
            return Err(HttpError::for_internal_error(e.to_string()));
        }
    };

Without that, the delete is refused because we get an error back from the get_snapshots call.

leftwo merged commit b7a6856 into main on Oct 13, 2023
18 checks passed
leftwo deleted the alan/unwrapless branch on October 13, 2023 15:26