
Don't unwrap when we can't create a dataset #992

Merged: leftwo merged 6 commits into main from alan/unwrapless on Oct 13, 2023

Conversation

@leftwo (Contributor) commented on Oct 6, 2023

If we can't create a dataset for a region, don't panic, just fail.

This removes all the unwrap() calls from the worker function.

Here is the log from the agent when a region creation fails:

21:58:17.488Z INFO crucible-agent (worker): Region size:53687091200 reservation:67108864000 quota:161061273600
21:58:17.505Z INFO crucible-agent (worker): zfs set reservation of 67108864000 for oxp_31bd71cd-4736-4a12-a387-9b74b050396f/crucible/regions/b3eec9cd-1152-4a50-aa41-8a87cd8c916a
21:58:17.505Z INFO crucible-agent (worker): zfs set quota of 161061273600 for oxp_31bd71cd-4736-4a12-a387-9b74b050396f/crucible/regions/b3eec9cd-1152-4a50-aa41-8a87cd8c916a
21:58:17.557Z ERRO crucible-agent (worker): Dataset b3eec9cd-1152-4a50-aa41-8a87cd8c916a creation failed: zfs create failed! out: err:cannot create 'oxp_31bd71cd-4736-4a12-a387-9b74b050396f/crucible/regions/b3eec9cd-1152-4a50-aa41-8a87cd8c916a': out of space
21:58:17.557Z INFO crucible-agent (datafile): region b3eec9cd-1152-4a50-aa41-8a87cd8c916a state: Requested -> Destroyed
21:58:17.792Z INFO crucible-agent (dropshot): request completed
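
To sketch the shape of the fix (a minimal, self-contained stand-in; State, try_create_dataset, and worker_step below are illustrative names, not the actual crucible-agent code), the worker now matches on the creation result and fails the region instead of unwrapping and panicking the whole agent:

#[derive(Debug, PartialEq)]
enum State {
    Requested,
    Failed,
}

// Stand-in for the dataset creation step that can fail, for example when the
// pool is out of space.
fn try_create_dataset(available_bytes: u64, requested_bytes: u64) -> Result<(), String> {
    if requested_bytes > available_bytes {
        return Err("zfs create failed: out of space".to_string());
    }
    Ok(())
}

fn worker_step(state: &mut State, available: u64, requested: u64) {
    // Previously this was effectively try_create_dataset(..).unwrap(), which
    // took down the whole agent; now the error just fails the one region.
    match try_create_dataset(available, requested) {
        Ok(()) => { /* continue provisioning the region */ }
        Err(e) => {
            eprintln!("dataset creation failed: {e}");
            *state = State::Failed;
        }
    }
}

fn main() {
    let mut state = State::Requested;
    worker_step(&mut state, 50, 100);
    assert_eq!(state, State::Failed);
}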

leftwo requested a review from jmpesp on October 6, 2023 20:01
@leftwo (Contributor, Author) commented on Oct 6, 2023

What would you think if we also make:
&region_dataset.path().unwrap(),

a local variable in this match arm, and do the same break as requested if we can't get the name?
I see there are a few places with that same unwrap, but I don't know if we should forge ahead when
that unwrap fails, or also df.fail(&r.id) and break.
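
For illustration, the suggestion above might look roughly like this; RegionDataset and path() here are self-contained stand-ins, not the agent's real types:

struct RegionDataset {
    path: Option<String>,
}

impl RegionDataset {
    // Stand-in for the real region_dataset.path(), which can fail.
    fn path(&self) -> Result<String, String> {
        self.path.clone().ok_or_else(|| "dataset has no path".to_string())
    }
}

fn main() {
    let datasets = vec![
        RegionDataset { path: Some("/data/regions/r1".to_string()) },
        RegionDataset { path: None },
    ];

    for region_dataset in &datasets {
        // Instead of region_dataset.path().unwrap(), bind the path to a local
        // and fail the region (df.fail(&r.id) in the agent) before bailing out
        // of the match arm.
        let dataset_path = match region_dataset.path() {
            Ok(p) => p,
            Err(e) => {
                eprintln!("cannot get dataset path, failing region: {e}");
                continue;
            }
        };
        println!("using dataset path {dataset_path}");
    }
}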

leftwo marked this pull request as ready for review on October 9, 2023 17:50
@leftwo (Contributor, Author) commented on Oct 9, 2023

More unwraps removed, or, less unwraps, or .. fewer.

leftwo added this to the 3 milestone on Oct 9, 2023
@leftwo (Contributor, Author) commented on Oct 11, 2023

Attached is a nexus log from the disk create that fails when there is not enough space:

saga-log.txt

This change, while not 100% of what we want, will at least prevent us from going off the edge
and panicking the agent.

@jmpesp (Contributor) left a comment

I agree with these changes, however we may want to think about retrying until success here. Nexus will issue a request and then poll the agent waiting for the state transition to take place. If there's an error here, Nexus will poll indefinitely.
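
To sketch the concern (hypothetical types; this is not Nexus's actual polling code), a poller that only recognizes a successful transition would spin forever unless the agent eventually reports some terminal state for the failed request:

#[derive(PartialEq)]
enum RegionState {
    Requested,
    Created,
    Failed,
}

// Stand-in for Nexus asking the agent for the current region state.
fn poll_agent(attempt: u32, enough_space: bool) -> RegionState {
    if attempt < 3 {
        RegionState::Requested
    } else if enough_space {
        RegionState::Created
    } else {
        RegionState::Failed
    }
}

fn main() {
    let mut attempt = 0;
    loop {
        attempt += 1;
        let state = poll_agent(attempt, false);
        if state == RegionState::Created {
            println!("region ready");
            break;
        }
        // If the agent left a failed request sitting in Requested forever,
        // this loop would never terminate; reporting a terminal state lets
        // the caller stop polling and unwind.
        if state == RegionState::Failed {
            println!("region creation failed; stop polling and unwind");
            break;
        }
    }
}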

@jmpesp (Contributor) left a comment

Actually, scratch what I said before! If the agent attempts to retry and we're out of space, that's not desirable.

&r.id.0,
e,
);
if let Err(e) = df.destroyed(&r.id) {
Contributor:

Why not df.fail here?

Contributor Author:

I had df.fail() there originally, but then the disk create saga unwinds and tries to delete the
disk, and we can't delete a failed disk. I also had tried making it Tombstoned, and that made
the saga unhappy as well.

The current change here does not require any changes on the Omicron side. I think eventually we
may want to pass a specific message back to Omicron saying that we lack the space for this request.
Also in the to-do column is having Omicron do better accounting and not even try to allocate
a disk if we don't have space for it.

@leftwo (Contributor, Author) commented on Oct 12, 2023

Actually, scratch what I said before! If the agent attempts to retry and we're out of space, that's not desirable.

The way it works now, we just fail the request and the saga unwinds and reports a 500 back to the user:

error; status code: 500 Internal Server Error
{
  "error_code": "Internal",
  "message": "Internal Server Error",
  "request_id": "998d13ff-29d8-482e-841d-2d305928bbfd"
}

Not the best, but better than the agent panicking :)
I think a longer-term solution can come with Omicron changes to better calculate disk resources to begin with,
as well as to take more return codes from the agent and pass them back to the user.

leftwo requested a review from jmpesp on October 13, 2023 00:19
@leftwo (Contributor, Author) commented on Oct 13, 2023

Now Fail is a valid state to then destroy.
No Omicron side changes needed.
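
As a toy illustration of that change (not the agent's real datafile logic; the State enum and can_destroy below are stand-ins), destroying a region is now permitted from the Failed state too, so the unwinding disk-create saga can clean up a region whose creation ran out of space:

enum State {
    Requested,
    Tombstoned,
    Failed,
    Destroyed,
}

fn can_destroy(state: State) -> bool {
    match state {
        // Before this change Failed was not accepted here, so a region whose
        // dataset creation failed could never be cleaned up by the unwinding
        // disk-create saga.
        State::Requested | State::Tombstoned | State::Failed => true,
        // Nothing left to do for a region that is already gone.
        State::Destroyed => false,
    }
}

fn main() {
    assert!(can_destroy(State::Failed));
    assert!(can_destroy(State::Requested) && can_destroy(State::Tombstoned));
    assert!(!can_destroy(State::Destroyed));
    println!("Failed -> Destroyed is now a valid transition");
}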

@@ -773,7 +762,10 @@ impl DataFile {
let region = region.unwrap();

match region.state {
State::Requested | State::Destroyed | State::Tombstoned => {
Contributor:

I don't think Nexus is doing the right thing if it's asking for snapshots from a failed region - where did the call come from?

Contributor Author:

As part of the region delete, it first checks for snapshots (agent/src/server.rs):

#[endpoint {
    method = DELETE,
    path = "/crucible/0/regions/{id}",
}]
async fn region_delete(
    rc: RequestContext<Arc<DataFile>>,
    path: TypedPath<RegionPath>,
) -> SResult<HttpResponseDeleted, HttpError> {
    let p = path.into_inner();

    // Cannot delete a region that's backed by a ZFS dataset if there are
    // snapshots.
    let snapshots = match rc.context().get_snapshots_for_region(&p.id) {
        Ok(results) => results,
        Err(e) => {
            return Err(HttpError::for_internal_error(e.to_string()));
        }
    };

Without that, the delete is refused because we get an error back from the get_snapshots call.

leftwo merged commit b7a6856 into main on Oct 13, 2023
18 checks passed
leftwo deleted the alan/unwrapless branch on October 13, 2023 15:26