
Accept notifications from Crucible #5135

Merged: 29 commits merged into oxidecomputer:main from crucible_repair_status_reports on Mar 14, 2024

Conversation

@jmpesp (Contributor) commented on Feb 23, 2024:

Allow any Upstairs to notify Nexus about the start or completion (plus status) of live repairs. The motivation for this was to be used in the final stage of region replacement to notify Nexus that the replacement has finished, but more generally this can be used to keep track of how many times repair occurs for each region.

Fixes #5120

@leftwo (Contributor) left a comment:

I have some questions about this, to make sure I understand it :)

The plan is that we will be storing information about a volume under repair
in a new table in the database? But I'm not clear on how we connect this table
back to a specific volume.

I do think we should not call this LiveRepair.. and just call it Repair.. everywhere.
We can re-use all of this for the same mechanism that happens when an upstairs
first starts and needs to reconcile the three downstairs with each other.

It seems like it would not be too difficult to add an update or status endpoint
that could report to nexus the progress of a repair. Maybe we want to put a
column in the database now to allow us to start using it later?

Also, thinking about migration cases, do we want another column in the
database to indicate why? I've not given this much thought yet, and I don't
think this particular table is going to be exposed to the user. I'm just thinking
about future us trying to determine why a Repair was started. Having some
fields in the database to leave us breadcrumbs might be helpful.
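(For illustration only: a table along the lines being discussed here might look roughly like the sketch below. Every name and column is an assumption made up for the example, including the progress fields suggested above; it is not the schema this PR actually adds.)

```sql
-- Hypothetical sketch only, not the schema this PR adds.
CREATE TYPE repair_type AS ENUM ('live', 'reconciliation');

CREATE TABLE example_repair (
    repair_id UUID NOT NULL,           -- shared by every record for one repair
    repair_type repair_type NOT NULL,  -- live repair vs. startup reconciliation
    upstairs_id UUID NOT NULL,         -- the Upstairs that reported it
    region_id UUID NOT NULL,           -- the region being repaired
    time_started TIMESTAMPTZ NOT NULL,
    time_completed TIMESTAMPTZ,        -- NULL while the repair is still running
    current_item INT8,                 -- progress fields, usable later
    total_items INT8,
    PRIMARY KEY (repair_id, upstairs_id, region_id)
);
```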

Inline review threads on:
- nexus/db-model/src/live_repair.rs (4 threads)
- nexus/db-queries/src/db/datastore/volume.rs
- nexus/src/app/volume.rs
- schema/crdb/37.0.0/up02.sql
- uuid-kinds/src/lib.rs
@jmpesp (Contributor, Author) commented on Feb 27, 2024:

> The plan is that we will be storing information about a volume under repair in a new table in the database? But I'm not clear on how we connect this table back to a specific volume.

There's going to be a region replacement request that can be tied back to the Volume by checking the connection addresses in the sub volumes. For read-write region replacements, the region under replacement can only be tied back to one volume, and because of the current region allocation strategy each region is allocated on a separate sled.

Note this is to work around us not storing the region ID in the VCR, and because we have to assume that the region's Crucible agent will not be accessible.
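(As a rough illustration of that lookup, hypothetical only: the table, column, and address below are assumptions, not necessarily what Nexus uses. The idea is to search the stored volume construction requests for the replaced region's target address.)

```sql
-- Hypothetical: find the volume whose construction request (stored as JSON
-- text in `data`) references the target address of the region being replaced.
SELECT id
FROM volume
WHERE data LIKE '%[fd00:1122:3344:101::5]:19000%'
  AND time_deleted IS NULL;
```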

> I do think we should not call this LiveRepair.. and just call it Repair.. everywhere. We can re-use all of this for the same mechanism that happens when an upstairs first starts and needs to reconcile the three downstairs with each other.

Agreed, done in ff08b00

> It seems like it would not be too difficult to add an update or status endpoint that could report to nexus the progress of a repair. Maybe we want to put a column in the database now to allow us to start using it later?

👍, done in 0cd8601

> Also, thinking about migration cases, do we want another column in the database to indicate why? I've not given this much thought yet, and I don't think this particular table is going to be exposed to the user. I'm just thinking about future us trying to determine why a Repair was started. Having some fields in the database to leave us breadcrumbs might be helpful.

I'm also going to think more about this tonight.

@leftwo (Contributor) commented on Feb 27, 2024:

From the update meeting today, I think for sure we want to treat the initial reconciliation the same
way as we do a LiveRepair, as that is the path for replacement we will have to take for a disk that
is not attached to any instance.

In that situation we will have to spin up a pantry (either before or after the replacement, probably
before) and then let the pantry do the initial reconciliation, sending status to Nexus when all regions
are consistent and then the pantry would shut itself down.

@leftwo (Contributor) left a comment:

I know I asked for it, but now I have a complication for you.

When doing an initial reconciliation, it's possible that all three downstairs have
different extents that need correction. Unlikely, but possible.

How would we represent this? By having all three regions report they are being
repaired? I think that can work, but it may look strange from the outside.

Inline review threads on:
- nexus/db-model/src/schema.rs
- nexus/db-model/src/upstairs_repair.rs
- nexus/db-queries/src/db/datastore/volume.rs
- schema/crdb/37.0.0/up01.sql
@jmpesp (Contributor, Author) commented on Feb 28, 2024:

> When doing an initial reconciliation, it's possible that all three downstairs have different extents that need correction. Unlikely, but possible.

> How would we represent this? By having all three regions report they are being repaired? I think that can work, but it may look strange from the outside.

This would look like three Reconciliation records all with the same repair id, which we can then tell is this unlikely scenario :)
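(Illustration, using the hypothetical table sketched earlier: such a reconciliation would show up as multiple rows sharing one repair id, which a simple grouping query can surface.)

```sql
-- Hypothetical: repair ids under which more than one region is being repaired,
-- e.g. a reconciliation touching all three downstairs.
SELECT repair_id, count(*) AS regions_under_repair
FROM example_repair
GROUP BY repair_id
HAVING count(*) > 1;
```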

@jmpesp (Contributor, Author) commented on Feb 28, 2024:

> I'm just thinking about future us trying to determine why a Repair was started. Having some fields in the database to leave us breadcrumbs might be helpful.

Back to this - I was thinking about what a breadcrumb would get us beyond the repair type enum we now have.

If it's Live, then there are two reasons for starting live repair (that I know of): receiving an IO error for a write / flush / extent repair op from a Downstairs, or the queue of live work exceeding IO_OUTSTANDING_MAX.

If it's Reconciliation, then that's when extents don't match at startup.

We could plumb up ClientStopReason?

@leftwo (Contributor) commented on Feb 29, 2024:

>> I'm just thinking about future us trying to determine why a Repair was started. Having some fields in the database to leave us breadcrumbs might be helpful.

> Back to this - I was thinking about what a breadcrumb would get us beyond the repair type enum we now have.

> If it's Live, then there are two reasons for starting live repair (that I know of): receiving an IO error for a write / flush / extent repair op from a Downstairs, or the queue of live work exceeding IO_OUTSTANDING_MAX.

> If it's Reconciliation, then that's when extents don't match at startup.

> We could plumb up ClientStopReason?

Yeah, I guess we would have to make up types for each possible reason.

@leftwo (Contributor) commented on Mar 1, 2024:

>> Back to this - I was thinking about what a breadcrumb would get us beyond the repair type enum we now have.
>> If it's Live, then there are two reasons for starting live repair (that I know of): receiving an IO error for a write / flush / extent repair op from a Downstairs, or the queue of live work exceeding IO_OUTSTANDING_MAX.
>> If it's Reconciliation, then that's when extents don't match at startup.
>> We could plumb up ClientStopReason?

> Yeah, I guess we would have to make up types for each possible reason.

Oh, if Nexus tells the Upstairs that it has to replace a downstairs due to a pending upgrade on that
sled, that would be another reason. The Upstairs should know the difference though between a
LiveRepair that started as a result of a replacement, and a LiveRepair that started because there was
an error.
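(Pulling together the reasons mentioned in this thread, a breadcrumb column could be backed by an enum along these lines. The type and value names are made up for the sketch and are not what this PR defines.)

```sql
-- Hypothetical enum recording why a repair was started; names are made up.
CREATE TYPE repair_start_reason AS ENUM (
    'io_error',                  -- a write/flush/extent-repair op returned an error
    'too_many_outstanding_jobs', -- live work exceeded IO_OUTSTANDING_MAX
    'reconciliation',            -- extents did not match at startup
    'requested_replacement'      -- Nexus asked for a downstairs to be replaced
);
```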

@leftwo (Contributor) left a comment:

We may need to fine tune this as we get further down the path with ds replacement,
but it seems like this is what we think we need now. If we need to update it later
we can do that.

Inline review threads on:
- nexus/db-model/src/schema.rs
- nexus/db-queries/src/db/datastore/volume.rs
- nexus/tests/integration_tests/volume_management.rs
Quoted schema snippet under review:

    'incompatible',
    'failed_live_repair',
    'too_many_outstanding_jobs',
    'deactivated'
leftwo (Contributor) commented:

Do we need a timeout choice too? It could be either a connection closed from the remote side, or just nothing and we gave up on it.

jmpesp (Contributor, Author) replied:

I think so, yeah - ClientStopReason is recorded now but not if the client task has a ClientRunResult. I'll do that now.
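(If a timeout value is wanted, it amounts to one more value in that stop-reason enum. A hypothetical sketch of the shape of such a change, assuming a CockroachDB user-defined enum and a made-up type name; pre-merge it could equally just be added to the CREATE TYPE in the migration itself.)

```sql
-- Hypothetical: extend the (assumed) stop-reason enum with a timeout variant.
-- CockroachDB supports adding values to an existing user-defined enum.
ALTER TYPE downstairs_client_stop_reason_type ADD VALUE 'timed_out';
```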

jmpesp (Contributor, Author) replied:

df79ff2 adds separate endpoints for when an Upstairs requests a stop, and when a Downstairs client task stops.

@jmpesp changed the title from "Accept live repair status reports from Crucible" to "Accept notifications from Crucible" on Mar 12, 2024
@jmpesp enabled auto-merge (squash) on March 14, 2024 at 15:53
@jmpesp merged commit 2406d9d into oxidecomputer:main on Mar 14, 2024
17 checks passed
@jmpesp deleted the crucible_repair_status_reports branch on March 22, 2024
Successfully merging this pull request may close: Accept repair status reports from Crucible

2 participants