
Accept notifications from Crucible #5135

Merged: 29 commits merged into oxidecomputer:main from crucible_repair_status_reports on Mar 14, 2024

Conversation

@jmpesp (Contributor) commented on Feb 23, 2024:

Allow any Upstairs to notify Nexus about the start or completion (plus status) of live repairs. The motivation for this was to be used in the final stage of region replacement to notify Nexus that the replacement has finished, but more generally this can be used to keep track of how many times repair occurs for each region.

Fixes #5120

@leftwo (Contributor) left a comment:

I have some questions about this, to make sure I understand it :)

The plan is that we will be storing information about a volume under repair
in a new table in the database? But I'm not clear on how we connect this table
back to a specific volume.

I do think we should not call this LiveRepair.. and just call it Repair.. everywhere.
We can re-use all of this for the same mechanism that happens when an upstairs
first starts and needs to reconcile the three downstairs with each other.

It seems like it would not be too difficult to add an update or status endpoint
that could report to nexus the progress of a repair. Maybe we want to put a
column in the database now to allow us to start using it later?

Also, thinking about migration cases, do we want another column in the
database to indicate why? I've not given this much thought yet, and I don't
think this particular table is going to be exposed to the user. I'm just thinking
about future us trying to determine why a Repair was started. Having some
fields in the database to leave us breadcrumbs might be helpful.
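(For illustration only: a table along the lines being discussed here might look roughly like the sketch below. Every name and column is an assumption made up for the example, including the progress fields suggested above; it is not the schema this PR actually adds.)

```sql
-- Hypothetical sketch only, not the schema this PR adds.
CREATE TYPE repair_type AS ENUM ('live', 'reconciliation');

CREATE TABLE example_repair (
    repair_id UUID NOT NULL,           -- shared by every record for one repair
    repair_type repair_type NOT NULL,  -- live repair vs. startup reconciliation
    upstairs_id UUID NOT NULL,         -- the Upstairs that reported it
    region_id UUID NOT NULL,           -- the region being repaired
    time_started TIMESTAMPTZ NOT NULL,
    time_completed TIMESTAMPTZ,        -- NULL while the repair is still running
    current_item INT8,                 -- progress fields, usable later
    total_items INT8,
    PRIMARY KEY (repair_id, upstairs_id, region_id)
);
```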

Inline review threads on:
- nexus/db-model/src/live_repair.rs (4 threads)
- nexus/db-queries/src/db/datastore/volume.rs
- nexus/src/app/volume.rs
- schema/crdb/37.0.0/up02.sql
- uuid-kinds/src/lib.rs
@jmpesp (Contributor, Author) commented on Feb 27, 2024:

> The plan is that we will be storing information about a volume under repair in a new table in the database? But I'm not clear on how we connect this table back to a specific volume.

There's going to be a region replacement request that can be tied back to the Volume by checking the connection addresses in the sub volumes. For read-write region replacements, the region under replacement can only be tied back to one volume, and because of the current region allocation strategy each region is allocated on a separate sled.

Note this is to work around us not storing the region ID in the VCR, and because we have to assume that the region's Crucible agent will not be accessible.
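(As a rough illustration of that lookup, hypothetical only: the table, column, and address below are assumptions, not necessarily what Nexus uses. The idea is to search the stored volume construction requests for the replaced region's target address.)

```sql
-- Hypothetical: find the volume whose construction request (stored as JSON
-- text in `data`) references the target address of the region being replaced.
SELECT id
FROM volume
WHERE data LIKE '%[fd00:1122:3344:101::5]:19000%'
  AND time_deleted IS NULL;
```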

> I do think we should not call this LiveRepair.. and just call it Repair.. everywhere. We can re-use all of this for the same mechanism that happens when an upstairs first starts and needs to reconcile the three downstairs with each other.

Agreed, done in ff08b00

> It seems like it would not be too difficult to add an update or status endpoint that could report to nexus the progress of a repair. Maybe we want to put a column in the database now to allow us to start using it later?

👍, done in 0cd8601

> Also, thinking about migration cases, do we want another column in the database to indicate why? I've not given this much thought yet, and I don't think this particular table is going to be exposed to the user. I'm just thinking about future us trying to determine why a Repair was started. Having some fields in the database to leave us breadcrumbs might be helpful.

I'm also going to think more about this tonight.

@leftwo (Contributor) commented on Feb 27, 2024:

From the update meeting today, I think for sure we want to treat the initial reconciliation the same
way as we do a LiveRepair, as that is the path for replacement we will have to take for a disk that
is not attached to any instance.

In that situation we will have to spin up a pantry (either before or after the replacement, probably
before) and then let the pantry do the initial reconciliation, sending status to Nexus when all regions
are consistent and then the pantry would shut itself down.

@leftwo (Contributor) left a comment:

I know I asked for it, but now I have a complication for you.

When doing an initial reconciliation, it's possible that all three downstairs have
different extents that need correction. Unlikely, but possible.

How would we represent this? By having all three regions report they are being
repaired? I think that can work, but it may look strange from the outside.

Inline review threads on:
- nexus/db-model/src/schema.rs
- nexus/db-model/src/upstairs_repair.rs
- nexus/db-queries/src/db/datastore/volume.rs
- schema/crdb/37.0.0/up01.sql
@jmpesp (Contributor, Author) commented on Feb 28, 2024:

> When doing an initial reconciliation, it's possible that all three downstairs have different extents that need correction. Unlikely, but possible.

> How would we represent this? By having all three regions report they are being repaired? I think that can work, but it may look strange from the outside.

This would look like three Reconciliation records all with the same repair id, which we can then tell is this unlikely scenario :)
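(Illustration, using the hypothetical table sketched earlier: such a reconciliation would show up as multiple rows sharing one repair id, which a simple grouping query can surface.)

```sql
-- Hypothetical: repair ids under which more than one region is being repaired,
-- e.g. a reconciliation touching all three downstairs.
SELECT repair_id, count(*) AS regions_under_repair
FROM example_repair
GROUP BY repair_id
HAVING count(*) > 1;
```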

@jmpesp (Contributor, Author) commented on Feb 28, 2024:

> I'm just thinking about future us trying to determine why a Repair was started. Having some fields in the database to leave us breadcrumbs might be helpful.

Back to this - I was thinking about what a breadcrumb would get us beyond the repair type enum we now have.

If it's Live, then there are two reasons for starting live repair (that I know of): receiving an IO error for a write / flush / extent repair op from a Downstairs, or the queue of live work exceeding IO_OUTSTANDING_MAX.

If it's Reconciliation, then that's when extents don't match at startup.

We could plumb up ClientStopReason?

@leftwo (Contributor) commented on Feb 29, 2024:

>> I'm just thinking about future us trying to determine why a Repair was started. Having some fields in the database to leave us breadcrumbs might be helpful.

> Back to this - I was thinking about what a breadcrumb would get us beyond the repair type enum we now have.

> If it's Live, then there are two reasons for starting live repair (that I know of): receiving an IO error for a write / flush / extent repair op from a Downstairs, or the queue of live work exceeding IO_OUTSTANDING_MAX.

> If it's Reconciliation, then that's when extents don't match at startup.

> We could plumb up ClientStopReason?

Yeah, I guess we would have to make up types for each possible reason.

@leftwo (Contributor) commented on Mar 1, 2024:

>> Back to this - I was thinking about what a breadcrumb would get us beyond the repair type enum we now have.
>> If it's Live, then there are two reasons for starting live repair (that I know of): receiving an IO error for a write / flush / extent repair op from a Downstairs, or the queue of live work exceeding IO_OUTSTANDING_MAX.
>> If it's Reconciliation, then that's when extents don't match at startup.
>> We could plumb up ClientStopReason?

> Yeah, I guess we would have to make up types for each possible reason.

Oh, if Nexus tells the Upstairs that it has to replace a downstairs due to a pending upgrade on that
sled, that would be another reason. The Upstairs should know the difference though between a
LiveRepair that started as a result of a replacement, and a LiveRepair that started because there was
an error.
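(Pulling together the reasons mentioned in this thread, a breadcrumb column could be backed by an enum along these lines. The type and value names are made up for the sketch and are not what this PR defines.)

```sql
-- Hypothetical enum recording why a repair was started; names are made up.
CREATE TYPE repair_start_reason AS ENUM (
    'io_error',                  -- a write/flush/extent-repair op returned an error
    'too_many_outstanding_jobs', -- live work exceeded IO_OUTSTANDING_MAX
    'reconciliation',            -- extents did not match at startup
    'requested_replacement'      -- Nexus asked for a downstairs to be replaced
);
```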

@leftwo (Contributor) left a comment:

We may need to fine tune this as we get further down the path with ds replacement,
but it seems like this is what we think we need now. If we need to update it later
we can do that.

Inline review threads on:
- nexus/db-model/src/schema.rs
- nexus/db-queries/src/db/datastore/volume.rs
- nexus/tests/integration_tests/volume_management.rs
Quoted schema snippet under review:

    'incompatible',
    'failed_live_repair',
    'too_many_outstanding_jobs',
    'deactivated'
leftwo (Contributor) commented:

Do we need a timeout choice too? It could be either a connection closed from the remote side, or just nothing and we gave up on it.

jmpesp (Contributor, Author) replied:

I think so, yeah - ClientStopReason is recorded now but not if the client task has a ClientRunResult. I'll do that now.
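(If a timeout value is wanted, it amounts to one more value in that stop-reason enum. A hypothetical sketch of the shape of such a change, assuming a CockroachDB user-defined enum and a made-up type name; pre-merge it could equally just be added to the CREATE TYPE in the migration itself.)

```sql
-- Hypothetical: extend the (assumed) stop-reason enum with a timeout variant.
-- CockroachDB supports adding values to an existing user-defined enum.
ALTER TYPE downstairs_client_stop_reason_type ADD VALUE 'timed_out';
```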

jmpesp (Contributor, Author) replied:

df79ff2 adds separate endpoints for when an Upstairs requests a stop, and when a Downstairs client task stops.

@jmpesp changed the title from "Accept live repair status reports from Crucible" to "Accept notifications from Crucible" on Mar 12, 2024
@jmpesp enabled auto-merge (squash) on March 14, 2024 at 15:53
@jmpesp merged commit 2406d9d into oxidecomputer:main on Mar 14, 2024
17 checks passed
@jmpesp deleted the crucible_repair_status_reports branch on March 22, 2024
Successfully merging this pull request may close: Accept repair status reports from Crucible

2 participants