Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accept repair status reports from Crucible #5120

Closed
jmpesp opened this issue Feb 22, 2024 · 3 comments · Fixed by #5135
Closed

Accept repair status reports from Crucible #5120

jmpesp opened this issue Feb 22, 2024 · 3 comments · Fixed by #5135
Assignees

Comments

@jmpesp
Copy link
Contributor

jmpesp commented Feb 22, 2024

Implement the server side of oxidecomputer/crucible#839 by accepting the following notifications:

  • live repair has started
  • live repair has completed either ok or with an error
@jmpesp jmpesp self-assigned this Feb 22, 2024
@leftwo
Copy link
Contributor

leftwo commented Feb 23, 2024

Crucible #839 was actually for notifying Nexus that the initial startup of an upstairs required
reconciliation, not LiveRepair.

However, the issue is really the same for Reconciliation as it is for LiveRepair, the difference is
just when we are doing the reporting.
Reconciliation: Before the upstairs goes active and starts allowing IOs.
LiveRepair: After the upstairs is active and has already been taking IOs.

I suspect the reporting part will still be the same between the two.

Another part we should also consider is if the Pantry is doing a Repair, that
path can also report back using the same mechanism.

jmpesp added a commit to jmpesp/omicron that referenced this issue Feb 23, 2024
Allow any Upstairs to notify Nexus about the start or completion (plus
status) of live repairs. The motivation for this was to be used in the
final stage of region replacement to notify Nexus that the replacement
has finished, but more generally this can be used to keep track of how
many times repair occurs for each region.

Fixes oxidecomputer#5120
@jmpesp
Copy link
Contributor Author

jmpesp commented Feb 23, 2024

I suspect the reporting part will still be the same between the two.

#5135 only accepts live repair status reports. I could also add a PR that accepts reconciliation status reports? It looks like the intention is to provide some sort of progress indication so that it can be exposed to users, otherwise they're just sitting there waiting?

Another part we should also consider is if the Pantry is doing a Repair, that path can also report back using the same mechanism.

Reports will come from all running Upstairs :)

@leftwo
Copy link
Contributor

leftwo commented Feb 23, 2024

I suspect the reporting part will still be the same between the two.

#5135 only accepts live repair status reports. I could also add a PR that accepts reconciliation status reports? It looks like the intention is to provide some sort of progress indication so that it can be exposed to users, otherwise they're just sitting there waiting?

Yes. With a notification only the user could know "we are working on it", but ideally some sort of progress notification as a possibility.

Another part we should also consider is if the Pantry is doing a Repair, that path can also report back using the same mechanism.

Reports will come from all running Upstairs :)

Progress reporting for LiveRepair is also probably worth doing, though maybe not quite as important as the user can still use the volume in question while it is being repaired.

jmpesp added a commit that referenced this issue Mar 14, 2024
Allow any Upstairs to notify Nexus about the start or completion (plus
status) of live repairs. The motivation for this was to be used in the
final stage of region replacement to notify Nexus that the replacement
has finished, but more generally this can be used to keep track of how
many times repair occurs for each region.

Also accept notifications for:

- when a downstairs client is requested to stop
- when a downstairs client stops

These will be used as breadcrumbs to determine when downstairs were
having problems, why repairs started in the first place, and more.

Fixes #5120
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants