Accept notifications from Crucible #5135

Conversation
Allow any Upstairs to notify Nexus about the start or completion (plus status) of live repairs. The immediate motivation is the final stage of region replacement, where Nexus needs to hear that the replacement has finished, but more generally this can be used to keep track of how many times repair occurs for each region. Fixes oxidecomputer#5120
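For illustration, here is a minimal sketch of what such a notification payload could look like, assuming serde, uuid, and chrono are available; RepairFinishInfo and DownstairsUnderRepair are placeholder names, not necessarily the types this PR adds.

```rust
// Illustrative sketch only: these names are assumptions, not the PR's types.
use serde::{Deserialize, Serialize};
use uuid::Uuid;

#[derive(Debug, Serialize, Deserialize)]
pub struct DownstairsUnderRepair {
    /// The region being repaired.
    pub region_uuid: Uuid,
    /// The address the Upstairs uses to reach that region.
    pub target_addr: std::net::SocketAddrV6,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct RepairFinishInfo {
    /// When the repair finished, per the Upstairs clock.
    pub time: chrono::DateTime<chrono::Utc>,
    /// Correlates this finish notification with the earlier start one.
    pub repair_id: Uuid,
    /// Each downstairs that was under repair.
    pub repairs: Vec<DownstairsUnderRepair>,
    /// True if the repair was aborted instead of completing.
    pub aborted: bool,
}
```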
I have some questions about this, to make sure I understand it :)

The plan is that we will be storing information about a volume under repair in a new table in the database? But I'm not clear on how we connect this table back to a specific volume.

I do think we should not call this LiveRepair and just call it Repair everywhere. We can re-use all of this for the same mechanism that happens when an upstairs first starts and needs to reconcile the three downstairs with each other.

It seems like it would not be too difficult to add an update or status endpoint that could report to nexus the progress of a repair. Maybe we want to put a column in the database now to allow us to start using it later?

Also, thinking about migration cases, do we want another column in the database to indicate why? I've not given this much thought yet, and I don't think this particular table is going to be exposed to the user. I'm just thinking about future us trying to determine why a Repair was started. Having some fields in the database to leave us breadcrumbs might be helpful.
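One possible shape for such a row, pulling together the type enum, breadcrumb, and progress columns discussed above; every name here is an assumption, not what the PR ultimately landed:

```rust
use uuid::Uuid;

/// Whether this was a live repair or an initial reconciliation.
pub enum UpstairsRepairType {
    /// Live repair of a downstairs while the volume stays attached.
    Live,
    /// Reconciliation of the three downstairs at first activation.
    Reconciliation,
}

/// Sketch of a repair record row (names are placeholders).
pub struct UpstairsRepairRecord {
    pub repair_id: Uuid,
    pub region_id: Uuid,
    pub repair_type: UpstairsRepairType,
    /// Breadcrumb for future us: why this repair started.
    pub reason: Option<String>,
    /// Room for a later update/status endpoint to report progress.
    pub current_extent: Option<u64>,
    pub extent_count: Option<u64>,
}
```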
There's going to be a region replacement request that can be tied back to the Volume by checking the connection addresses in the sub volumes. For read-write region replacements, the region under replacement can only be tied back to one volume, and because of the current region allocation strategy each region is allocated on a separate sled. Note this is to work around us not storing the region ID in the VCR, and because we have to assume that the region's Crucible agent will not be accessible.
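As a sketch of that address-matching idea, with a deliberately pared-down VCR type (the real VolumeConstructionRequest has more variants than shown here):

```rust
use std::net::SocketAddr;

// Simplified stand-in for a volume construction request.
enum Vcr {
    Volume {
        sub_volumes: Vec<Vcr>,
        read_only_parent: Option<Box<Vcr>>,
    },
    Region {
        targets: Vec<SocketAddr>,
    },
}

/// Walk the sub-volumes and check whether any region's target list
/// contains the address of the region under replacement.
fn references_target(vcr: &Vcr, addr: &SocketAddr) -> bool {
    match vcr {
        Vcr::Volume { sub_volumes, read_only_parent } => {
            sub_volumes.iter().any(|sv| references_target(sv, addr))
                || read_only_parent
                    .as_deref()
                    .map_or(false, |p| references_target(p, addr))
        }
        Vcr::Region { targets } => targets.contains(addr),
    }
}
```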
Agreed, done in ff08b00
👍, done in 0cd8601
I'm also going to think more about this tonight.
From the update meeting today, I think for sure we want to treat the initial reconciliation the same. In that situation we will have to spin up a pantry (either before or after the replacement, probably …)
I know I asked for it, but now I have a complication for you.

When doing an initial reconciliation, it's possible that all three downstairs have different extents that need correction. Unlikely, but possible. How would we represent this? By having all three regions report they are being repaired? I think that can work, but it may look strange from the outside. This would look like three …
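For what it's worth, "all three regions report they are being repaired" could still hang together if the three records share one repair ID; purely illustrative:

```rust
use uuid::Uuid;

fn main() {
    // Purely illustrative: one reconciliation touching all three
    // downstairs, recorded as three rows that share a repair ID.
    let repair_id = Uuid::new_v4();
    let regions = [Uuid::new_v4(), Uuid::new_v4(), Uuid::new_v4()];
    let rows: Vec<(Uuid, Uuid, &str)> = regions
        .iter()
        .map(|region_id| (repair_id, *region_id, "reconciliation"))
        .collect();
    assert_eq!(rows.len(), 3);
}
```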
Back to this - I was thinking about what a breadcrumb would get us beyond now having the repair type enum. If it's … If it's … We could plumb up …
Yeah, I guess we would have to make up types for each possible reason.
Oh, if Nexus tells the Upstairs that it has to replace a downstairs due to a pending upgrade on that …
We may need to fine-tune this as we get further down the path with ds replacement, but it seems like this is what we think we need now. If we need to update it later we can do that.
schema/crdb/41.0.0/up05.sql
'incompatible',
'failed_live_repair',
'too_many_outstanding_jobs',
'deactivated'
Do we need a timeout choice too? It could be either a connection closed from the remote side, or just nothing and we gave up on it.
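If we did add it, the enum might grow something like the following. Sketch only: the existing variants mirror the CRDB values in this file, while the two timeout flavors are exactly the open question above, not something this PR defines.

```rust
// Sketch: the CRDB enum values above as a Rust enum, plus hypothetical
// variants for the two timeout flavors being discussed.
pub enum DownstairsClientStopReason {
    Incompatible,
    FailedLiveRepair,
    TooManyOutstandingJobs,
    Deactivated,
    /// Hypothetical: the remote side closed the connection.
    ConnectionClosed,
    /// Hypothetical: no response at all, and we gave up on it.
    Timeout,
}
```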
I think so, yeah - ClientStopReason is recorded now, but not if the client task has a ClientRunResult. I'll do that now.
df79ff2 adds separate endpoints for when an Upstairs requests a stop, and when a Downstairs client task stops.
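Not the actual df79ff2 code, but a sketch of what "separate endpoints" can look like in dropshot; the paths, body types, and server context here are all stand-ins:

```rust
use dropshot::{
    endpoint, HttpError, HttpResponseUpdatedNoContent, RequestContext,
    TypedBody,
};
use schemars::JsonSchema;
use serde::Deserialize;

/// Stand-in context type; the real one lives in the PR.
struct ServerContext;

#[derive(Deserialize, JsonSchema)]
struct DownstairsClientStopRequest {
    reason: String,
}

#[derive(Deserialize, JsonSchema)]
struct DownstairsClientStopped {
    reason: String,
}

/// The Upstairs requested that a Downstairs client task stop.
#[endpoint { method = POST, path = "/downstairs/client/stop-request" }]
async fn downstairs_client_stop_request(
    rqctx: RequestContext<ServerContext>,
    body: TypedBody<DownstairsClientStopRequest>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    let _ = (rqctx, body.into_inner());
    Ok(HttpResponseUpdatedNoContent())
}

/// A Downstairs client task stopped on its own and reports why.
#[endpoint { method = POST, path = "/downstairs/client/stopped" }]
async fn downstairs_client_stopped(
    rqctx: RequestContext<ServerContext>,
    body: TypedBody<DownstairsClientStopped>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    let _ = (rqctx, body.into_inner());
    Ok(HttpResponseUpdatedNoContent())
}
```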