-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry on transient error when waiting for snapshot to be ready #2508
Conversation
CSI snapshot controller might add errors to the snapshot status which it will recover from. Make snapshotter.WaitOnReadyToUse retry (up to 100 times) on those errors. Backoff mechanism makes it so 100 retries is minutes, hopefuly should be enough for most cases. Unfortunately CSI snapshotter uses strings to inform of error reason and does not provide error code or type when reporting, hence for now we use regexp to match on transient error. If CSI snapshotter uses better error format in the future, we can also change that.
Move retry logic to snapshot.go
return false, err | ||
} | ||
// Error can be set while waiting for creation | ||
if vs.Status.Error != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to confirm that the events are in VolumeSnapshot
resource right, and not in the VolumeSnapshotContent
resource?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I don't think we read VolumeSnapshotContent
status anywhere in our code. But status errors are propagated by snapshot_controller to VolumeSnapshot
if there is an error in VolumeSnapshotContent
as I understand.
@hairyhum do we know that this error eventually always goes away and |
For that particular error - yes, it is recoverable eventually, other errors may be not. But there is also a retry limit in case shapshot controller fails to recover from that one. |
26e4eab
to
9f33440
Compare
9f33440
to
026debb
Compare
isReadyFunc func(*unstructured.Unstructured) (bool, error), | ||
) error { | ||
retries := 100 | ||
return poll.WaitWithRetries(ctx, retries, isTransientError, func(context.Context) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Is this the right poll to use here? Why is 100 retries the correct value?
It appears to me that there is currently only 1 other file where Kanister code uses poll.WaitWithRetries.
It is more common to use poll.Wait
, sometimes with the ctx set to have a timeout. Or to call poll.WaitWithBackoff
and a maximum time set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do that. The difference here is that we only retry on this specific type of error, while poll.WaitWithBackoff
would not distinguish that.
The 100
value is pretty arbitrary just to avoid the infinite wait. We could use poll.WaitWithBackoffWithRetries
to configure max time to wait as well, but I don't know what's the good value in this context.
Current setting for the context timeout is infinity because there's a long-running process. Inserting a new time limit here would mean potentially creating a new failure scenario. Setting retries would worst case result in similar behaviour to what we have right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @pavannd1 indicated we can check this in as is and iterate on the polling mechanism-- this will improve robustness as is.
We can retry only on a specific error when using poll.Wait or poll.WaitWithBackoff by making the check for that error within the function called by poll. See for example how an apierrors.IsNotFound
error is handled in RepoServiceHandler.createService.
See
err = poll.WaitWithBackoff(ctx, backoff.Backoff{ |
Yes, there are some subtle differences between terminating a poll with a timeout or by retry count-- in particular whether ctx expiration can affect a call made within the poll loop. But I don't think those differences matter when the call being made is just a Get.
Ultimately the retry count is going to be picked based on a some knowledge of how long it is reasonable to wait for the snapshot controller to resolve the transient. In my mind it is clearer to just express that as a time value instead of trying to calculate the retry count from desired time and the backoff parameters.
See the many instances of poll.Wait(timeoutCtx,...
in pkg/app where an app-specific timeout is known and used. Or pkg/kube.WaitForPodReady.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further reflection, I've realized that the code as implemented has the following behavior:
- Will wait indefinitely for the controller to update the status to be complete or some form of error.
- Once it encounters a transient error it will poll a bounded number of times (with default backoff parameters)
- If a non-transient error is encountered it will return an error immediately.
Importantly, setting a timeout on the context at beginning of all polling will affect the first wait, the wait for controller to ready snapshot before setting any error, transient or not.
While I still think it would be better to have an explicit time limit on retrying of transients, and we probably could work out a way to build it into a retry function closure, at that point it is no longer simple and clear.
Best solution is probably to keep retry count on trasients and add a comment about expected time that count corresponds to using default backoff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor improvements suggested. We can merge and iterate on the polling mechanism.
pkg/kube/snapshot/snapshot_test.go
Outdated
// Helpers to work with volume snapshot status used in TestWaitOnReadyToUse | ||
// ---------------------------------------------------------------------------- | ||
|
||
func waitOnReadyToUseInBackground(c *C, ctx context.Context, fakeSs snapshot.Snapshotter, snapshotName string, namespace string, timeout time.Duration) chan error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func waitOnReadyToUseInBackground(c *C, ctx context.Context, fakeSs snapshot.Snapshotter, snapshotName string, namespace string, timeout time.Duration) chan error { | |
func waitOnReadyToUseInBackground( | |
c *C, | |
ctx context.Context, | |
fakeSs snapshot.Snapshotter, | |
snapshotName, | |
namespace string, | |
timeout time.Duration, | |
) chan error { |
[optional] formatting for readability
pkg/kube/snapshot/snapshot_test.go
Outdated
return reply | ||
} | ||
|
||
func waitOnReadyToUseWithTimeout(c *C, ctx context.Context, fakeSs snapshot.Snapshotter, snapshotName string, namespace string, timeout time.Duration) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func waitOnReadyToUseWithTimeout(c *C, ctx context.Context, fakeSs snapshot.Snapshotter, snapshotName string, namespace string, timeout time.Duration) error { | |
func waitOnReadyToUseWithTimeout( | |
c *C, | |
ctx context.Context, | |
fakeSs snapshot.Snapshotter, | |
snapshotName, | |
namespace string, | |
timeout time.Duration, | |
) error { |
pkg/kube/snapshot/snapshot_test.go
Outdated
return err | ||
} | ||
|
||
func setReadyStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func setReadyStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string) { | |
func setReadyStatus( | |
c *C, | |
dynCli *dynfake.FakeDynamicClient, | |
volumeSnapshotGVR schema.GroupVersionResource, | |
snapshotName, | |
namespace string, | |
) { |
pkg/kube/snapshot/snapshot_test.go
Outdated
setVolumeSnapshotStatus(c, dynCli, volumeSnapshotGVR, snapshotName, namespace, status) | ||
} | ||
|
||
func setErrorStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string, message string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func setErrorStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string, message string) { | |
func setErrorStatus( | |
c *C, | |
dynCli *dynfake.FakeDynamicClient, | |
volumeSnapshotGVR schema.GroupVersionResource, | |
snapshotName, | |
namespace, | |
message string, | |
) { |
pkg/kube/snapshot/snapshot_test.go
Outdated
setVolumeSnapshotStatus(c, dynCli, volumeSnapshotGVR, snapshotName, namespace, status) | ||
} | ||
|
||
func setVolumeSnapshotStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string, status map[string]interface{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func setVolumeSnapshotStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string, status map[string]interface{}) { | |
func setVolumeSnapshotStatus( | |
c *C, | |
dynCli *dynfake.FakeDynamicClient, | |
volumeSnapshotGVR schema.GroupVersionResource, | |
snapshotName, | |
namespace string, | |
status map[string]interface{}, | |
) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll approve as is-- it's better than what we have and correct if not ideal.
I do think switching to an explicit time limit on the poll will be a significant improvement in understandability and maintainability.
Note: See other note-- any switch to time should only be for time limit on transient error, not time limit waiting for initial response by controller.
Change Overview
CSI snapshot controller might add errors to the snapshot status which it will recover from.
Make snapshotter.WaitOnReadyToUse retry (up to 100 times) on those errors. Backoff mechanism makes it so 100 retries is minutes, hopefuly should be enough for most cases.
Unfortunately CSI snapshotter uses strings to inform of error reason and does not provide error code or type when reporting, hence for now we use regexp to match on transient error. If CSI snapshotter uses better error format in the future, we can also change that.
Pull request type
Please check the type of change your PR introduces:
Test Plan
Follow up
Need to run some integration tests and monitor future issues in case we need to adjust the backoff retries count.
Monitor kubernetes-csi/external-snapshotter#748 as those improvements could reduce the chance of errors (probably won't eliminate them completely though)
Monitor kubernetes-csi/external-snapshotter#970 if there are improvements to the format of error