Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry on transient error when waiting for snapshot to be ready #2508

Merged
merged 5 commits into from
Dec 8, 2023

Conversation

hairyhum
Copy link
Contributor

@hairyhum hairyhum commented Dec 1, 2023

Change Overview

CSI snapshot controller might add errors to the snapshot status which it will recover from.

Make snapshotter.WaitOnReadyToUse retry (up to 100 times) on those errors. Backoff mechanism makes it so 100 retries is minutes, hopefuly should be enough for most cases.

Unfortunately CSI snapshotter uses strings to inform of error reason and does not provide error code or type when reporting, hence for now we use regexp to match on transient error. If CSI snapshotter uses better error format in the future, we can also change that.

Pull request type

Please check the type of change your PR introduces:

  • 🚧 Work in Progress
  • 🌈 Refactoring (no functional changes, no api changes)
  • 🐹 Trivial/Minor
  • 🐛 Bugfix
  • 🌻 Feature
  • 🗺️ Documentation
  • 🤖 Test

Test Plan

  • 💪 Manual
  • ⚡ Unit test
  • 💚 E2E

Follow up

Need to run some integration tests and monitor future issues in case we need to adjust the backoff retries count.
Monitor kubernetes-csi/external-snapshotter#748 as those improvements could reduce the chance of errors (probably won't eliminate them completely though)
Monitor kubernetes-csi/external-snapshotter#970 if there are improvements to the format of error

CSI snapshot controller might add errors to the snapshot status which it will
recover from.

Make snapshotter.WaitOnReadyToUse retry (up to 100 times) on those errors.
Backoff mechanism makes it so 100 retries is minutes, hopefuly should be enough for
most cases.

Unfortunately CSI snapshotter uses strings to inform of error reason and
does not provide error code or type when reporting, hence for now we use regexp to
match on transient error. If CSI snapshotter uses better error format in the future,
we can also change that.
pkg/kube/snapshot/snapshot_alpha.go Outdated Show resolved Hide resolved
pkg/kube/snapshot/snapshot_stable.go Show resolved Hide resolved
pkg/kube/snapshot/snapshot.go Outdated Show resolved Hide resolved
return false, err
}
// Error can be set while waiting for creation
if vs.Status.Error != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to confirm that the events are in VolumeSnapshot resource right, and not in the VolumeSnapshotContent resource?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I don't think we read VolumeSnapshotContent status anywhere in our code. But status errors are propagated by snapshot_controller to VolumeSnapshot if there is an error in VolumeSnapshotContent as I understand.

@viveksinghggits
Copy link
Contributor

@hairyhum do we know that this error eventually always goes away and VolumeSnapshot gets into the ReadyToUse state?

@hairyhum
Copy link
Contributor Author

hairyhum commented Dec 4, 2023

@hairyhum do we know that this error eventually always goes away and VolumeSnapshot gets into the ReadyToUse state?

For that particular error - yes, it is recoverable eventually, other errors may be not. But there is also a retry limit in case shapshot controller fails to recover from that one.

@hairyhum hairyhum force-pushed the snapshot-transient-error branch from 26e4eab to 9f33440 Compare December 4, 2023 17:57
@hairyhum hairyhum force-pushed the snapshot-transient-error branch from 9f33440 to 026debb Compare December 4, 2023 19:27
isReadyFunc func(*unstructured.Unstructured) (bool, error),
) error {
retries := 100
return poll.WaitWithRetries(ctx, retries, isTransientError, func(context.Context) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Is this the right poll to use here? Why is 100 retries the correct value?

It appears to me that there is currently only 1 other file where Kanister code uses poll.WaitWithRetries.

It is more common to use poll.Wait, sometimes with the ctx set to have a timeout. Or to call poll.WaitWithBackoff and a maximum time set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do that. The difference here is that we only retry on this specific type of error, while poll.WaitWithBackoff would not distinguish that.
The 100 value is pretty arbitrary just to avoid the infinite wait. We could use poll.WaitWithBackoffWithRetries to configure max time to wait as well, but I don't know what's the good value in this context.

Current setting for the context timeout is infinity because there's a long-running process. Inserting a new time limit here would mean potentially creating a new failure scenario. Setting retries would worst case result in similar behaviour to what we have right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @pavannd1 indicated we can check this in as is and iterate on the polling mechanism-- this will improve robustness as is.

We can retry only on a specific error when using poll.Wait or poll.WaitWithBackoff by making the check for that error within the function called by poll. See for example how an apierrors.IsNotFound error is handled in RepoServiceHandler.createService.

See

err = poll.WaitWithBackoff(ctx, backoff.Backoff{

Yes, there are some subtle differences between terminating a poll with a timeout or by retry count-- in particular whether ctx expiration can affect a call made within the poll loop. But I don't think those differences matter when the call being made is just a Get.

Ultimately the retry count is going to be picked based on a some knowledge of how long it is reasonable to wait for the snapshot controller to resolve the transient. In my mind it is clearer to just express that as a time value instead of trying to calculate the retry count from desired time and the backoff parameters.

See the many instances of poll.Wait(timeoutCtx,... in pkg/app where an app-specific timeout is known and used. Or pkg/kube.WaitForPodReady.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon further reflection, I've realized that the code as implemented has the following behavior:

  • Will wait indefinitely for the controller to update the status to be complete or some form of error.
  • Once it encounters a transient error it will poll a bounded number of times (with default backoff parameters)
  • If a non-transient error is encountered it will return an error immediately.

Importantly, setting a timeout on the context at beginning of all polling will affect the first wait, the wait for controller to ready snapshot before setting any error, transient or not.

While I still think it would be better to have an explicit time limit on retrying of transients, and we probably could work out a way to build it into a retry function closure, at that point it is no longer simple and clear.

Best solution is probably to keep retry count on trasients and add a comment about expected time that count corresponds to using default backoff.

Copy link
Contributor

@pavannd1 pavannd1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor improvements suggested. We can merge and iterate on the polling mechanism.

// Helpers to work with volume snapshot status used in TestWaitOnReadyToUse
// ----------------------------------------------------------------------------

func waitOnReadyToUseInBackground(c *C, ctx context.Context, fakeSs snapshot.Snapshotter, snapshotName string, namespace string, timeout time.Duration) chan error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func waitOnReadyToUseInBackground(c *C, ctx context.Context, fakeSs snapshot.Snapshotter, snapshotName string, namespace string, timeout time.Duration) chan error {
func waitOnReadyToUseInBackground(
c *C,
ctx context.Context,
fakeSs snapshot.Snapshotter,
snapshotName,
namespace string,
timeout time.Duration,
) chan error {

[optional] formatting for readability

return reply
}

func waitOnReadyToUseWithTimeout(c *C, ctx context.Context, fakeSs snapshot.Snapshotter, snapshotName string, namespace string, timeout time.Duration) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func waitOnReadyToUseWithTimeout(c *C, ctx context.Context, fakeSs snapshot.Snapshotter, snapshotName string, namespace string, timeout time.Duration) error {
func waitOnReadyToUseWithTimeout(
c *C,
ctx context.Context,
fakeSs snapshot.Snapshotter,
snapshotName,
namespace string,
timeout time.Duration,
) error {

return err
}

func setReadyStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func setReadyStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string) {
func setReadyStatus(
c *C,
dynCli *dynfake.FakeDynamicClient,
volumeSnapshotGVR schema.GroupVersionResource,
snapshotName,
namespace string,
) {

setVolumeSnapshotStatus(c, dynCli, volumeSnapshotGVR, snapshotName, namespace, status)
}

func setErrorStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string, message string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func setErrorStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string, message string) {
func setErrorStatus(
c *C,
dynCli *dynfake.FakeDynamicClient,
volumeSnapshotGVR schema.GroupVersionResource,
snapshotName,
namespace,
message string,
) {

setVolumeSnapshotStatus(c, dynCli, volumeSnapshotGVR, snapshotName, namespace, status)
}

func setVolumeSnapshotStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string, status map[string]interface{}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func setVolumeSnapshotStatus(c *C, dynCli *dynfake.FakeDynamicClient, volumeSnapshotGVR schema.GroupVersionResource, snapshotName string, namespace string, status map[string]interface{}) {
func setVolumeSnapshotStatus(
c *C,
dynCli *dynfake.FakeDynamicClient,
volumeSnapshotGVR schema.GroupVersionResource,
snapshotName,
namespace string,
status map[string]interface{},
) {

Copy link
Contributor

@ewhamilton ewhamilton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll approve as is-- it's better than what we have and correct if not ideal.

I do think switching to an explicit time limit on the poll will be a significant improvement in understandability and maintainability.

Note: See other note-- any switch to time should only be for time limit on transient error, not time limit waiting for initial response by controller.

@hairyhum hairyhum added the kueue label Dec 8, 2023
@mergify mergify bot merged commit 4aefd25 into master Dec 8, 2023
15 checks passed
@mergify mergify bot deleted the snapshot-transient-error branch December 8, 2023 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants