
Retries are not limited #18

Open
metadaddy opened this issue Nov 7, 2024 · 3 comments

Comments

@metadaddy
Member

Currently, if a B2 API call returns an HTTP error status code that indicates a transient error, such as 429 or 503, there is no limit to the number of retries that will be attempted. The app calling the API will hang until the operation returns a non-transient status such as 200 or 404.

The issue is in the withBackoff() function:

func withBackoff(ctx context.Context, ri beRootInterface, f func() error) error {
	backoff := 500 * time.Millisecond
	for { // no terminating condition: loops until err is non-transient or ctx is cancelled
		err := f()
		if !ri.transient(err) {
			return err
		}
		bo := ri.backoff(err)
		if bo > 0 {
			backoff = bo
		} else {
			backoff = getBackoff(backoff)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-after(backoff):
		}
	}
}

Notice that the for loop has no terminating condition. There needs to be a configurable maximum number of retries so that operations can fail if the service is unavailable.

Note: I have looked at fixing this in the past. At the 'micro' level, it is straightforward to make this function do the right thing. The issues seem to arise in the Writer with multipart uploads, where multiple parts can be in flight concurrently.

@metadaddy
Member Author

I just pushed #19 - it contains a test function, TestWriterLimitNewUploadAttempts, that I wrote. It may be useful.

@mlech-reef
Collaborator

mlech-reef commented Nov 10, 2024

Hi. I just started looking into it and wanted to clarify a few things before I start making any changes to the code.

There needs to be a configurable maximum number of retries so that operations can fail if the service is unavailable.

What would be the desired default value here?

Note: I have looked at fixing this in the past. At the 'micro' level, it is straightforward to make this function do the right thing. The issues seem to arise in the Writer with multipart uploads, where multiple parts can be in flight concurrently.

I don't understand why concurrency is an issue here. If we have, e.g., a maxRetries flag somewhere, and it has a value of, e.g., 3, then each concurrent thread will retry 3 times before it fails. Isn't that the case?

Thank you for the clarification.

Best regards,
Maciek

@metadaddy
Member Author

Hi @mlech-reef -

There needs to be a configurable maximum number of retries so that operations can fail if the service is unavailable.

What would be the desired default value here?

Looking at the B2 SDK for Python, upload/download operations are retried up to 20 times, while other operations are retried up to 5 times.

I don't understand why concurrency is an issue here. If we have, e.g., a maxRetries flag somewhere, and it has a value of, e.g., 3, then each concurrent thread will retry 3 times before it fails. Isn't that the case?

My recollection is that, when an upload failed after 3 retries, I was trying to cancel all the other 'in-flight' uploads, to 'fail fast'. I was probably making it too complicated!
