
add exporter retry configuration #97

Open
brettmc wants to merge 7 commits into main
Conversation

@brettmc (Contributor) commented Jun 14, 2024

https://opentelemetry.io/docs/specs/otlp/#otlphttp-throttling and https://opentelemetry.io/docs/specs/otlp/#otlpgrpc-throttling describe how a client SHOULD implement an exponential backoff strategy (with jitter) in case of retryable export failures. The inputs to this strategy are usually the initial delay, and max attempts.
Also added to zipkin exporter, since it could also be implemented there.
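
For illustration, a minimal sketch of what such a strategy can look like given those two inputs (the retry_with_backoff helper, its defaults, and the export callback are hypothetical, not part of this PR):

import random
import time

def retry_with_backoff(export, initial_delay: float = 1.0, max_attempts: int = 5) -> bool:
    """Call export() until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        if export():
            return True
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter: sleep a random duration
            # in [0, initial_delay * 2**attempt) before the next attempt.
            time.sleep(random.uniform(0, initial_delay * 2**attempt))
    return False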

@brettmc requested a review from a team on June 14, 2024 06:22
@@ -166,6 +166,12 @@
"type": "integer",
"minimum": 0
},
"insecure": {
Contributor Author
moved insecure up to be consistent with common.json/Otlp

@codeboten (Contributor) left a comment

Thanks @brettmc for the PR! I wonder how consistent the implementations of this retry are. In Go the options are presented slightly differently:

	// Enabled indicates whether to not retry sending batches in case of
	// export failure.
	Enabled bool
	// InitialInterval the time to wait after the first failure before
	// retrying.
	InitialInterval time.Duration
	// MaxInterval is the upper bound on backoff interval. Once this value is
	// reached the delay between consecutive retries will always be
	// `MaxInterval`.
	MaxInterval time.Duration
	// MaxElapsedTime is the maximum amount of time (including retries) spent
	// trying to send a request/batch.  Once this value is reached, the data
	// is discarded.
	MaxElapsedTime time.Duration

And in Python, the only option is a max interval:

from itertools import count
from typing import Iterator

def _create_exp_backoff_generator(max_value: int = 0) -> Iterator[int]:
    """
    Generates an infinite sequence of exponential backoff values. The sequence starts
    from 1 (2^0) and doubles each time (2^1, 2^2, 2^3, ...). If a max_value is specified
    and non-zero, the generated values will not exceed this maximum, capping at max_value
    instead of growing indefinitely.

    Parameters:
    - max_value (int, optional): The maximum value to yield. If 0 or not provided, the
      sequence grows without bound.
    """
    # Yield 1, 2, 4, 8, ..., capped at max_value when max_value is non-zero.
    for i in count(0):
        out = 2**i
        yield min(out, max_value) if max_value else out
@brettmc (Contributor Author) commented Jul 10, 2024

I wonder how consistent the implementations of this retry are

It looks like they're not at all consistent, and I just found open-telemetry/opentelemetry-specification#3639, which describes it quite well:

The implementation of the exponential backoff strategy is not specified

It seems like the root of the issue then is that we cannot configure retry consistently across SIGs if we haven't agreed on the inputs. I can think of a couple of possible next steps:

  • collate all inputs in use by all of our implementations of retry policy, and allow all of those fields to be part of retry config (with SDKs choosing the fields they are interested in). It would work, but the same file-based configuration will lead to different behaviour across SIGs, which is not really in the spirit of this repo.
  • go back to the spec and try to come up with a specification for how retry should be configured, which everybody can agree on

@svrnm (Member) commented Jul 10, 2024

go back to the spec and try to come up with a specification for how retry should be configured, which everybody can agree on

This sounds like the best path forward, with the other option being the fallback. @brettmc would you mind raising the spec issue, with

  • a reference to this issue
  • a reference to how PHP and Go do it

FYI, this is something related to open-telemetry/opentelemetry-specification#4083

@jack-berg (Member)

To add to @codeboten's analysis of existing implementations, here is Java's set of options:

  • maxAttempts
  • initialBackoff
  • maxBackoff
  • backoffMultiplier

These options mirror gRPC's built in retry mechanism, since gRPC was very influential in the development of OTLP and it seemed reasonable that OTLP clients using the gRPC version of the protocol would utilize the gRPC clients' built in mechanism.
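
For illustration, a sketch of how these four gRPC-style inputs typically combine to produce the delay sequence (the function name and defaults are illustrative, not taken from the Java SDK; jitter is omitted):

def backoff_schedule(max_attempts: int = 5,
                     initial_backoff: float = 1.0,
                     max_backoff: float = 5.0,
                     backoff_multiplier: float = 1.5):
    """Yield the nominal delay before each retry; the first attempt has no delay."""
    delay = initial_backoff
    for _ in range(max_attempts - 1):
        yield delay
        # Grow the delay geometrically, capped at max_backoff.
        delay = min(delay * backoff_multiplier, max_backoff)

For example, with the defaults above this yields 1.0, 1.5, 2.25, 3.375 seconds across the four retries.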

One concrete thing we can / should do while we sort out how these options are configurable across languages is simply give the ability to enable / disable retry. Regardless of the options, every language should support some retry mechanism since the spec requires it in clear terms:

Transient errors MUST be handled with a retry strategy. This retry strategy MUST implement an exponential back-off with jitter to avoid overwhelming the destination until the network is restored or the destination has recovered.

We could do this by providing a top level retry_enabled property for the OTLP exporter type:

otlp:
  endpoint: http://localhost:4318
  retry_enabled: true
  # add in the future
  # retry_options:

Later when we add specific options we could add a separate retry_options type.

Or we could define a new retry type, with only an enabled property to start, with the intent to expand later when we decide on the options:

otlp:
  endpoint: http://localhost:4318
  retry:
    enabled: true
    # add additional options in the future

@brettmc (Contributor Author) commented Jul 10, 2024

Or we could define a new retry type, with only an enabled property to start, with the intent to expand later when we decide on the options

I like this one, combining all retry-related options into one type which we can expand on.

@jack-berg (Member)

I like this one, combining all retry-related options into one type which we can expand on.

👍 This sounds good to me. Do you have interest in updating this PR to that effect? Or did you want to wait until the specific options are worked out at the spec level (which could be some time)?

@brettmc (Contributor Author) commented Jul 16, 2024

Do you have interest in updating this PR to that effect?

Updated to only implement retry.enabled (boolean, default true). The rest of the configuration will need to wait until there is agreement on what inputs are required.

# Configure retry policy
retry:
  # Configure retry enabled
  enabled: true
Member

If we buy my argument that retry should be enabled by default, then we should rename the property to disabled to align with spec naming conventions for booleans:

All Boolean environment variables SHOULD be named and defined such that false is the expected safe default behavior.

# Configure retry policy.
retry:
  # Configure retry disabled.
  disabled: false
Member

What do you think of @carlosalberto's suggestion to have a strategy field, defaulting to exponential_retry and supporting none?

Contributor Author

I really like it, and I think it could resolve my issues. If we allow for future retry strategies, including custom/3rd party strategies, then SIGs have the freedom to provide their own configuration to suit their specific implementation.
If the official exponential retry strategy is locked down in the spec, SIGs can migrate across to it at their leisure (w.r.t. using it with file-based configuration).

How about treating retry policy similarly to how we treat a Sampler: its configuration is optional and defaults to exponential_retry, at most one can be provided, and custom is allowed? Something like:

retry:
  disabled: false
  <policy-name>:
    property: value

disabled could actually be replaced by a noop/none/noretry policy, which might make the interface between an exporter and the retry policy a little cleaner.
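
For illustration only, a hypothetical sketch of that interface, where a no-retry policy stands in for disabled so the exporter never branches on a boolean (all names here are illustrative, not from any SDK):

from typing import Iterator

class RetryPolicy:
    def delays(self) -> Iterator[float]:
        """Yield the delay in seconds before each retry attempt."""
        raise NotImplementedError

class NoRetry(RetryPolicy):
    def delays(self) -> Iterator[float]:
        # No retries: the exporter gives up after the first attempt.
        return iter(())

class ExponentialRetry(RetryPolicy):
    def __init__(self, initial_delay: float = 1.0, max_attempts: int = 5):
        self.initial_delay = initial_delay
        self.max_attempts = max_attempts

    def delays(self) -> Iterator[float]:
        # Doubling delays: initial_delay, 2*initial_delay, 4*initial_delay, ...
        return (self.initial_delay * 2**n for n in range(self.max_attempts - 1))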

@@ -12,6 +12,8 @@ exporters:
api-key: !!str 1234
compression: gzip
timeout: 10000
retry:
  disabled: false
Member
enabled seems to be easier: it avoids double negation

Member

The spec defines that options like this have to be disabled and not enabled. But I can't find it right now... so I may be wrong.


@jack-berg (Member)

@brettmc it doesn't seem like there's consensus to add the corresponding option in the spec (spec PR #4148). If you know of use cases where PHP users want to disable retry, please comment on that PR.

Else, it seems like we are blocked on this until specification issues are resolved.

@codeboten added the blocked:spec label (The issue is blocked on having spec language on the topic) on Aug 21, 2024