-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add exporter retry configuration #97
base: main
Are you sure you want to change the base?
Conversation
https://opentelemetry.io/docs/specs/otlp/#otlphttp-throttling and https://opentelemetry.io/docs/specs/otlp/#otlpgrpc-throttling describe how a client SHOULD implement an exponential backoff strategy (with jitter) in case of retryable export failures. The inputs to this strategy are usually the initial delay, and max attempts.
@@ -166,6 +166,12 @@ | |||
"type": "integer", | |||
"minimum": 0 | |||
}, | |||
"insecure": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved insecure
up to be consistent with common.json/Otlp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @brettmc for the PR! I wonder how consistent the implementations of this retry are. In Go the options are presented slightly differently:
// Enabled indicates whether to not retry sending batches in case of
// export failure.
Enabled bool
// InitialInterval the time to wait after the first failure before
// retrying.
InitialInterval time.Duration
// MaxInterval is the upper bound on backoff interval. Once this value is
// reached the delay between consecutive retries will always be
// `MaxInterval`.
MaxInterval time.Duration
// MaxElapsedTime is the maximum amount of time (including retries) spent
// trying to send a request/batch. Once this value is reached, the data
// is discarded.
MaxElapsedTime time.Duration
And in Python, the only option is a max interval:
def _create_exp_backoff_generator(max_value: int = 0) -> Iterator[int]:
"""
Generates an infinite sequence of exponential backoff values. The sequence starts
from 1 (2^0) and doubles each time (2^1, 2^2, 2^3, ...). If a max_value is specified
and non-zero, the generated values will not exceed this maximum, capping at max_value
instead of growing indefinitely.
Parameters:
- max_value (int, optional): The maximum value to yield. If 0 or not provided, the
sequence grows without bound.
It looks like not at all consistent, and I just found open-telemetry/opentelemetry-specification#3639 which describes it quite well:
It seems like the root of the issue then is that we cannot configure retry consistently across SIGs if we haven't agreed on the inputs. I can think of a couple of possible next steps:
|
This sounds like the best path forward, with the other option being the fallback. @brettmc would you mind raising the spec issue, with
FYI, this is something related to open-telemetry/opentelemetry-specification#4083 |
To add to @codeboten analysis of existing implementations, here is java's set of options
These options mirror gRPC's built in retry mechanism, since gRPC was very influential in the development of OTLP and it seemed reasonable that OTLP clients using the gRPC version of the protocol would utilize the gRPC clients' built in mechanism. One concrete thing we can / should do while we sort out how these options are configurable across languages is simply give the ability to enable / disable retry. Regardless of the options, every langauge should support some retry mechanism since the spec requires it in clear terms:
We could do this by providing a top level
Later when we add specific options we could add a separate Or we could define a new
|
I like this one, combining all retry-related options into one type which we can expand on. |
👍 This sounds good to me. Do you have interest in updating this PR to that effect? Or did you want to wait until the specific options are worked out at the spec lever (which could be some time). |
Updated to only implement retry.enabled (boolean, default true). The rest of the configuration will need to wait until there is agreement on what inputs are required. |
examples/kitchen-sink.yaml
Outdated
# Configure retry policy | ||
retry: | ||
# Configure retry enabled | ||
enabled: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we buy my argument that enabled: true
by default, then we should rename it to disabled
to align with spec naming conventions for booleans:
All Boolean environment variables SHOULD be named and defined such that false is the expected safe default behavior.
# Configure retry policy. | ||
retry: | ||
# Configure retry disabled. | ||
disabled: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of @carlosalberto's suggestion to have a strategy
field, defaulting to exponential_retry
and supporting none
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like it, and I think it could resolve my issues. If we allow for future retry strategies, including custom/3rd party strategies, then SIGs have the freedom to provide their own configuration to suit their specific implementation.
If the official exponential retry strategy is locked down in spec, SIGs can migrate across to it at their leisure (w.r.t using it with file-based configuration).
How about treating retry policy similarly to how we treat a Sampler
- it's configuration is optional and defaults to exponential_retry
, at most one can be provided, and custom is allowed? Something like:
retry:
disabled: false
<policy-name>:
property: value
disabled
could actually be replaced by a noop
/none
/noretry
policy, which might make the interface between an exporter and the retry policy a little cleaner.
@@ -12,6 +12,8 @@ exporters: | |||
api-key: !!str 1234 | |||
compression: gzip | |||
timeout: 10000 | |||
retry: | |||
disabled: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enabled
seems to be easier - avoids double negation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spec defines that options like this have to be disabled
and not enabled
. But I can't find it right now... so I may be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, just saw Jack already linked to it earlier, https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/configuration/sdk-environment-variables.md#boolean-value
https://opentelemetry.io/docs/specs/otlp/#otlphttp-throttling and https://opentelemetry.io/docs/specs/otlp/#otlpgrpc-throttling describe how a client SHOULD implement an exponential backoff strategy (with jitter) in case of retryable export failures. The inputs to this strategy are usually the initial delay, and max attempts.
Also added to zipkin exporter, since it could also be implemented there.