
add exporter retry configuration #97

Open
brettmc wants to merge 7 commits into main
Conversation

@brettmc (Contributor) commented Jun 14, 2024

https://opentelemetry.io/docs/specs/otlp/#otlphttp-throttling and https://opentelemetry.io/docs/specs/otlp/#otlpgrpc-throttling describe how a client SHOULD implement an exponential backoff strategy (with jitter) in case of retryable export failures. The inputs to this strategy are usually the initial delay, and max attempts.
Also added to zipkin exporter, since it could also be implemented there.
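
For illustration, a minimal sketch of what such a strategy can look like given those two inputs (the retry_with_backoff helper, its defaults, and the export callback are hypothetical, not part of this PR):

import random
import time

def retry_with_backoff(export, initial_delay: float = 1.0, max_attempts: int = 5) -> bool:
    """Call export() until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        if export():
            return True
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter: sleep a random duration
            # in [0, initial_delay * 2**attempt) before the next attempt.
            time.sleep(random.uniform(0, initial_delay * 2**attempt))
    return False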

@brettmc requested a review from a team on June 14, 2024 06:22
@@ -166,6 +166,12 @@
"type": "integer",
"minimum": 0
},
"insecure": {
Contributor Author
moved insecure up to be consistent with common.json/Otlp

@codeboten (Contributor) left a comment

Thanks @brettmc for the PR! I wonder how consistent the implementations of this retry are. In Go the options are presented slightly differently:

	// Enabled indicates whether to not retry sending batches in case of
	// export failure.
	Enabled bool
	// InitialInterval the time to wait after the first failure before
	// retrying.
	InitialInterval time.Duration
	// MaxInterval is the upper bound on backoff interval. Once this value is
	// reached the delay between consecutive retries will always be
	// `MaxInterval`.
	MaxInterval time.Duration
	// MaxElapsedTime is the maximum amount of time (including retries) spent
	// trying to send a request/batch.  Once this value is reached, the data
	// is discarded.
	MaxElapsedTime time.Duration

And in Python, the only option is a max interval:

from itertools import count
from typing import Iterator

def _create_exp_backoff_generator(max_value: int = 0) -> Iterator[int]:
    """
    Generates an infinite sequence of exponential backoff values. The sequence starts
    from 1 (2^0) and doubles each time (2^1, 2^2, 2^3, ...). If a max_value is specified
    and non-zero, the generated values will not exceed this maximum, capping at max_value
    instead of growing indefinitely.

    Parameters:
    - max_value (int, optional): The maximum value to yield. If 0 or not provided, the
      sequence grows without bound.
    """
    # Yield 1, 2, 4, 8, ..., capped at max_value when max_value is non-zero.
    for i in count(0):
        out = 2**i
        yield min(out, max_value) if max_value else out
@brettmc (Contributor Author) commented Jul 10, 2024

I wonder how consistent the implementations of this retry are

It looks like they're not at all consistent, and I just found open-telemetry/opentelemetry-specification#3639, which describes it quite well:

The implementation of the exponential backoff strategy is not specified

It seems like the root of the issue then is that we cannot configure retry consistently across SIGs if we haven't agreed on the inputs. I can think of a couple of possible next steps:

  • collate all inputs in use by all of our implementations of retry policy, and allow all of those fields to be part of retry config (with SDKs choosing the fields they are interested in). It would work, but the same file-based configuration will lead to different behaviour across SIGs, which is not really in the spirit of this repo.
  • go back to the spec and try to come up with a specification for how retry should be configured, which everybody can agree on

@svrnm (Member) commented Jul 10, 2024

go back to the spec and try to come up with a specification for how retry should be configured, which everybody can agree on

This sounds like the best path forward, with the other option being the fallback. @brettmc would you mind raising the spec issue, with

  • a reference to this issue
  • a reference to how PHP and Go do it

FYI, this is something related to open-telemetry/opentelemetry-specification#4083

@jack-berg (Member)

To add to @codeboten's analysis of existing implementations, here is Java's set of options:

  • maxAttempts
  • initialBackoff
  • maxBackoff
  • backoffMultiplier

These options mirror gRPC's built in retry mechanism, since gRPC was very influential in the development of OTLP and it seemed reasonable that OTLP clients using the gRPC version of the protocol would utilize the gRPC clients' built in mechanism.
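
For illustration, a sketch of how these four gRPC-style inputs typically combine to produce the delay sequence (the function name and defaults are illustrative, not taken from the Java SDK; jitter is omitted):

def backoff_schedule(max_attempts: int = 5,
                     initial_backoff: float = 1.0,
                     max_backoff: float = 5.0,
                     backoff_multiplier: float = 1.5):
    """Yield the nominal delay before each retry; the first attempt has no delay."""
    delay = initial_backoff
    for _ in range(max_attempts - 1):
        yield delay
        # Grow the delay geometrically, capped at max_backoff.
        delay = min(delay * backoff_multiplier, max_backoff)

For example, with the defaults above this yields 1.0, 1.5, 2.25, 3.375 seconds across the four retries.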

One concrete thing we can / should do while we sort out how these options are configurable across languages is simply give the ability to enable / disable retry. Regardless of the options, every language should support some retry mechanism since the spec requires it in clear terms:

Transient errors MUST be handled with a retry strategy. This retry strategy MUST implement an exponential back-off with jitter to avoid overwhelming the destination until the network is restored or the destination has recovered.

We could do this by providing a top level retry_enabled property for the OTLP exporter type:

otlp:
  endpoint: http://localhost:4318
  retry_enabled: true
  # add in the future
  # retry_options:

Later when we add specific options we could add a separate retry_options type.

Or we could define a new retry type, with only an enabled property to start, with the intent to expand later when we decide on the options:

otlp:
  endpoint: http://localhost:4318
  retry:
    enabled: true
    # add additional options in the future

@brettmc (Contributor Author) commented Jul 10, 2024

Or we could define a new retry type, with only an enabled property to start, with the intent to expand later when we decide on the options

I like this one, combining all retry-related options into one type which we can expand on.

@jack-berg (Member)

I like this one, combining all retry-related options into one type which we can expand on.

👍 This sounds good to me. Do you have interest in updating this PR to that effect? Or did you want to wait until the specific options are worked out at the spec level (which could be some time)?

@brettmc (Contributor Author) commented Jul 16, 2024

Do you have interest in updating this PR to that effect?

Updated to only implement retry.enabled (boolean, default true). The rest of the configuration will need to wait until there is agreement on what inputs are required.

# Configure retry policy
retry:
  # Configure retry enabled
  enabled: true
Member

If we buy my argument that retry should be enabled by default, then we should rename the property to disabled to align with spec naming conventions for booleans:

All Boolean environment variables SHOULD be named and defined such that false is the expected safe default behavior.

# Configure retry policy.
retry:
  # Configure retry disabled.
  disabled: false
Member

What do you think of @carlosalberto's suggestion to have a strategy field, defaulting to exponential_retry and supporting none?

Contributor Author

I really like it, and I think it could resolve my issues. If we allow for future retry strategies, including custom/3rd party strategies, then SIGs have the freedom to provide their own configuration to suit their specific implementation.
If the official exponential retry strategy is locked down in the spec, SIGs can migrate across to it at their leisure (w.r.t. using it with file-based configuration).

How about treating retry policy similarly to how we treat a Sampler: its configuration is optional and defaults to exponential_retry, at most one can be provided, and custom is allowed? Something like:

retry:
  disabled: false
  <policy-name>:
    property: value

disabled could actually be replaced by a noop/none/noretry policy, which might make the interface between an exporter and the retry policy a little cleaner.
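
For illustration only, a hypothetical sketch of that interface, where a no-retry policy stands in for disabled so the exporter never branches on a boolean (all names here are illustrative, not from any SDK):

from typing import Iterator

class RetryPolicy:
    def delays(self) -> Iterator[float]:
        """Yield the delay in seconds before each retry attempt."""
        raise NotImplementedError

class NoRetry(RetryPolicy):
    def delays(self) -> Iterator[float]:
        # No retries: the exporter gives up after the first attempt.
        return iter(())

class ExponentialRetry(RetryPolicy):
    def __init__(self, initial_delay: float = 1.0, max_attempts: int = 5):
        self.initial_delay = initial_delay
        self.max_attempts = max_attempts

    def delays(self) -> Iterator[float]:
        # Doubling delays: initial_delay, 2*initial_delay, 4*initial_delay, ...
        return (self.initial_delay * 2**n for n in range(self.max_attempts - 1))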

@@ -12,6 +12,8 @@ exporters:
api-key: !!str 1234
compression: gzip
timeout: 10000
retry:
  disabled: false
Member
enabled seems to be easier: it avoids double negation

Member

The spec defines that options like this have to be disabled and not enabled. But I can't find it right now... so I may be wrong.


@jack-berg (Member)

@brettmc it doesn't seem like there's consensus to add the corresponding option in the spec (spec PR #4148). If you know of use cases where PHP users want to disable retry, please comment on that PR.

Else, it seems like we are blocked on this until specification issues are resolved.

@codeboten added the blocked:spec label (The issue is blocked on having spec language on the topic) on Aug 21, 2024