
Metrics stop exporting at seemingly random times every week or so #5729

Open
ladenedge opened this issue Jun 29, 2024 · 17 comments
Labels
metrics (Metrics signal related) · question (Further information is requested)

Comments

@ladenedge

What is the question?

We're using metrics extensively in our .NET 8 application which is deployed to a dozen or so servers. Once in a while -- say, once a week -- a single server will stop exporting to our OTel Collector. The other servers continue to work fine, and restarting our application fixes the problem.

  • What might be going wrong here?
  • Are there any configuration options that might help?
  • How can I debug/diagnose this problem?

Thank you for any help!

Additional context

Application is .NET 8 with the following OTel packages:

<PackageReference Include="OpenTelemetry.Contrib.Instrumentation.AWS" Version="1.0.2" />
<PackageReference Include="OpenTelemetry.Exporter.Console" Version="1.8.1" />
<PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.8.1" />
<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.8.1" />
<PackageReference Include="OpenTelemetry.Instrumentation.EventCounters" Version="1.5.1-alpha.1" />
<PackageReference Include="OpenTelemetry.Instrumentation.GrpcNetClient" Version="1.6.0-beta.2" />
<PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.8.1" />
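
For context, here is a hedged sketch of how packages like these are typically wired into a .NET 8 host; the meter name, endpoint, and builder type are illustrative assumptions, not details taken from this issue:

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using OpenTelemetry.Metrics;

var builder = Host.CreateApplicationBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("MyCompany.MyApp")        // hypothetical custom meter name
        .AddHttpClientInstrumentation()     // from OpenTelemetry.Instrumentation.Http
        .AddEventCountersInstrumentation()  // from OpenTelemetry.Instrumentation.EventCounters
        .AddOtlpExporter(o => o.Endpoint = new Uri("https://myendpoint.com:4317/")));

builder.Build().Run();
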
@ladenedge ladenedge added the question (Further information is requested) label Jun 29, 2024
@cijothomas
Member

Please see if you can get internal logs (Warning and above) https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry/README.md#self-diagnostics

Are you missing all metrics from the server, or just a subset? (There is a metric cardinality cap implemented that could explain the behavior, but if every metric stops at the same time, that is unlikely to be the cause.)
Are the metrics missing from the Console exporter too?

@cijothomas cijothomas added the metrics (Metrics signal related) label Jul 1, 2024
@ladenedge
Author

Okay, we found a newly failing server and the diagnostic log helped a great deal. It appears we are timing out:

Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")

Here's the full entry, repeated every minute:

2024-07-04T12:43:18.7570236Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")
   at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at OpenTelemetry.Proto.Collector.Metrics.V1.MetricsService.MetricsServiceClient.Export(ExportMetricsServiceRequest request, CallOptions options)
   at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcMetricsExportClient.SendExportRequest(ExportMetricsServiceRequest request, DateTime deadlineUtc, CancellationToken cancellationToken)}

So now I have two questions:

  1. Is there some way to modify the timeout on the internal OTel gRPC calls?
  2. Why might this be happening after a week or so? Once we do timeout, is this likely to get worse due to the queuing of new metrics? (Ie. is there any chance of recovering from this condition?)

Thank you again for your help!

@cijothomas
Member

Is there some way to modify the timeout on the internal OTel gRPC calls?

For gRPC, there is not much option to customize today. There are open issues/PRs for related settings that would allow exposing this, e.g. #2009.

Why might this be happening after a week or so? Once we do timeout, is this likely to get worse due to the queuing of new metrics? (Ie. is there any chance of recovering from this condition?)

I don't think any queuing up occurs today. If a batch is lost due to a gRPC timeout, it's not saved for retry later; instead, the next batch is tried.
(@vishweshbankwar to keep me honest here.)

@vishweshbankwar
Member

vishweshbankwar commented Jul 8, 2024

Is there some way to modify the timeout on the internal OTel gRPC calls?

You can increase the timeout period by setting TimeoutMilliseconds
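
For reference, a minimal sketch of raising it on the metrics OTLP exporter; the 30-second value and the surrounding builder calls are illustrative (builder as in a typical Host.CreateApplicationBuilder setup):

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddOtlpExporter(exporterOptions =>
        {
            // default is 10000 ms (10 s); the export deadline is derived from this value
            exporterOptions.TimeoutMilliseconds = 30000;
        }));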

If a batch is lost due to grpc timeout, its not saved for retry later, instead the next batch is tried.

This is correct, even if retries are enabled. See #5436.

@cijothomas
Member

Is there some way to modify the timeout on the internal OTel gRPC calls?

You can increase the timeout period by setting TimeoutMilliseconds

Thanks! If OTel's export timeout is increased, will it actually be applied to the timeout used by the GrpcClient itself?

@vishweshbankwar
Member

Is there some way to modify the timeout on the internal OTel gRPC calls?
You can increase the timeout period by setting TimeoutMilliseconds

Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?

Yes - it is used for setting the deadline of the gRPC call we make here.

@cijothomas
Member

Thanks! Looks like #1735 is still open, which states we don't really enforce the timeouts, but I could be wrong. (Or it's only for traces!)

@ladenedge
Author

You can increase the timeout period by setting TimeoutMilliseconds

Great info! However, it looks like the default is 10s, so it worries me that we're exceeding that -- especially if each call only includes the latest metrics. I could increase it to 20s or 30s, but I wonder if I'm just doing something wrong. Do you have any suggestions for diagnosing why my sends are exceeding 10s, or just how much data I'm sending? Or is this more likely a network issue between the servers and the collector?

Thanks again, and feel free to close this issue if you feel you've provided all the info you can!
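
One hedged way to gauge how much metric data each export cycle carries is to temporarily add the Console exporter (already referenced in the package list) alongside OTLP and inspect its output; a sketch, assuming the hosting setup shown earlier:

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddConsoleExporter()   // dumps every collected batch to stdout for inspection
        .AddOtlpExporter());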

@cijothomas
Member

especially if each call only includes the latest metrics.

This is only true if using Delta. If using Cumulative, then everything from the start is always exported... Are you using Delta or Cumulative?
Also, are there many Observable instruments with callbacks that could potentially take a lot of time?

@ladenedge
Author

We are indeed using Delta mode:

reader.TemporalityPreference = MetricReaderTemporalityPreference.Delta;
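
For completeness, a hedged sketch of where that line typically sits, assuming the AddOtlpExporter overload that exposes MetricReaderOptions (variable names are illustrative):

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddOtlpExporter((exporterOptions, reader) =>
        {
            // Delta: each export carries only measurements taken since the previous export
            reader.TemporalityPreference = MetricReaderTemporalityPreference.Delta;
        }));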

We have a handful of observables. You're suggesting that the time it takes to observe those metrics must be accounted for in the gRPC deadline? That's interesting. We've tried to make those calls quick, but it's certainly something we could take a closer look at -- that could also explain why our servers never recover from this condition.

Any other ideas are most welcome, and thank you again for all the help!

@vishweshbankwar
Member

@ladenedge - Just to confirm, you don't have retries enabled, correct?

It's odd that once the server hits DeadlineExceeded, it is not able to recover from that and continues to throw that error until restarted.

@ladenedge
Author

I assume you're talking about retries via the HttpClient? If so, then no, I'm using the default factory.

@ladenedge
Author

Also, to follow up on the observables: are observables actually queried during the exporter's network call? Looking over our handful of observable counters, they appear quick (e.g. Dictionary.Count) -- but is it even possible that they contribute to the missed deadline?
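
For illustration only (the meter, dictionary, and instrument names are hypothetical), observable callbacks of the kind described above take this cheap form, shown here as a gauge:

using System.Collections.Concurrent;
using System.Diagnostics.Metrics;

static class AppMetrics
{
    static readonly Meter AppMeter = new("MyCompany.MyApp");
    static readonly ConcurrentDictionary<string, object> Sessions = new();

    // the callback only reads a dictionary count, so it returns almost instantly
    static readonly ObservableGauge<int> ActiveSessions =
        AppMeter.CreateObservableGauge("app.active_sessions", () => Sessions.Count);
}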

@cijothomas
Member

are observables actually queried during the exporter's network call?

No. (If you were using Prometheus scraping, then the observable callbacks would run in response to the scrape request, so they would contribute to the response time of the scrape itself.) In short, observable callbacks are not at play in your case, since you are using a push exporter. (Sorry I confused you with the mention of observables :( )

@g3david405

I am facing the same issue.

@ladenedge
Author

For what it's worth, we are increasing our timeout to 30s to see if that makes any difference. (But this change won't be deployed for a week or so, so.. not very helpful yet, heh.)

@ladenedge
Author

I'd like to try to resurrect this issue because we continue to face this problem.

To recap, we have a pool of 15 servers that live in three datacenters and AWS. Regardless of the location, they will sometimes stop writing metrics, and they are unable to recover. Restarting our app (a Windows service) resolves the problem -- for a while.

OTEL diagnostics show a DeadlineExceeded error, repeated ad nauseam:

2024-11-22T14:37:16.7591358Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")
   at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at OpenTelemetry.Proto.Collector.Metrics.V1.MetricsService.MetricsServiceClient.Export(ExportMetricsServiceRequest request, CallOptions options)
   at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcMetricsExportClient.SendExportRequest(ExportMetricsServiceRequest request, DateTime deadlineUtc, CancellationToken cancellationToken)}

This occurs seemingly at random, and not necessarily at high-traffic periods. Here, for example, is a recent look at a metric that stopped writing during a fairly low-traffic period:

[screenshot: metric from the faulty server, stopping during a low-traffic period]

And here is a shot of another server from the same datacenter over the same period:

[screenshot: the same metric from a healthy server in the same datacenter over the same period]

Also of interest: whatever is happening does not impact traces or logs. Here is a shot of the faulty server's traces over the same period:

[screenshot: traces from the faulty server over the same period, unaffected]

Likewise, logs come in without issue over the same period, though we are using Serilog's OTEL Sink, so it's fairly separate.

Since my last update, we have raised our timeout to 30s:

exporterOptions.TimeoutMilliseconds = 30000;

This timeout change does not seem to have made any difference.

Some conclusions:

  • We are writing to an OTEL Collector in AWS, but resetting the Collector does not resolve the problem. (Only resetting the Windows service seems to resolve it.)
  • It does not appear to be related to traffic or load, since the fault sometimes occurs at low traffic/load.
  • It does not appear to be related to our network, since other servers on the same network continue to work fine. Moreover, traces and logs from the faulty server also continue to work fine.

Does anyone have any other suggestions for debugging or working around this issue?
