Global error handler cleanup - Metrics SDK #2185

lalitb · 2024-10-08T23:56:40Z

Fixes #
Design discussion issue (if applicable) #

Changes

Please provide a brief description of the changes here.

Merge requirement checklist

CONTRIBUTING guidelines followed
Unit tests added/updated (if applicable)
Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
Changes in public API reviewed (if applicable)

opentelemetry/src/global/internal_logging.rs

codecov · 2024-10-09T00:00:09Z

Codecov Report

Attention: Patch coverage is 15.74074% with 91 lines in your changes missing coverage. Please review.

Project coverage is 78.9%. Comparing base (8bd529a) to head (115e73f).

Files with missing lines	Patch %	Lines
opentelemetry-sdk/src/metrics/meter.rs	21.2%	52 Missing ⚠️
opentelemetry-sdk/src/metrics/pipeline.rs	0.0%	12 Missing ⚠️
opentelemetry-sdk/src/metrics/view.rs	8.3%	11 Missing ⚠️
...-sdk/src/metrics/internal/exponential_histogram.rs	0.0%	9 Missing ⚠️
...pentelemetry-sdk/src/metrics/internal/aggregate.rs	0.0%	3 Missing ⚠️
opentelemetry-sdk/src/metrics/internal/mod.rs	50.0%	2 Missing ⚠️
opentelemetry-sdk/src/metrics/manual_reader.rs	0.0%	1 Missing ⚠️
opentelemetry-sdk/src/metrics/meter_provider.rs	0.0%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##            main   #2185     +/-   ##
=======================================
- Coverage   79.1%   78.9%   -0.3%     
=======================================
  Files        121     121             
  Lines      21171   21220     +49     
=======================================
- Hits       16760   16751      -9     
- Misses      4411    4469     +58

opentelemetry-sdk/src/metrics/pipeline.rs

opentelemetry-sdk/src/metrics/periodic_reader.rs

opentelemetry-sdk/src/metrics/internal/mod.rs

cijothomas · 2024-10-10T23:58:26Z

opentelemetry-sdk/src/metrics/internal/exponential_histogram.rs

-                opentelemetry::global::handle_error(MetricsError::Other(
-                    "exponential histogram scale underflow".into(),
-                ));
+                otel_error!(


I don't know the inner workings well enough to give a strong opinion, but unless this is an auto-recoverable error, this can flood the error log.

As far as I understand this part of the code, this error occurs with a restrictive max_size configuration while the application is recording measurements whose values are farther apart than max_size allows. The error would be logged whenever such a faulty measurement is recorded. If these faulty measurements are infrequent, the error log won't be flooded; otherwise it can be. Either some kind of throttling or a simple flag to log only once needs to be added. Let me know what you suggest; otherwise I can leave a TODO to revisit.

Unless we are 100% sure this cannot flood the logs, let's remove the log from here and leave a TODO to add logging once we understand more.
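The "flag to log only once" option mentioned above could look roughly like the following. This is a minimal sketch using only the standard library; `LogOnce` and `record_with_underflow` are hypothetical names for illustration and are not part of the SDK, and `eprintln!` stands in for the internal logging macro.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical helper (not an SDK type): lets a call site emit its log at most once.
pub struct LogOnce {
    logged: AtomicBool,
}

impl LogOnce {
    pub const fn new() -> Self {
        Self {
            logged: AtomicBool::new(false),
        }
    }

    /// Returns true only for the first caller; every later call returns false.
    pub fn should_log(&self) -> bool {
        // `swap` returns the previous value, so only the first call observes `false`.
        !self.logged.swap(true, Ordering::Relaxed)
    }
}

/// Stand-in for a recording path that hits the underflow condition.
/// Returns whether this particular call emitted the log.
pub fn record_with_underflow(gate: &LogOnce) -> bool {
    if gate.should_log() {
        eprintln!("exponential histogram scale underflow (further occurrences suppressed)");
        true
    } else {
        false
    }
}

fn main() {
    let gate = LogOnce::new();
    let emitted = (0..1000).filter(|_| record_with_underflow(&gate)).count();
    println!("logs emitted: {}", emitted);
}
```

A single atomic swap keeps the hot path cheap, at the cost of never logging again for that call site; a time-based throttle would be the alternative if repeated reminders are wanted.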

cijothomas · 2024-10-10T23:59:48Z

opentelemetry-sdk/src/metrics/manual_reader.rs

-                global::handle_error(MetricsError::Config(
-                    "duplicate reader registration, did not register manual reader".into(),
-                ))
+                otel_error!(name: "ManualReader.RegisterPipeline.DuplicateRegistration");


info/debug only. Even if a user gets this message, they won't know what to do. It's helpful to us only.

cijothomas · 2024-10-11T00:04:04Z

opentelemetry-sdk/src/metrics/meter.rs

@@ -74,7 +74,7 @@ impl SdkMeter {
    {
        let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
        if let Err(err) = validation_result {
-            global::handle_error(err);
+            otel_error!(name: "SdkMeter.CreateCounter.ValidationError", error = format!("{}", err));


Agreed that Error is the right severity for this, but it is still not very user-friendly.
Since this is user-facing, the internal log should be clearly actionable and should state the impact of the error.
In this case, the impact is that measurements reported using this instrument will be ignored.
The "why" is also not obvious.
As an end user, I'd prefer to see something like:

Name: InstrumentCreationFailed
InstrumentName: A_non_asccii_Metric
MeterName: MeterName
Reason: Instrument name contains non-ASCII characters. Link to spec, optionally.
Message: Measurements from this instrument will be ignored.
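To make the proposed fields concrete, here is a minimal std-only sketch that carries them in a struct and renders them in one line. `InstrumentCreationFailed` and `validate_name` are hypothetical names, and the check is deliberately simplified to the non-ASCII case from the example above; it is not the SDK's `validate_instrument_config`.

```rust
use std::fmt;

/// Hypothetical event payload mirroring the fields proposed in the comment above.
#[derive(Debug, PartialEq)]
pub struct InstrumentCreationFailed {
    pub instrument_name: String,
    pub meter_name: String,
    pub reason: String,
}

impl fmt::Display for InstrumentCreationFailed {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Name: InstrumentCreationFailed, InstrumentName: {}, MeterName: {}, Reason: {}, \
             Message: Measurements from this instrument will be ignored.",
            self.instrument_name, self.meter_name, self.reason
        )
    }
}

/// Simplified stand-in for instrument validation: it only rejects non-ASCII names.
pub fn validate_name(
    meter_name: &str,
    instrument_name: &str,
) -> Result<(), InstrumentCreationFailed> {
    if instrument_name.is_ascii() {
        Ok(())
    } else {
        Err(InstrumentCreationFailed {
            instrument_name: instrument_name.to_string(),
            meter_name: meter_name.to_string(),
            reason: "Instrument name contains non-ASCII characters.".to_string(),
        })
    }
}

fn main() {
    if let Err(event) = validate_name("my_meter", "latência") {
        // `eprintln!` stands in for the internal logging macro.
        eprintln!("{}", event);
    }
}
```

Keeping the reason and the impact as separate fields lets the log state both why creation failed and what the user loses as a result.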

cijothomas · 2024-10-11T00:05:12Z

opentelemetry-sdk/src/metrics/meter.rs

@@ -90,7 +90,7 @@ impl SdkMeter {
        {
            Ok(counter) => Ok(counter),
            Err(err) => {
-                global::handle_error(err);
+                otel_error!(name: "SdkMeter.CreateCounter.Error", error = format!("{}", err));


Same comment as https://github.com/open-telemetry/opentelemetry-rust/pull/2185/files#r1796240076
I don't think a user would know the difference between ValidationError and Error.

cijothomas · 2024-10-23T04:48:00Z

@lalitb Let us know if this is ready for another review. Unfortunately, it also has some conflicts to resolve, hopefully simple ones.

lalitb · 2024-10-23T16:38:12Z

Resolved the conflicts. Will go through the review comments now.

opentelemetry-sdk/src/metrics/internal/mod.rs

opentelemetry-sdk/src/metrics/internal/mod.rs

utpilla · 2024-10-23T22:17:10Z

opentelemetry-sdk/src/metrics/manual_reader.rs

+            }  else {
+                otel_debug!(
+                    name: "ManualReader.RegisterPipeline.DuplicateRegistration",
+                    error = "The pipeline is already registered to the Reader. Registering pipeline multiple times is not allowed."


Should we use message here as well instead of error?

utpilla · 2024-10-23T22:18:50Z

opentelemetry-sdk/src/metrics/manual_reader.rs

-                ))
+            }  else {
+                otel_debug!(
+                    name: "ManualReader.RegisterPipeline.DuplicateRegistration",


I think it's better to leave out the implementation details from the event names. We might or might not change this method name or we might rename pipeline to something else in the future.

Suggested change

-                    name: "ManualReader.RegisterPipeline.DuplicateRegistration",
+                    name: "ManualReader.DuplicateRegistration",

We should come up with some convention all these events would follow. For example, something like this:
{crate}.{component}.{optional subcomponent}.{eventName}

Sdk.MetricReader.ManualReader.DuplicateRegistration

We are already adding the crate name in the macro. I am trying to follow this naming:
ModuleName.SecondLevel.OptionalThirdLevel.EventName

The second level could be a struct or method in this module.

Internal modules could still be moved around or refactored so I think we should rely on:
component name (spec concepts such as MeterProvider, MetricReader, etc.) -> sub component(s) -> event name
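As an illustration of the last convention (component name, then sub-components, then event name), a small helper could compose the dotted name. This is only a sketch; `event_name` is a hypothetical function, not an SDK API, and it leaves out the crate prefix since the macro is said to add that already.

```rust
/// Hypothetical helper: joins a spec-level component, optional sub-components,
/// and an event name into a single dotted event name.
pub fn event_name(component: &str, sub_components: &[&str], event: &str) -> String {
    let mut parts = Vec::with_capacity(sub_components.len() + 2);
    parts.push(component);
    parts.extend_from_slice(sub_components);
    parts.push(event);
    parts.join(".")
}

fn main() {
    // Component and sub-component are spec concepts, not Rust method names.
    let name = event_name("MetricReader", &["ManualReader"], "DuplicateRegistration");
    println!("{}", name);
}
```

Because the parts are spec concepts rather than method or module names, renaming an internal function would not change the emitted event name.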

opentelemetry-sdk/src/metrics/meter.rs

utpilla · 2024-10-23T23:05:38Z

opentelemetry-sdk/src/metrics/periodic_reader.rs

+                            otel_error!( name: "PeriodicReader.ExportFailed", error  = format!("{:?}", err));
+                        }
+                        MetricsError::ReaderShutdown => {
+                            otel_debug!( name: "PeriodicReader.ReaderShutdown", error = format!("{:?}", err));


I think we could use otel_error for each of these variants as they all seem important. Metrics export is anyway not a hot-path operation so we don't have to worry about flooding of error messages.

I thought of doing it the same way, but @cijothomas commented in earlier PRs that only user-actionable failures should be logged at error level and the rest at debug, regardless of whether they are on the hot path.

Shutdown failures are propagated to users via Result anyway, so there is no need for another Error-level log.
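The split discussed in this thread (user-actionable failures at error level, failures already surfaced through a Result at debug level) can be sketched as a plain mapping. `ReaderError`, `Level`, and `level_for` are simplified, hypothetical stand-ins for illustration; they are not the SDK's `MetricsError` or its logging macros.

```rust
/// Hypothetical severity levels for internal logs.
#[derive(Debug, PartialEq)]
pub enum Level {
    Error,
    Debug,
}

/// Simplified stand-in for the error variants handled by the periodic reader.
#[derive(Debug)]
pub enum ReaderError {
    ExportFailed(String),
    ReaderShutdown,
}

/// Picks a log level per variant, following the reasoning in the thread.
pub fn level_for(err: &ReaderError) -> Level {
    match err {
        // The user may need to act on a failed export, so it is logged as an error.
        ReaderError::ExportFailed(_) => Level::Error,
        // Shutdown failures already reach the caller through a Result.
        ReaderError::ReaderShutdown => Level::Debug,
    }
}

fn main() {
    let err = ReaderError::ExportFailed("connection refused".to_string());
    println!("{:?} -> {:?}", err, level_for(&err));
}
```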

utpilla · 2024-10-23T23:09:46Z

opentelemetry-sdk/src/metrics/pipeline.rs

+                    kind = format!("{:?}", id.kind),
+                    unit = format!("{}",id.unit),
+                    number = format!("{}", id.number),
+                    existing_name = format!("{}", existing.name),


I think we can skip logging this set of attributes twice. The event name says that we have a duplicate stream so it should already be understood that we have two identical sets of attributes.

I am not very sure of the logic here. Can you check it once? We return at line 414 if they are the same.

utpilla · 2024-10-23T23:22:24Z

opentelemetry-sdk/src/metrics/pipeline.rs

@@ -710,8 +712,7 @@ where

        if errs.is_empty() {
            if measures.is_empty() {
-                // TODO: Emit internal log that measurements from the instrument
-                // are being dropped due to view configuration
+                // Error is logged elsewhere.


Remove the if check?

utpilla · 2024-10-23T23:31:46Z

opentelemetry-sdk/src/metrics/periodic_reader.rs

-            }
+            Either::Right(_) => Err(MetricsError::ExportTimeout(
+                "PeriodicReader".into(),
+                self.timeout.as_nanos(),


Why are we converting this into nanoseconds?

Also, we don't seem to be using the string or time duration in logging at all. Do we need them?

We use them here: #[error("Metrics reader {0} failed with timeout. Max configured timeout: {1} ns")] ExportTimeout(String, u128),

Two questions here:

Why are we adding these new variants to the MetricsError enum? Would the end user be able to consume these variants? They only seem to be used in internal methods.

If we decide that these variants have to be public, why use nanoseconds for the timeout? Why not milliseconds?
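For reference on the units question: `Duration::as_nanos` and `Duration::as_millis` both return `u128`, so the choice only affects readability of the message. The sketch below shows one possible alternative, assuming the variant stored the `Duration` itself and picked the unit when formatting; `ReaderError` here is a hypothetical stand-in, not the SDK's `MetricsError`.

```rust
use std::fmt;
use std::time::Duration;

/// Hypothetical error type that keeps the timeout as a `Duration`
/// and chooses the display unit only when the message is rendered.
#[derive(Debug)]
pub enum ReaderError {
    ExportTimeout {
        reader: &'static str,
        timeout: Duration,
    },
}

impl fmt::Display for ReaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ReaderError::ExportTimeout { reader, timeout } => write!(
                f,
                "Metrics reader {} failed with timeout. Max configured timeout: {} ms",
                reader,
                timeout.as_millis()
            ),
        }
    }
}

fn main() {
    let timeout = Duration::from_secs(30);
    println!("as_nanos  = {}", timeout.as_nanos());
    println!("as_millis = {}", timeout.as_millis());
    let err = ReaderError::ExportTimeout {
        reader: "PeriodicReader",
        timeout,
    };
    println!("{}", err);
}
```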

cijothomas · 2024-10-24T01:38:34Z

opentelemetry-sdk/src/metrics/internal/mod.rs

-                message = "Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric until next collect will not be logged."
+            //TODO -  include name of meter, instrument
+            otel_warn!( name: "MetricCardinalityLimitReached",
+                message = format!("{}", "Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric will not be logged until next collect."),


What is the need for format! here?
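A small std-only illustration of the point behind this question: wrapping a string literal in `format!("{}", ...)` yields an owned `String` with the same text, so it only adds an allocation. `MESSAGE`, `via_format`, and `via_literal` are names made up for this sketch, and the message is shortened.

```rust
/// Shortened copy of the message from the diff, used only for illustration.
const MESSAGE: &str = "Maximum data points for metric stream exceeded. Entry added to overflow.";

/// Builds the message the way the diff does: a heap-allocated copy of the literal.
fn via_format() -> String {
    format!("{}", MESSAGE)
}

/// Returns the literal directly: no allocation, same text.
fn via_literal() -> &'static str {
    MESSAGE
}

fn main() {
    // Both forms carry identical text; only the ownership and allocation differ.
    assert_eq!(via_format(), via_literal());
    println!("{}", via_literal());
}
```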

cijothomas · 2024-10-24T01:51:17Z

opentelemetry-sdk/src/metrics/meter.rs

@@ -75,14 +75,18 @@ impl SdkMeter {
    {
        let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
        if let Err(err) = validation_result {
-            global::handle_error(err);
+            otel_error!(


otel_error!(
    name: "InstrumentCreationFailed",
    meter_name = self.scope.name.as_ref(),
    instrument_name = builder.name.as_ref(),
    message = "Measurements from this instrument will be ignored.",
    reason = format!("{}", err)
);

^ I prefer this version. Please see if this is better for end users.

Error vs. Warning: this is debatable. We need to find some agreement to follow throughout.
If the entire telemetry flow is affected, then Error is apt.
For missing measurements from a single instrument, Warn might be sufficient. This is good to follow up on separately anyway.

cijothomas · 2024-10-24T01:58:38Z

@lalitb @utpilla This is getting too big, with a large number of comments, making it even harder to keep up.
Can you reduce the scope so we can focus on one file or one small section within a file at a time? A lot of internal logging requires reviewers to also understand the overall flow, and that is a lot easier to do with a very targeted PR.

lalitb · 2024-10-24T03:48:01Z

I can try that. I can move this to draft for now and will create further small PRs that incorporate the existing comments. Let me know if that's fine.

cijothomas · 2024-10-24T04:19:35Z

Yes that works!

lalitb · 2024-11-06T22:25:42Z

Closing; this PR has been split into multiple small PRs.

lalitb added 2 commits October 8, 2024 16:53

initial commit

704b848

change name for exponential histogram

b8cb6af

lalitb requested a review from a team as a code owner October 8, 2024 23:56

lalitb commented Oct 8, 2024

opentelemetry/src/global/internal_logging.rs

Merge branch 'main' into global-error-handler-cleanup-metrics-sdk

a0b6eee

cijothomas reviewed Oct 9, 2024

opentelemetry-sdk/src/metrics/pipeline.rs

cijothomas reviewed Oct 9, 2024

opentelemetry-sdk/src/metrics/periodic_reader.rs

rever changes

acf97fa

cijothomas reviewed Oct 9, 2024

opentelemetry-sdk/src/metrics/internal/mod.rs

lalitb and others added 7 commits October 8, 2024 18:40

Update opentelemetry-sdk/src/metrics/internal/mod.rs

7b48f14

Co-authored-by: Cijo Thomas <[email protected]>

convert otel_error to otel_warn for cardinality limit

ac61b79

log both existing and new instrument values

a42d516

Merge branch 'main' into global-error-handler-cleanup-metrics-sdk

dbaa7f5

build error

73fca4d

Merge branch 'global-error-handler-cleanup-metrics-sdk' of github.com:lalitb/opentelemetry-rust into global-error-handler-cleanup-metrics-sdk

ee5c5f5

remove formatting in wrapper macros

3aa97cf

cijothomas reviewed Oct 10, 2024

cijothomas reviewed Oct 11, 2024

lalitb added 2 commits October 22, 2024 22:05

Merge branch 'main' into global-error-handler-cleanup-metrics-sdk

de54afe

Merge branch 'main' into global-error-handler-cleanup-metrics-sdk

e62de04

lalitb added 4 commits October 23, 2024 12:14

review comments

bda9faa

resolve conflict

4a34922

lint error

0087c24

add comment for exponential histogram

115e73f

utpilla reviewed Oct 23, 2024

opentelemetry-sdk/src/metrics/internal/mod.rs

utpilla reviewed Oct 23, 2024

opentelemetry-sdk/src/metrics/internal/mod.rs

lalitb marked this pull request as draft October 23, 2024 21:26

lalitb and others added 4 commits October 23, 2024 15:03

fix

56682a4

Update opentelemetry-sdk/src/metrics/internal/mod.rs

7e48cbf

Co-authored-by: Utkarsh Umesan Pillai <[email protected]>

use message for otel_warn

d81b374

merge conflict

4b09c92

lalitb marked this pull request as ready for review October 23, 2024 22:12

utpilla reviewed Oct 23, 2024

opentelemetry-sdk/src/metrics/internal/mod.rs

utpilla reviewed Oct 23, 2024

opentelemetry-sdk/src/metrics/meter.rs

utpilla reviewed Oct 23, 2024

opentelemetry-sdk/src/metrics/meter.rs

utpilla reviewed Oct 23, 2024

lalitb and others added 3 commits October 23, 2024 16:42

Update opentelemetry-sdk/src/metrics/meter.rs

949930e

Co-authored-by: Utkarsh Umesan Pillai <[email protected]>

Update opentelemetry-sdk/src/metrics/internal/mod.rs

6dcc9eb

Co-authored-by: Utkarsh Umesan Pillai <[email protected]>

Update opentelemetry-sdk/src/metrics/meter.rs

4b42fe8

Co-authored-by: Utkarsh Umesan Pillai <[email protected]>

cijothomas reviewed Oct 24, 2024

lalitb marked this pull request as draft October 24, 2024 04:23

lalitb closed this Nov 6, 2024
