Global error handler cleanup - Metrics SDK #2185
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff            @@
##              main    #2185    +/-   ##
=========================================
- Coverage     79.1%    78.9%    -0.3%
=========================================
  Files          121      121
  Lines        21171    21220      +49
=========================================
- Hits         16760    16751       -9
- Misses        4411     4469      +58

☔ View full report in Codecov by Sentry.
Co-authored-by: Cijo Thomas <[email protected]>
…:lalitb/opentelemetry-rust into global-error-handler-cleanup-metrics-sdk
opentelemetry::global::handle_error(MetricsError::Other(
    "exponential histogram scale underflow".into(),
));
otel_error!(
I don't know the inner workings enough to give a strong opinion, but unless this is an auto-recoverable error, this can flood the error log.
From what I understand of this code, this error occurs with a restrictive max_size configuration, when the application records measurements whose values are farther apart than max_size allows. The error would be logged whenever such a faulty measurement is recorded. If these faulty measurements are infrequent, the error log won't be flooded; otherwise it can be. Either some kind of throttling or a simple flag to log only once (sketched after this thread) needs to be added. Let me know what you suggest, else I can keep a TODO to revisit.
Unless we are 100% sure this cannot cause flooding of logs, let's remove the log from here and leave a TODO to add logging once we understand more.
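A minimal sketch of the "log only once" flag discussed above, assuming the aggregator can own an atomic flag. The type, field name, and the eprintln! stand-in are illustrative only; the SDK would use its own internal logging macros and embed the flag in the histogram aggregator.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative guard only; not the SDK's actual internals.
struct ScaleUnderflowGuard {
    logged: AtomicBool,
}

impl ScaleUnderflowGuard {
    fn new() -> Self {
        Self { logged: AtomicBool::new(false) }
    }

    fn report(&self) {
        // `swap` returns the previous value, so only the first caller logs.
        if !self.logged.swap(true, Ordering::Relaxed) {
            eprintln!("exponential histogram scale underflow");
        }
    }
}

fn main() {
    let guard = ScaleUnderflowGuard::new();
    guard.report(); // logged
    guard.report(); // suppressed
}
```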
global::handle_error(MetricsError::Config(
    "duplicate reader registration, did not register manual reader".into(),
))
otel_error!(name: "ManualReader.RegisterPipeline.DuplicateRegistration");
Info/debug only. Even if a user gets this message, they won't know what to do. It's helpful to us only.
@@ -74,7 +74,7 @@ impl SdkMeter {
{
    let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
    if let Err(err) = validation_result {
        global::handle_error(err);
        otel_error!(name: "SdkMeter.CreateCounter.ValidationError", error = format!("{}", err));
Agreed, Error is the right severity for this, but the message is still not user-friendly.
Since this is user-facing, the internal log should be clearly actionable and state the impact of the error.
In this case, the impact is that measurements reported using this instrument will be ignored.
The "why" is also not so obvious.
As an end user, I'd prefer to see something like:
Name: InstrumentCreationFailed
InstrumentName: A_non_ascii_Metric
MeterName: MeterName
Reason: Instrument name contains non-ASCII characters. Link to spec, optionally.
Message: Measurements from this instrument will be ignored.
@@ -90,7 +90,7 @@ impl SdkMeter {
{
    Ok(counter) => Ok(counter),
    Err(err) => {
        global::handle_error(err);
        otel_error!(name: "SdkMeter.CreateCounter.Error", error = format!("{}", err));
Same comment as https://github.com/open-telemetry/opentelemetry-rust/pull/2185/files#r1796240076
I don't think a user would know the difference between ValidationError and Error.
@lalitb Let us know if this is ready for another review. Unfortunately, it has also picked up some conflicts to be resolved, hopefully simple ones.
Resolved the conflicts. Will go through the review comments now.
Co-authored-by: Utkarsh Umesan Pillai <[email protected]>
} else {
    otel_debug!(
        name: "ManualReader.RegisterPipeline.DuplicateRegistration",
        error = "The pipeline is already registered to the Reader. Registering pipeline multiple times is not allowed."
Should we use message here as well instead of error?
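Sketched out, the suggestion would be the same call as in the diff above with the key renamed from error to message; this assumes otel_debug! accepts a message key the same way it accepts error.

```rust
otel_debug!(
    name: "ManualReader.RegisterPipeline.DuplicateRegistration",
    message = "The pipeline is already registered to the Reader. Registering pipeline multiple times is not allowed."
);
```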
))
} else {
    otel_debug!(
        name: "ManualReader.RegisterPipeline.DuplicateRegistration",
I think it's better to leave implementation details out of the event names. We might change this method name, or we might rename "pipeline" to something else in the future.
name: "ManualReader.RegisterPipeline.DuplicateRegistration", | |
name: "ManualReader.DuplicateRegistration", |
We should come up with a convention that all these events would follow. For example, something like this:
{crate}.{component}.{optional subcomponent}.{eventName}
Sdk.MetricReader.ManualReader.DuplicateRegistration
We are already adding the crate name in the macro. Trying to follow the naming as
ModuleName.SecondLevel.OptionalThirdLevel.EventName
where the second level could be a struct or method in this module.
Internal modules could still be moved around or refactored, so I think we should rely on:
component name (spec concepts such as MeterProvider, MetricReader, etc.) -> sub component(s) -> event name
otel_error!( name: "PeriodicReader.ExportFailed", error = format!("{:?}", err));
}
MetricsError::ReaderShutdown => {
    otel_debug!( name: "PeriodicReader.ReaderShutdown", error = format!("{:?}", err));
I think we could use otel_error for each of these variants, as they all seem important. Metrics export is not a hot-path operation anyway, so we don't have to worry about flooding of error messages.
I thought of doing it the same way, but @cijothomas commented in earlier PRs to keep only user-actionable errors at error level and the rest at debug, irrespective of whether they are on the hot path.
Shutdown failures are propagated to users via Result anyway, so there is no need for another error-level log.
kind = format!("{:?}", id.kind),
unit = format!("{}", id.unit),
number = format!("{}", id.number),
existing_name = format!("{}", existing.name),
I think we can skip logging this set of attributes twice. The event name says that we have a duplicate stream so it should already be understood that we have two identical sets of attributes.
I am not very sure of the logic here. Can you check once? We return at line 414 if they are the same.
@@ -710,8 +712,7 @@ where

if errs.is_empty() {
    if measures.is_empty() {
        // TODO: Emit internal log that measurements from the instrument
        // are being dropped due to view configuration
        // Error is logged elsewhere.
Remove the if check?
}
Either::Right(_) => Err(MetricsError::ExportTimeout(
    "PeriodicReader".into(),
    self.timeout.as_nanos(),
Why are we converting this into nanoseconds?
Also, we don't seem to be using the string or time duration in logging at all. Do we need them?
We are doing it here:
#[error("Metrics reader {0} failed with timeout. Max configured timeout: {1} ns")]
ExportTimeout(String, u128),
Two questions here:
- Why are we adding these new variants to the MetricsError enum? Would the end user be able to consume these variants? They only seem to be used in internal methods.
- If we decide that these variants have to be public, why use nanoseconds for the timeout? Why not milliseconds?
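For comparison, a milliseconds-based variant could look roughly like the sketch below. The enum and function names are hypothetical, not the actual MetricsError definition; the point is only that Duration::as_millis() yields the same u128 payload type as as_nanos(), so switching units would not change the variant's shape.

```rust
use std::time::Duration;
use thiserror::Error;

// Hypothetical error type for illustration; not the SDK's MetricsError.
#[derive(Debug, Error)]
enum ReaderError {
    #[error("Metrics reader {0} failed with timeout. Max configured timeout: {1} ms")]
    ExportTimeout(String, u128),
}

fn export_timeout(timeout: Duration) -> ReaderError {
    // Duration::as_millis() also returns u128, so the payload type is unchanged.
    ReaderError::ExportTimeout("PeriodicReader".into(), timeout.as_millis())
}

fn main() {
    let err = export_timeout(Duration::from_secs(30));
    println!("{err}");
}
```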
Co-authored-by: Utkarsh Umesan Pillai <[email protected]>
message = "Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric until next collect will not be logged." | ||
//TODO - include name of meter, instrument | ||
otel_warn!( name: "MetricCardinalityLimitReached", | ||
message = format!("{}", "Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric will not be logged until next collect."), |
What is the need for format?
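Sketched, the same call without the format! wrapper; a string literal can be passed directly, as the pre-existing message line in this hunk already does.

```rust
otel_warn!(
    name: "MetricCardinalityLimitReached",
    message = "Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric will not be logged until next collect."
);
```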
@@ -75,14 +75,18 @@ impl SdkMeter {
{
    let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
    if let Err(err) = validation_result {
        global::handle_error(err);
        otel_error!(
otel_error!(
    name: "InstrumentCreationFailed",
    meter_name = self.scope.name.as_ref(),
    instrument_name = builder.name.as_ref(),
    message = "Measurements from this instrument will be ignored.",
    reason = format!("{}", err)
);
^ I prefer this version. Please see if this is better for end users.
Error vs Warning - this is debatable. We need to find some agreement to follow throughout.
If the entire telemetry flow is affected, then Error is apt.
For missing measurements from a single instrument, Warn might be sufficient. This is good to follow up on separately anyway.
@lalitb @utpilla This is getting too big with a large number of comments, making it even harder to keep up.
I can try that. I can move this to draft for now and create further small PRs while incorporating the existing comments. Let me know if that's fine.
Yes, that works!
Closing; the PR has been split into multiple small PRs.
Fixes #
Design discussion issue (if applicable) #
Changes
Please provide a brief description of the changes here.
Merge requirement checklist
CHANGELOG.md files updated for non-trivial, user-facing changes