Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global error handler cleanup - Metrics SDK #2185

Conversation

lalitb
Copy link
Member

@lalitb lalitb commented Oct 8, 2024

Fixes #
Design discussion issue (if applicable) #

Changes

Please provide a brief description of the changes here.

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@lalitb lalitb requested a review from a team as a code owner October 8, 2024 23:56
Copy link

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 15.74074% with 91 lines in your changes missing coverage. Please review.

Project coverage is 78.9%. Comparing base (8bd529a) to head (115e73f).

Files with missing lines Patch % Lines
opentelemetry-sdk/src/metrics/meter.rs 21.2% 52 Missing ⚠️
opentelemetry-sdk/src/metrics/pipeline.rs 0.0% 12 Missing ⚠️
opentelemetry-sdk/src/metrics/view.rs 8.3% 11 Missing ⚠️
...-sdk/src/metrics/internal/exponential_histogram.rs 0.0% 9 Missing ⚠️
...pentelemetry-sdk/src/metrics/internal/aggregate.rs 0.0% 3 Missing ⚠️
opentelemetry-sdk/src/metrics/internal/mod.rs 50.0% 2 Missing ⚠️
opentelemetry-sdk/src/metrics/manual_reader.rs 0.0% 1 Missing ⚠️
opentelemetry-sdk/src/metrics/meter_provider.rs 0.0% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #2185     +/-   ##
=======================================
- Coverage   79.1%   78.9%   -0.3%     
=======================================
  Files        121     121             
  Lines      21171   21220     +49     
=======================================
- Hits       16760   16751      -9     
- Misses      4411    4469     +58     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

opentelemetry::global::handle_error(MetricsError::Other(
"exponential histogram scale underflow".into(),
));
otel_error!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the inner workings enough to give a strong opinion - but unless this is a auto recoverable error, this can flood the error log.

Copy link
Member Author

@lalitb lalitb Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As what I can understand this part of code, this error occurs with restrictive max_size configuration, while the application is recording measurements with values that are far apart than what allowed by max_size. And error would be logged whenever the faulty measurement is recorded. If these faulty measurements are not frequent, the error log won't be flooded, else it can. Again, either some kind of throttling or simply flag to log only once need to be added. Let me know what you suggest, else I can keep TODO to revisit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we are 100% sure this cannot cause flooding of logs, lets remove the log from here, and leave a TODO to add logging once we understand more.

global::handle_error(MetricsError::Config(
"duplicate reader registration, did not register manual reader".into(),
))
otel_error!(name: "ManualReader.RegisterPipeline.DuplicateRegistration");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

info/debug only. Even if a user gets this message, they won't know what to do. Its helpful to us only.

@@ -74,7 +74,7 @@ impl SdkMeter {
{
let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
if let Err(err) = validation_result {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateCounter.ValidationError", error = format!("{}", err));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree Error is the right severity for this, but this is still not so user-friendly.
Since this is user facing, the internal log should be clearly actionable, along with the impact of this error.
In this case, the impact is - measurements reported using this instrument will be ignore.
The "why" is also not so obvious.
As an end-user, I'd prefer to see something like

Name: InstrumentCreationFailed
InstrumentName: A_non_asccii_Metric
MeterName: MeterName
Reason: Instrument Name contains non-ascii charactors. Link to spec, optionally.
Message: Measurements from this instruments will be ignored.

@@ -90,7 +90,7 @@ impl SdkMeter {
{
Ok(counter) => Ok(counter),
Err(err) => {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateCounter.Error", error = format!("{}", err));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as https://github.com/open-telemetry/opentelemetry-rust/pull/2185/files#r1796240076
I dont think a user would know the difference between ValidationError vs Error.

@cijothomas
Copy link
Member

@lalitb Let us know if this is ready for another review. Unfortunately, it got some conflicts too to be resolved, hopefully simple ones.

@lalitb
Copy link
Member Author

lalitb commented Oct 23, 2024

@lalitb Let us know if this is ready for another review. Unfortunately, it got some conflicts too to be resolved, hopefully simple ones.

Resolved the conflicts. Will go through the review comments now.

@lalitb lalitb marked this pull request as draft October 23, 2024 21:26
@lalitb lalitb marked this pull request as ready for review October 23, 2024 22:12
} else {
otel_debug!(
name: "ManualReader.RegisterPipeline.DuplicateRegistration",
error = "The pipeline is already registered to the Reader. Registering pipeline multiple times is not allowed."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use message here as well instead of error?

))
} else {
otel_debug!(
name: "ManualReader.RegisterPipeline.DuplicateRegistration",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to leave out the implementation details from the event names. We might or might not change this method name or we might rename pipeline to something else in the future.

Suggested change
name: "ManualReader.RegisterPipeline.DuplicateRegistration",
name: "ManualReader.DuplicateRegistration",

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should come up with some convention all these events would follow. For example, something like this:
{crate}.{component}.{optional subcomponent}.{eventName}

Sdk.MetricReader.ManualReader.DuplicateRegistration

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are already adding crate name in the macro. Trying to follow the naming as
ModuleName.SecondLevel.OptionalThirdLevel.EventName

second lvel could be struct / method in this module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internal modules could still be moved around or refactored so I think we should rely on:
component name (spec concepts such as MeterProvider, MetricReader, etc.) -> sub component(s) -> event name

otel_error!( name: "PeriodicReader.ExportFailed", error = format!("{:?}", err));
}
MetricsError::ReaderShutdown => {
otel_debug!( name: "PeriodicReader.ReaderShutdown", error = format!("{:?}", err));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could use otel_error for each of these variants as they all seem important. Metrics export is anyway not a hot-path operation so we don't have to worry about flooding of error messages.

Copy link
Member Author

@lalitb lalitb Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of doing same way, @cijothomas commented in earlier PRs to only keep user-actionable one's in the error, and rest in debug - irrespective of not in hot-path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shutdown failures are propagated to users via Result anyway, so no need of another Error level log.

kind = format!("{:?}", id.kind),
unit = format!("{}",id.unit),
number = format!("{}", id.number),
existing_name = format!("{}", existing.name),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can skip logging this set of attributes twice. The event name says that we have a duplicate stream so it should already be understood that we have two identical sets of attributes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very sure of the logic here. Can you check once. As we return at line 414 if they are same.

@@ -710,8 +712,7 @@ where

if errs.is_empty() {
if measures.is_empty() {
// TODO: Emit internal log that measurements from the instrument
// are being dropped due to view configuration
// Error is logged elsewhere.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the if check?

}
Either::Right(_) => Err(MetricsError::ExportTimeout(
"PeriodicReader".into(),
self.timeout.as_nanos(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we converting this into nanoseconds?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we don't seem to be using the string or time duration in logging at all. Do we need them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are doing it here

#[error("Metrics reader {0} failed with timeout. Max configured timeout: {1} ns")]
ExportTimeout(String, u128),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions here:

  1. Why are we adding these new variants to the MetricsError enum? Would the end user be able to consume these variants? They only seem to be used in internal methods.
  2. If we decide that these variants have to be public, why use nanoseconds for time out? Why not milliseconds?

message = "Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric until next collect will not be logged."
//TODO - include name of meter, instrument
otel_warn!( name: "MetricCardinalityLimitReached",
message = format!("{}", "Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric will not be logged until next collect."),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the need for format?

@@ -75,14 +75,18 @@ impl SdkMeter {
{
let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
if let Err(err) = validation_result {
global::handle_error(err);
otel_error!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otel_error!(
name: "InstrumentCreationFailed",
meter_name = self.scope.name.as_ref(),
instrument_name = builder.name.as_ref(),
message = "Measurements from this instrument will be ignored."
reason = fmt(err))

^ I prefer this version. Please see if this is better for end users.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error vs Warning - this is debatable. We need to find some agreement to follow throughout.
If entire telemetry flow is affected, then Error is apt.
for missing measurements from a single instrument - warn might be sufficient. This is good to be followed up separately anyway.

@cijothomas
Copy link
Member

@lalitb @utpilla This is getting too big with large number of comments, making it even harder to keep up.
Can you reduce scope so we can focus on one file/small-section-within-a-file at a time? A lot of internal logging requires reviewers to also understand the overall flow and its a lot easier to do with very targeted PR.

@lalitb
Copy link
Member Author

lalitb commented Oct 24, 2024

@lalitb @utpilla This is getting too big with large number of comments, making it even harder to keep up. Can you reduce scope so we can focus on one file/small-section-within-a-file at a time? A lot of internal logging requires reviewers to also understand the overall flow and its a lot easier to do with very targeted PR.

I can try that. Can move this to draft for now, and will create further smalls along with incorporating the existing comments. Let me know if that's fine ?

@cijothomas
Copy link
Member

@lalitb @utpilla This is getting too big with large number of comments, making it even harder to keep up. Can you reduce scope so we can focus on one file/small-section-within-a-file at a time? A lot of internal logging requires reviewers to also understand the overall flow and its a lot easier to do with very targeted PR.

I can try that. Can move this to draft for now, and will create further smalls along with incorporating the existing comments. Let me know if that's fine ?

Yes that works!

@lalitb lalitb marked this pull request as draft October 24, 2024 04:23
@lalitb
Copy link
Member Author

lalitb commented Nov 6, 2024

closing, the PRs has been split to multiple small PRs.

@lalitb lalitb closed this Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants