Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global error handler cleanup - Metrics SDK #2185

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
704b848
initial commit
lalitb Oct 8, 2024
b8cb6af
change name for exponential histogram
lalitb Oct 8, 2024
a0b6eee
Merge branch 'main' into global-error-handler-cleanup-metrics-sdk
lalitb Oct 9, 2024
acf97fa
rever changes
lalitb Oct 9, 2024
7b48f14
Update opentelemetry-sdk/src/metrics/internal/mod.rs
lalitb Oct 9, 2024
ac61b79
convert otel_error to otel_warn for cardinality limit
lalitb Oct 9, 2024
a42d516
log both existing and new instrument values
lalitb Oct 9, 2024
dbaa7f5
Merge branch 'main' into global-error-handler-cleanup-metrics-sdk
lalitb Oct 9, 2024
73fca4d
build error
lalitb Oct 9, 2024
ee5c5f5
Merge branch 'global-error-handler-cleanup-metrics-sdk' of github.com…
lalitb Oct 9, 2024
3aa97cf
remove formatting in wrapper macros
lalitb Oct 10, 2024
de54afe
Merge branch 'main' into global-error-handler-cleanup-metrics-sdk
lalitb Oct 23, 2024
e62de04
Merge branch 'main' into global-error-handler-cleanup-metrics-sdk
lalitb Oct 23, 2024
bda9faa
review comments
lalitb Oct 23, 2024
4a34922
resolve conflict
lalitb Oct 23, 2024
0087c24
lint error
lalitb Oct 23, 2024
115e73f
add comment for exponential histogram
lalitb Oct 23, 2024
56682a4
fix
lalitb Oct 23, 2024
7e48cbf
Update opentelemetry-sdk/src/metrics/internal/mod.rs
lalitb Oct 23, 2024
d81b374
use message for otel_warn
lalitb Oct 23, 2024
4b09c92
merge conflict
lalitb Oct 23, 2024
949930e
Update opentelemetry-sdk/src/metrics/meter.rs
lalitb Oct 23, 2024
6dcc9eb
Update opentelemetry-sdk/src/metrics/internal/mod.rs
lalitb Oct 23, 2024
4b42fe8
Update opentelemetry-sdk/src/metrics/meter.rs
lalitb Oct 23, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions opentelemetry-sdk/src/metrics/internal/aggregate.rs
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,11 @@ pub(crate) fn is_under_cardinality_limit(size: usize) -> bool {
size < STREAM_CARDINALITY_LIMIT as usize
}

/// Checks whether aggregator has hit cardinality limit for metric streams
pub(crate) fn cardinality_limit() -> usize {
STREAM_CARDINALITY_LIMIT as usize
}

/// Receives measurements to be aggregated.
pub(crate) trait Measure<T>: Send + Sync + 'static {
fn call(&self, measurement: T, attrs: &[KeyValue]);
Expand Down
21 changes: 17 additions & 4 deletions opentelemetry-sdk/src/metrics/internal/exponential_histogram.rs
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
use std::{collections::HashMap, f64::consts::LOG2_E, sync::Mutex, time::SystemTime};

use once_cell::sync::Lazy;
use opentelemetry::{metrics::MetricsError, KeyValue};
use opentelemetry::{otel_error, KeyValue};

use crate::{
metrics::data::{self, Aggregation, Temporality},
Expand Down Expand Up @@ -100,9 +100,22 @@ impl<T: Number> ExpoHistogramDataPoint<T> {
if (self.scale - scale_delta as i8) < EXPO_MIN_SCALE {
// With a scale of -10 there is only two buckets for the whole range of f64 values.
// This can only happen if there is a max size of 1.
opentelemetry::global::handle_error(MetricsError::Other(
"exponential histogram scale underflow".into(),
));

// This error is logged when a measurement is dropped because its value is either too high
// or too low, falling outside the allowable range defined by the scale and max_size.
// If these values are expected, the user should adjust the histogram configuration
// to accommodate the range.
otel_error!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the inner workings enough to give a strong opinion - but unless this is a auto recoverable error, this can flood the error log.

Copy link
Member Author

@lalitb lalitb Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As what I can understand this part of code, this error occurs with restrictive max_size configuration, while the application is recording measurements with values that are far apart than what allowed by max_size. And error would be logged whenever the faulty measurement is recorded. If these faulty measurements are not frequent, the error log won't be flooded, else it can. Again, either some kind of throttling or simply flag to log only once need to be added. Let me know what you suggest, else I can keep TODO to revisit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we are 100% sure this cannot cause flooding of logs, lets remove the log from here, and leave a TODO to add logging once we understand more.

name: "ExponentialHistogramDataPoint.Scale.Underflow",
current_scale = self.scale,
scale_delta = scale_delta,
max_size = self.max_size,
min_scale = EXPO_MIN_SCALE,
record_min_max = self.record_min_max,
record_sum = self.record_sum,
value = format!("{:?}", v),
error = format!("The measurement will be dropped due to scale underflow. Check the histogram configuration")
);
return;
}
// Downscale
Expand Down
12 changes: 6 additions & 6 deletions opentelemetry-sdk/src/metrics/internal/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12,12 +12,11 @@ use std::ops::{Add, AddAssign, Sub};
use std::sync::atomic::{AtomicBool, AtomicI64, AtomicU64, AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};

use aggregate::is_under_cardinality_limit;
use aggregate::{cardinality_limit, is_under_cardinality_limit};
pub(crate) use aggregate::{AggregateBuilder, ComputeAggregation, Measure};
pub(crate) use exponential_histogram::{EXPO_MAX_SCALE, EXPO_MIN_SCALE};
use once_cell::sync::Lazy;
use opentelemetry::metrics::MetricsError;
use opentelemetry::{global, otel_warn, KeyValue};
use opentelemetry::{otel_warn, KeyValue};

use crate::metrics::AttributeSet;

Expand Down Expand Up @@ -146,9 +145,10 @@ impl<AU: AtomicallyUpdate<T>, T: Number, O: Operation> ValueMap<AU, T, O> {
let new_tracker = AU::new_atomic_tracker(self.buckets_count);
O::update_tracker(&new_tracker, measurement, index);
trackers.insert(STREAM_OVERFLOW_ATTRIBUTES.clone(), Arc::new(new_tracker));
global::handle_error(MetricsError::Other("Warning: Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric until next collect will not be logged.".into()));
otel_warn!( name: "ValueMap.measure",
message = "Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric until next collect will not be logged."
//TODO - include name of meter, instrument
otel_warn!( name: "MetricCardinalityLimitReached",
message = format!("{}", "Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric will not be logged until next collect."),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the need for format?

cardinality_limit = cardinality_limit() as u64,
);
}
}
Expand Down
11 changes: 6 additions & 5 deletions opentelemetry-sdk/src/metrics/manual_reader.rs
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,8 @@ use std::{
};

use opentelemetry::{
global,
metrics::{MetricsError, Result},
otel_debug,
};

use super::{
Expand Down Expand Up @@ -76,10 +76,11 @@ impl MetricReader for ManualReader {
// Only register once. If producer is already set, do nothing.
if inner.sdk_producer.is_none() {
inner.sdk_producer = Some(pipeline);
} else {
global::handle_error(MetricsError::Config(
"duplicate reader registration, did not register manual reader".into(),
))
} else {
otel_debug!(
name: "ManualReader.RegisterPipeline.DuplicateRegistration",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to leave out the implementation details from the event names. We might or might not change this method name or we might rename pipeline to something else in the future.

Suggested change
name: "ManualReader.RegisterPipeline.DuplicateRegistration",
name: "ManualReader.DuplicateRegistration",

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should come up with some convention all these events would follow. For example, something like this:
{crate}.{component}.{optional subcomponent}.{eventName}

Sdk.MetricReader.ManualReader.DuplicateRegistration

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are already adding crate name in the macro. Trying to follow the naming as
ModuleName.SecondLevel.OptionalThirdLevel.EventName

second lvel could be struct / method in this module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internal modules could still be moved around or refactored so I think we should rely on:
component name (spec concepts such as MeterProvider, MetricReader, etc.) -> sub component(s) -> event name

error = "The pipeline is already registered to the Reader. Registering pipeline multiple times is not allowed."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use message here as well instead of error?

);
}
});
}
Expand Down
Loading
Loading