[Metrics Framework] Add enable/disable switch to metrics framework #11627

Closed
atharvasharma61 opened this issue Dec 18, 2023 · 16 comments
Labels: enhancement, Plugins

Comments

@atharvasharma61
Contributor

atharvasharma61 commented Dec 18, 2023

Is your feature request related to a problem? Please describe

As of today, there is no dynamic switch to enable or disable the metrics framework. We might need to disable it after running into issues such as high resource consumption by the framework or exceptions while processing or exporting metrics data. The metrics framework is currently initialised during node setup.

Within the scope of this discussion, the expected state when the metrics framework is disabled is that no metrics are emitted. There are different ways to achieve this state, which are discussed below.

Describe the solution you'd like

Approach no. 1

One option is to put the counter logic behind a common flag at the framework level. We can create a wrapper around the counters that uses a no-op counter instance when the flag is disabled, temporarily stopping counters from incrementing; during this period, metrics would still be emitted with their existing values. However, re-enabling the flag after a period of disablement would cause the counters to resume incrementing from their previous values, which leads to inaccurate metric values over time.

This can be mitigated by using the delta temporality supported by the OtlpGrpcMetricExporter. At regular intervals, typically every minute, the exporter checks whether any counter values have changed; if a difference is detected, it emits only the changed amounts for the corresponding counters.

Example:
Consider the scenario where the current absolute value of the counter search.query.type.match.count is 5. If, by the end of the minute, this absolute value becomes 8, then the exporter will emit only the counter value of 3 – the difference between the current and previous values.

To summarise: if the common flag is disabled, counters will not be incremented, and since the exporter only emits a metric if its value has changed within the last minute, no metrics would be emitted, which is the desired state. However, the exporter would still execute the delta logic and evaluate whether it needs to emit metrics; we would only save the resources associated with the actual metric export.
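
As a rough illustration, a flag-guarded wrapper could look something like the sketch below (the `Counter` interface and the enable/disable hook are hypothetical, not the actual OpenSearch or OTel APIs):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical counter abstraction; the real OpenSearch metrics interfaces differ.
interface Counter {
    void add(double value);
}

// Wrapper that forwards increments only while the global flag is enabled.
final class SwitchableCounter implements Counter {
    private static final AtomicBoolean METRICS_ENABLED = new AtomicBoolean(true);

    private final Counter delegate;

    SwitchableCounter(Counter delegate) {
        this.delegate = delegate;
    }

    @Override
    public void add(double value) {
        if (METRICS_ENABLED.get()) {
            delegate.add(value); // normal path: increment the real (OTel-backed) counter
        }
        // disabled: behaves like a no-op counter, the increment is dropped
    }

    // Hypothetical hook that a dynamic cluster setting update could call.
    static void setMetricsEnabled(boolean enabled) {
        METRICS_ENABLED.set(enabled);
    }
}
```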

Approach no. 2

Another alternative is to stop the exporter dynamically at runtime, but this capability is not yet available in OpenTelemetry. If we were able to implement it, we would also save the computational power consumed by the exporter.

Approach no. 3

Another way to achieve this state is to initialize a proxy plugin during node boot-up and dynamically enable or disable OTel through it.

Approach no. 4

OTel does not support resetting counter values. Another approach is to manage the metrics registry ourselves instead of letting OTel maintain it. That way we can reset the counter values to 0 at any point when the flag is disabled; when re-enabled, counters start again from 0. This would allow us to emit the absolute values of the counters instead of computing and emitting deltas, as sketched below.
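
A rough sketch of such a framework-side registry (all names are illustrative, not existing OpenSearch classes); the exporter would read the snapshot and emit absolute values on each cycle:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;

// Framework-owned registry: we keep the counter state ourselves instead of OTel.
final class MetricsRegistry {
    private final Map<String, DoubleAdder> counters = new ConcurrentHashMap<>();
    private volatile boolean enabled = true;

    void increment(String name, double value) {
        if (enabled) {
            counters.computeIfAbsent(name, k -> new DoubleAdder()).add(value);
        }
    }

    // Called when the dynamic flag is switched off: drop all accumulated values.
    void disableAndReset() {
        enabled = false;
        counters.clear();
    }

    void enable() {
        enabled = true; // counters restart from 0 on the next increment
    }

    // Snapshot of absolute values, read by the export callback.
    Map<String, Double> snapshot() {
        Map<String, Double> copy = new ConcurrentHashMap<>();
        counters.forEach((name, adder) -> copy.put(name, adder.sum()));
        return copy;
    }
}
```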

Related component

Plugins

Describe alternatives you've considered

No response

Additional context

No response

@atharvasharma61 added the enhancement and untriaged labels Dec 18, 2023
@Gaganjuneja
Contributor

I think in most cases we need the delta value only (AggregationTemporality.DELTA).

@reta, your thoughts here?

@reta
Collaborator

reta commented Dec 19, 2023

@atharvasharma61 @Gaganjuneja I would strongly advocate against this feature (which we have discussed on the initial pull request for metrics integration). Metrics are very often used by SRE teams for alerting purposes, and altering metrics generation or publishing dynamically could have severe consequences.

@Gaganjuneja
Contributor

I agree, you are right that it will shut down the entire monitoring/observability system. But this is only needed in extreme situations where the cluster is going bad because of the metric computation logic or something else, and we want to make a conscious decision to disable this framework for some time despite knowing its adverse impacts.

@reta
Collaborator

reta commented Dec 19, 2023

But this is only needed in extreme situations where the cluster is going bad because of the metric computation

Disable plugin, restart nodes one by one. The consequences of the feature far outweigh any benefits it offers (in my opinion).

@Gaganjuneja
Contributor

@reta, Opting for a rolling restart is not a good option for bigger clusters due to potential side effects. Our discussion has highlighted two primary concerns:

  1. Incorrect metrics: The disabled state may result in inaccurate metrics if increments or updates are not applied. Specifically, OpenTelemetry (OTel) retains the previous metric value during this phase, so a restart resumes from the prior point. This issue is limited to AggregationTemporality.CUMULATIVE. Notably, OTel's recommended low-memory temporality selector uses DELTA for counters and histograms (a configuration sketch follows at the end of this comment). While incorrect metrics may not pose a problem in most cases, up-down counters have limitations, which are clearly outlined here:

https://github.com/open-telemetry/opentelemetry-java/blob/d9f9812d4375a4229caff43bd681c50b7a45776a/sdk/metrics/src/main/java/io/opentelemetry/sdk/metrics/export/AggregationTemporalitySelector.java#L62

  2. Alarms & alerting: This functionality must be disabled in both rolling-restart and metric-disablement scenarios, and should be used with caution.

Providing a convenient control to disable metrics is advantageous when managing multiple clusters, particularly when such actions interfere with business as usual.
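
For reference, a sketch of wiring that delta-preferring (low-memory) temporality selector into the exporter, assuming a recent OpenTelemetry Java SDK (builder method names may differ across SDK versions):

```java
import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.AggregationTemporalitySelector;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;

import java.time.Duration;

final class DeltaExporterSetup {
    static SdkMeterProvider buildMeterProvider() {
        // lowMemory(): DELTA for counters/histograms, CUMULATIVE for up-down
        // counters (the limitation referenced in the link above).
        OtlpGrpcMetricExporter exporter = OtlpGrpcMetricExporter.builder()
            .setAggregationTemporalitySelector(AggregationTemporalitySelector.lowMemory())
            .build();

        return SdkMeterProvider.builder()
            .registerMetricReader(
                PeriodicMetricReader.builder(exporter)
                    .setInterval(Duration.ofMinutes(1)) // export the per-minute deltas
                    .build())
            .build();
    }
}
```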

@reta
Collaborator

reta commented Jan 11, 2024

@reta, Opting for a rolling restart is not a good option for bigger clusters due to potential side effects. Our discussion has highlighted two primary concerns:

@Gaganjuneja I suspect we won't be asking users to go and read Java code to understand the limitations of disabling metrics. To me, as long as the behaviour has side effects that could impact operations in undefined ways (or ways outside of our control), this is a no-go. Certainly, opinions and ideas on how to deal with that are welcome.

@Gaganjuneja
Contributor

I suspect we won't be asking users to go and read Java code to understand the limitations of disabling metrics.

Definitely not. I am referring to the product documentation where all the settings are described: https://opensearch.org/docs/latest/install-and-configure/configuring-opensearch/configuration-system/

@msfroh
Collaborator

msfroh commented Jan 11, 2024

Consider the scenario where the current absolute value of the counter search.query.type.match.count is 5. If, by the end of the minute, this absolute value becomes 8, then the exporter will emit only the counter value of 3 – the difference between the current and previous values.

Doesn't this go against the whole idea of OTel metrics and go back to the abomination that is node stats?

Going forward, shouldn't we abandon monotonically-increasing counters and emit spans related to requests and leave it up to collectors to emit data points for a given interval?

@reta
Collaborator

reta commented Jan 11, 2024

Going forward, shouldn't we abandon monotonically-increasing counters and emit spans related to requests and leave it up to collectors to emit data points for a given interval?

Spans are traces, I think we should not confuse them with metrics

leave it up to collectors to emit data points for a given interval?

I think the problem is that we just cannot "erase all metrics"; the collector will still get metrics that would not be updated. Now here is a fun one: with this setting, you can flip metrics back and forth at any time, and I could only guess what the result would be.

@msfroh
Collaborator

msfroh commented Jan 11, 2024

Spans are traces, I think we should not confuse them with metrics.

Yeah... I really need to wrap my head around the OTel terminology.

The model that I'm familiar with (from many years at Amazon) is one where, at the start of a request (or, more generally, an "operation"), you create a Metrics instance and somehow propagate it through the code of the operation. The exact means of propagation is up to the developer: some folks do it explicitly, adding a Metrics parameter on every method; others store it in a ThreadLocal, which only works if your operation is processed on a single thread; sometimes it's stored in a request context object that gets propagated.

Throughout the code, whenever someone wants to count something, they can call metrics.addCount(String name, double value, Unit<?> unit). Then, at the end of the request, the Metrics instance gets flushed to an out-of-process collector (which usually happens via a frequently-rotated log file, but could also happen via a named pipe or something; if I understand correctly, this is what OTel calls an "exporter"). That collector accumulates all the distinct counters (also called metrics, confusingly) for 1 minute (or whatever sampling period you choose, but in practice it's usually a minute), collecting them in a histogram. Every minute, those histograms are sent off to the time-series metric store (in our case, usually CloudWatch).

The service only accumulates stuff for in-flight requests, and the collector only accumulates one period's worth of data before flushing. No persistent counters.
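
A toy sketch of that per-operation pattern (the Metrics class, addCount, and the collector hook below are illustrative of the model described, not a real library API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Per-request metrics object: created when an operation starts,
// flushed exactly once when it ends. No process-wide persistent counters.
final class Metrics {
    private final Map<String, Double> counts = new ConcurrentHashMap<>();

    // The Unit parameter from the described API is omitted for brevity.
    void addCount(String name, double value) {
        counts.merge(name, value, Double::sum);
    }

    // At the end of the request, hand the accumulated counts to an
    // out-of-process collector (rotated log file, named pipe, exporter, ...).
    void flush(Consumer<Map<String, Double>> collector) {
        collector.accept(Map.copyOf(counts));
    }
}

// Usage inside a request handler:
//   Metrics metrics = new Metrics();
//   metrics.addCount("search.query.type.match.count", 1);
//   // ... handle the request ...
//   metrics.flush(counts -> System.out.println(counts));
```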

I was under the impression that OTel will allow us to do the same thing with OpenSearch. If everything is ephemeral, turning off the metrics framework should just lead to missing datapoints, shouldn't it?

@reta
Collaborator

reta commented Jan 11, 2024

I was under the impression that OTel will allow us to do the same thing with OpenSearch. If everything is ephemeral, turning off the metrics framework should just lead to missing datapoints, shouldn't it?

We don't know what those metrics are used for, right? Is the cluster / node dead? Are there any alerts?

@msfroh
Collaborator

msfroh commented Jan 11, 2024

We don't know what those metrics are used for, right? Is the cluster / node dead? Are there any alerts?

Aha... yeah, I agree that turning off metrics is terrible from an alerting standpoint.

I misunderstood the concern as being able to get the cluster into an invalid state if metrics are disabled.

@Gaganjuneja
Contributor

I still think we should keep a dynamic flag for experimentation and remove it at general availability (GA) to maintain flexibility without disruptive cluster restarts. This would be helpful for handling issues caused by bugs, increased memory footprint, etc. I realized this necessity while conducting performance tests, with other teams testing simultaneously on the same cluster.

@msfroh removed the untriaged label Jan 31, 2024
@msfroh
Collaborator

msfroh commented Jan 31, 2024

I agree with @reta -- if we want to disable metrics during the experimental phase, it should be a static setting (i.e., turn off the feature flag). Adding a dynamic flag (even if we intend to remove it before GA) seems really dangerous without offering much benefit.

@andrross
Member

I still think we should keep a dynamic flag for experimentation and remove it at general availability (GA)

Another option is to maintain this as a separate patch that is applied to your development builds to provide a dynamic switch. That may be a bit cumbersome but I think we've got consensus that we don't want this feature on the main branch, even if it is intended to be temporary.

@reta
Collaborator

reta commented Jan 31, 2024

@Gaganjuneja @atharvasharma61 I am closing this issue as per the conclusions from the maintainers.

@reta closed this as completed Jan 31, 2024