-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Metrics Framework] Add enable/disable switch to metrics framework #11627
Comments
I think most of the cases we need the delta value only @reta, your thoughts here? |
@atharvasharma61 @Gaganjuneja I would strongly advocate against this feature (which we have discussed on the initial pull request for metrics integration). Metrics are very often used by SRE teams for alerting purposes, and altering metrics generation or publishing dynamically could have severe consequences. |
I agree, You are right it will shut down the entire monitoring/observability system. But this is only needed in the extreme situations where cluster is going bad because of the metric computation logic or something else and we want to take a conscious decision to disable this framework for sometime despite knowing its adverse impacts. |
Disable plugin, restart nodes one by one. The consequences of the feature far outweigh any benefits it offers (in my opinion). |
@reta, Opting for a rolling restart is not a good option for bigger clusters due to potential side effects. Our discussion has highlighted two primary concerns:
Providing convenient control to disable metrics proves advantageous when managing multiple clusters, particularly when such actions interfere with business-as-usual. |
@Gaganjuneja I suspect are won't be asking users to go and read Java code to understand the limitation of the disabling metrics. To me, as far as the behaviour has side effects that could impact operations in underdefined way (or the ways out of our control) - this is no go. Certainly, opinions and ideas how to deal with that are welcomed. |
Definitely not, I am referring here to the product documentation we all the settings are described. https://opensearch.org/docs/latest/install-and-configure/configuring-opensearch/configuration-system/ |
Doesn't this go against the whole idea of OTel metrics and go back to the abomination that is node stats? Going forward, shouldn't we abandon monotonically-increasing counters and emit spans related to requests and leave it up to collectors to emit data points for a given interval? |
Spans are traces, I think we should not confuse them with metrics
I think the problem is we just cannot "erase all metrics", collector will still get metrics that would not be updated. Now here is fun one, with this setting, you can flip metrics over any time back and forth, I could only guess what the result would be. |
Yeah... I really need to wrap my head around the OTel terminology. The model that I'm familiar with (from many years at Amazon) is one where at the start of a request (or more generally an "operation"), you create a I was under the impression that OTel will allow us to do the same thing with OpenSearch. If everything is ephemeral, turning off the metrics framework should just lead to missing datapoints, shouldn't it? |
We don't know what those metrics are used for, right? Is the cluster / node dead? Are there any alerts? |
Aha... yeah, I agree that turning off metrics is terrible from an alerting standpoint. I misunderstood the concern as being able to get the cluster into an invalid state if metrics are disabled. |
I still think, keep a dynamic flag for experimentation, remove it on general availability (GA) to maintain flexibility without disruptive cluster restarts. This will be helpful in issue handling because of some bugs, increased memory foot prints etc. I realized this necessity while conducting performance tests, with simultaneous testing by other teams on the same cluster. |
I agree with @reta -- if we want to disable metrics during the experimental phase, it should be a static setting (i.e turn off the feature flag). Adding a dynamic flag (even if we intend to remove it before GA) seems really dangerous without offering much benefit. |
Another option is to maintain this as a separate patch that is applied to your development builds to provide a dynamic switch. That may be a bit cumbersome but I think we've got consensus that we don't want this feature on the main branch, even if it is intended to be temporary. |
@Gaganjuneja @atharvasharma61 I am closing this issue as per conclusions from maintainers . |
Is your feature request related to a problem? Please describe
As of today there is no dynamic switch to enable/disable the metrics framework. It is possible that we might require to disable the metrics framework after running into potential issues like high resource consumptions by the framework or facing exceptions while processing or exporting the metrics data. Metrics framework is currently initialised during the Node setup.
Under the scope of this discussion, no metrics being emitted is the expected state when the metric framework is disabled. There are different ways to achieve this state, which are discussed below.
Describe the solution you'd like
Approach no. 1
One such option is to put the counter logic behind a common flag at the framework level. We can create a wrapper around the counters which will use no-op counter instance if flag is disabled. This will temporarily stop the incrementation of counters. However, re-enabling the flag after a period of disablement would cause the counters to resume incrementing from their previous values. This behaviour leads to inaccurate metric values over time.
When the global flag is disabled, the metrics framework can enter a passive state, temporarily stopping the incrementation of counters. During this period, metrics would still be emitted with the existing values.
This can be mitigated by using the delta functionality provided by the OtlpGrpcMetricExporter. At regular intervals, typically every minute, the system examines whether there has been any alteration in the counter values. If a difference is detected, the exporter emits only the varying values corresponding to the respective counters.
Example:
Consider the scenario where the current absolute value of the counter
search.query.type.match.count
is 5. If, by the end of the minute, this absolute value becomes 8, then the exporter will emit only the counter value of 3 – the difference between the current and previous values.To summarise, if the common flag is disabled, then counters will not be incremented and since the exporter only emits the metric if their value changes within the last minute, no metrics would be emitted which is a desired state. However it is worth noting that exporter would still be executing the delta logic and evaluating whether it needs to emit the metrics or not. We would only be saving the resources associated with to the exact metric export logic.
Approach no. 2
Another alternative is to stop the exporter dynamically during the runtime but this feature is not yet available in the open telemetry. If we are able to implement it, we would be able to save the computational power consumed by the exporter.
Approach no. 3
Another way to achieve this state is through initializing a proxy plugin during the node boot up and dynamically enable or disable Otel.
Approach no. 4
Otel does not provide the support to reset the counter values. Another approach suggests that we can manage the metrics registry on our side instead of Otel maintaining it. This way we can reset the counter values to 0 if flag is disabled, at any point of time. When re-enabled, counters will start again from 0. This would allow us to emit the absolute values of the counters, instead of computing delta and emitting it.
Related component
Plugins
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: