[BUG] Cluster manager node attempts to calculate indices cache metric #305

sgup432 · 2023-03-16T18:58:20Z

What is the bug?
We saw below errors in PA log in one of the master/clusterManager node.

[2023-03-09T23:02:26,883][WARN ][o.o.p.r.f.a.Metric       ] Looking for metric Cache_FieldData_Size, when it does not exist.
[2023-03-09T23:02:26,890][ERROR][o.o.p.r.s.m.AggregateMetric] RCA: Caught an exception while getting the DB SQL [select IndexName, sum(max) from Cache_FieldData_Size group by IndexName order by sum(max) desc]; [SQLITE_ERROR] SQL error or missing database (no such table: Cache_FieldData_Size)

It attempts to calculate indices cache related metrics which is not present on cluster manager nodes. PA plugin does not write this data into shared memory.

PA plugin write below data in data node but not for cluster manager node. As expected.

^indices/.kibana_1/0
{"current_time":1678604228517}
{"Cache_Query_Hit":0,"Cache_Query_Miss":0,"Cache_Query_Size":0,"Cache_FieldData_Eviction":0,"Cache_FieldData_Size":0,"Cache_Request_Hit":0,"Cache_Request_Miss":0,"Cache_Request_Eviction":0,"Cache_Request_Size":0}$

How can one reproduce the bug?
Steps to reproduce the behavior:

Spin up a opensearch cluster with dedicated cluster manager nodes.
Create some indices and dump some data.
After a while, check PA logs in one of the cluster manager node. Should see above error log.

What is the expected behavior?
We should not calculate indices cache related metrics at cluster manager node. Might require changes around ReaderMetricProcessor.

What is your host/environment?

OS: [e.g. iOS]
Version [e.g. 22]
Plugins

Do you have any screenshots?
If applicable, add screenshots to help explain your problem.

Do you have any additional context?
Add any other context about the problem.

The text was updated successfully, but these errors were encountered:

khushbr · 2023-03-20T20:00:58Z

Essentially, when we have a dedicated cluster manager node, we should skip collecting all metrics along dimension(see
https://opensearch.org/docs/1.0/monitoring-plugins/pa/reference/):
1/ ShardID, IndexName, Operation, ShardRole
2/ ShardID, IndexName
3/ Operation, Exception, Indices, HTTPRespCode, ShardID, IndexName, ShardRole
4/ ThreadPoolType

The Dedicated Cluster Manager is already an overloaded node w.r.t Performance Analyzer RCA, disabling metric collection for a set of metrics will help bring down the overall footprint of the PA-RCA component.

Identify a Dedicated Master Node: The way to achieve a dedicated node type is to mark all other node types as false. In this case, you have to mark all the other nodes as not cluster manager eligible. from See, https://opensearch.org/docs/latest/tuning-your-cluster/cluster/

khushbr · 2023-03-24T22:37:16Z

This would require changing the code logic in ClusterDetailsEventProcessor to categorize the nodes into 3 :Data, Co-located Cluster manager and Dedicated Cluster manager.

In case of Dedicated Cluster manager, the nodes should not be collecting any metrics relevant to shards/index as they contain no data.

https://github.com/opensearch-project/performance-analyzer-rca/blob/main/src/main/java/org/opensearch/performanceanalyzer/reader/ClusterDetailsEventProcessor.java#L232

Tjofil · 2023-03-26T23:37:45Z

As mentioned in the comment above and in #308, these errors can be prevented by limiting certain RCA node executions to certain OS node roles. Note that these propose a framework change, which will not prevent "incorrect" usage of tags and framework itself, and exceptions like ones from the issue description would still be possible. Those scenarios should also be gracefully handled so let this remain a separate issue from 308.

sgup432 · 2023-03-29T23:21:21Z

Mostly agree with the proposal mentioned here ie to not collect these metrics for dedicated cluster manager node.

But we should also consider non dedicated cluster manager nodes as well, right? As it doesn't make here as well.

Identification of cluster manager nodes should be possible in this package as already have mechanisms in place where RCA has domain level info ie which nodes are of cluster manager or data roles. RCA framework itself relies on it.

Tjofil · 2023-03-30T10:05:41Z

@sgup432 you're right. I talked about that in #308. Currently we have to choose between no-effect dedicated Cluster Manager execution if we want correctness, and no execution at non dedicated Cluster Manager nodes if we want performance, #308 proposes framework change in order to achieve both correctness and performance without tradeoffs.

Tjofil · 2023-04-18T16:21:48Z

As the node's decision for which tags to apply comes down to RCA .conf files, there appear to be three main approaches for making this change:

Changing the structure of the locus tags inside .conf files to include hybrid roles.
Setting this tag and the conf file as a whole is already the user's/administrator's responsibility - i.e. choosing a persisting storage type, defining RCA graphs, tuning RCA-specific settings and thresholds and choosing tags for the node directly linked to a certain conf file. With this change, it would include user being aware of the cluster setup (if the CM is dedicated or not) and giving it a corresponding tag. The tag would remain the same if the CM is dedicated and it would be different (hybrid) if it's not. With this change, RCA nodes that shouldn't be executed at dedicated CM nodes, like i.e. HotShardRCA can now safely be marked with LOCUS_DATA_NODE as the hybrid conf file tag would execute these in case of hybrid, and analogously, cluster-manager conf file tag would not execute these in case of dedicated CM node.
Introducing a new conf file for hybrid Opensearch nodes. This would still require the aforementioned new hybrid tag. It would also make the process a bit more automated - no need for explicit set by admin (note: only for default setup), but with a cost of having to change the internal RCA role logic plus having two new conf files with mostly duplicated code to it's dedicated CM counterparts.
Abandoning the tag reliance for conf files and creating the completely internal logic for this. This one gives the most automated solution but requires the greatest framework change by far.

It is important to note that solutions from points 2 and 3 currently have indistinct feasibility and consequences because of this higher level of automation that wasn't implemented nor planned by the original creators of the framework, but left to be hand configured.

sgup432 added bug Something isn't working untriaged labels Mar 16, 2023

Tjofil mentioned this issue Mar 26, 2023

[FEATURE] Change the logic of RCA conditional execution tag system to optimally cover all use-cases #308

Open

praveensameneni removed the untriaged label Mar 30, 2023

khushbr added the v2.8.0 Issues and PRs related to version v2.8.0 label Apr 25, 2023

khushbr assigned Tjofil Apr 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Cluster manager node attempts to calculate indices cache metric #305

[BUG] Cluster manager node attempts to calculate indices cache metric #305

sgup432 commented Mar 16, 2023 •

edited

Loading

khushbr commented Mar 20, 2023 •

edited

Loading

khushbr commented Mar 24, 2023

Tjofil commented Mar 26, 2023 •

edited

Loading

sgup432 commented Mar 29, 2023

Tjofil commented Mar 30, 2023

Tjofil commented Apr 18, 2023 •

edited

Loading

[BUG] Cluster manager node attempts to calculate indices cache metric #305

[BUG] Cluster manager node attempts to calculate indices cache metric #305

Comments

sgup432 commented Mar 16, 2023 • edited Loading

khushbr commented Mar 20, 2023 • edited Loading

khushbr commented Mar 24, 2023

Tjofil commented Mar 26, 2023 • edited Loading

sgup432 commented Mar 29, 2023

Tjofil commented Mar 30, 2023

Tjofil commented Apr 18, 2023 • edited Loading

sgup432 commented Mar 16, 2023 •

edited

Loading

khushbr commented Mar 20, 2023 •

edited

Loading

Tjofil commented Mar 26, 2023 •

edited

Loading

Tjofil commented Apr 18, 2023 •

edited

Loading