Incredibly many metrics exported from meta and compute nodes #14821
Comments
The number "31072" reminds me of the longevity test, which runs a lot of actors in parallel. If I remember correctly, the total number of actors is exactly 31072. The workload of the longevity test might be a little extreme, but that's what it's designed to do. The other metrics on the compute node are also large because of the big number of actors and tables. For now, I don't have any better idea for reducing the size. Recording metrics at the actor level sounds totally reasonable to me.
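To put that actor count in perspective, here's a rough back-of-envelope sketch. The per-actor metric count used below is an assumed figure for illustration, not measured from RisingWave:

```python
# Back-of-envelope: an actor-id label multiplies every actor-level metric.
# 31072 actors (the longevity test's count mentioned above) with, say,
# 20 actor-labelled metrics (an assumption) already yields over 600k series.
actors = 31072
actor_level_metrics = 20  # assumed figure, for illustration only
series = actors * actor_level_metrics
print(series)  # 621440
```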
In particular, regarding:
It is, until there are too many actors. The number is amplified when there is more than one node and actors are scheduled across them.
Do we have any idea of when it will become a problem? For example, what's the current pressure on our Prometheus? From here I see:
That's true, but at the moment I think the root question is becoming: why are there so many actors? It's possible to aggregate the metrics by fragment before collecting them, although I don't think it's best practice. By definition, an actor is a basic unit of execution. You can think of it as a worker thread in a normal application.
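The fragment-level pre-aggregation idea could be sketched like this. This is a minimal illustration, not RisingWave's actual metrics code; the `actor_to_fragment` mapping and the counter shapes are hypothetical:

```python
# Sketch: collapse per-actor counter values into per-fragment sums before
# export, shrinking label cardinality from O(actors) to O(fragments).
from collections import defaultdict

def aggregate_by_fragment(actor_counters, actor_to_fragment):
    """Sum actor-level counter values grouped by their owning fragment."""
    fragment_counters = defaultdict(int)
    for actor_id, value in actor_counters.items():
        fragment_counters[actor_to_fragment[actor_id]] += value
    return dict(fragment_counters)

# Example: 6 actors belonging to 2 fragments -> only 2 exported series.
actor_counters = {1: 10, 2: 20, 3: 5, 4: 7, 5: 1, 6: 2}
actor_to_fragment = {1: 100, 2: 100, 3: 100, 4: 200, 5: 200, 6: 200}
print(aggregate_by_fragment(actor_counters, actor_to_fragment))
# {100: 35, 200: 10}
```

The trade-off, as noted above, is losing per-actor visibility, which is why it may not be best practice for debugging a single misbehaving actor.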
I'm pretty sure that's not the case. Prometheus's implementation is notorious for huge memory consumption; you can find a lot of criticism online.
I'm fairly sure the problem is that we are not expected to record actor-level metrics with Prometheus, or any other kind of TSDB. Unlike AP systems, they are not made for dealing with high-cardinality data. Quoting from https://prometheus.io/docs/practices/naming/:
So do we have any idea of how high "high-cardinality data" is (is 1k or 10k acceptable)? I'm thinking that the examples (such as user IDs and email addresses) are definitely high cardinality, but the number of actors is a little more debatable. 🤔
But one user does suffer from performance issues with Prometheus (Grafana slowness and missing metrics). They have quite a lot of actors.
Acceptable as long as it won't change over time. That is to say, [0, 10000) forever is alright, but a dynamic range [x, x+10000) where x changes over time isn't. Regarding cardinality: it stands for the number of distinct label values over a considerably long period, which is quite different from other DB systems.
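A tiny sketch of why a sliding label range is worse than a large fixed one: Prometheus keeps a series for every distinct label value it has ever seen in the window, so churn accumulates. The scrape counts and ranges below are made up for illustration:

```python
# Cardinality is the number of distinct label values seen over a long
# window, not at a single instant.
def series_seen(label_values_per_scrape):
    """Count distinct time series accumulated across scrapes."""
    seen = set()
    for values in label_values_per_scrape:
        seen.update(values)
    return len(seen)

# Fixed range [0, 10000): 10000 series no matter how long we scrape.
fixed = [range(0, 10000)] * 5
# Sliding range [x, x+10000): every scrape introduces new label values.
sliding = [range(x, x + 10000) for x in (0, 5000, 10000, 15000, 20000)]

print(series_seen(fixed))    # 10000
print(series_seen(sliding))  # 30000
```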
This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned. |
New progress: #18108 |
Describe the bug
As title. Top ones:
metrics.txt
metrics.compute.txt
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
nightly-20240125-disable-embedded-tracing
Additional context
No response