Tracking: high cardinality metrics #18108
Currently, our metrics are managed by […]. In order to turn an actor-level metric into a fragment-level one, we need to introduce a […]. Another thing worth mentioning is that the […] won't work as expected, because the […].
Oh, wait, I suddenly realized the approach above doesn't work as expected. For example, assume we have a fragment with a parallelism of 8, whose actors are scheduled onto 3 compute nodes.
Here, the average of the 3 node-level averages is NOT the average over the 8 actors, because the actors are not evenly distributed; the more uneven the distribution, the more severe the problem. I am now thinking about another approach: splitting the […].
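To make the bias concrete, here is a minimal sketch; the numbers, metric name, and label name below are made up purely for illustration.

```promql
# Hypothetical numbers: node A hosts 6 actors with a node-level average of 10;
# nodes B and C host 1 actor each with a node-level average of 40.
#   avg of node-level avgs = (10 + 40 + 40) / 3 = 30
#   true per-actor average = (6*10 + 1*40 + 1*40) / 8 = 17.5
# A naive query over the pre-averaged node-level series therefore over-weights
# the lightly loaded nodes whenever the actor placement is skewed:
avg by (fragment_id) (stream_fragment_metric_node_avg)
```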
What about having these metrics aggregated at a […]?
How can you get other CNs' metrics in the current CN's code? RPC? 🤣
Note that […]. Regarding the PromQL query, the difference is that we lose the information about how many actors (in the same fragment) there are in each worker node. To avoid bias in the final result, especially when there's skew in the scheduling of the actors, we need to perform some manual weighted averaging. I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really common practice to trade development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics for the sake of performance.
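For what it's worth, one way to do the manual weighted averaging in PromQL is to export a per-node sum and a per-node actor count instead of a pre-computed average, then divide the fragment-wide totals. The metric names below are placeholders, not actual RisingWave metrics.

```promql
# Each compute node exports, per fragment, the sum over its local actors and the
# number of local actors; dividing the fragment-wide totals weights every node by
# its actor count, giving the unbiased per-actor average.
  sum by (fragment_id) (stream_fragment_metric_node_sum)
/
  sum by (fragment_id) (stream_fragment_metric_node_actor_count)
```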
No need, because the collector will add another […].
Agree. The approach is dirty and ugly, but @arkbriar strongly pushed for it.
Only in Cloud or k8s deployments, right?
True, but I believe it's a common practice. Not all systems assign […].
It is commonly acknowledged that high-cardinality metrics will significantly impact the performance and availability of time-series databases. If you think it's arbitrary, please provide a solution to keep the TSDB stable: there is a ridiculous amount of time series, 0.6M, being emitted from one of the running compute nodes, and I personally am not able to manage it.
We've had many discussions about this issue. Overall, I vote +1 for it because our system can indeed support 10k-100k actors per node, while the monitoring stack, especially time-series databases, can't hold so many metrics, or can only do so at a much higher cost than RW itself. 😂 Although it's not a perfect solution, I believe it's the best we can do under the circumstances. I'm also trying to minimize the invasiveness of the implementation; #18108 (comment) tries to offer a way that works whether the design is enabled or disabled.
Half a year ago, I raised my concerns and put the requirement aside. Related discussions can be found in the above issue. I vote +1 for it now after serious consideration and trade-offs.
Update: this is automatically done by Prometheus: https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series
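As a small illustration of why no extra worker-node label is needed (the metric name below is a placeholder): Prometheus attaches the auto-generated `instance` label at scrape time, so per-node series stay distinguishable and can still be re-aggregated on the Prometheus side.

```promql
# Per-node, fragment-level series are already distinguished by the scrape-time
# `instance` label, so a cluster-wide roll-up only needs to drop it:
sum without (instance) (stream_fragment_metric_node_sum)
```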
The metrics tracked here need to be fine-tuned, as they have too fine a granularity.
Either they need to be changed from the actor level to the fragment level, or a backup metric at the MV level has to be provided, so that even if actor/fragment-level metrics are too many, we still have MV-level observability.
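As a sketch of the MV-level backup (the metric and label names are assumptions, not existing RisingWave identifiers): the coarse series could either be exported directly from the compute node, or, while the fine-grained series still exist, be produced as a Prometheus recording rule along these lines.

```promql
# Collapse actor/fragment-level series into one series per materialized view,
# keeping coarse MV-level observability even if the fine-grained series are dropped.
sum by (materialized_view_id) (stream_actor_metric)
```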