Tracking: high cardinality metrics #18108

Closed · 1 of 8 tasks · kwannoel opened this issue Aug 19, 2024 · 12 comments

kwannoel (Contributor) commented Aug 19, 2024

The metrics tracked here need to be fine-tuned because their granularity is too high.
Either they need to be changed from actor level to fragment level, or a backup metric at the MV level has to be provided, so that even if actor/fragment-level metrics are too much to retain, we still have MV-level observability.

github-actions bot added this to the release-2.1 milestone on Aug 19, 2024
fuyufjh (Member) commented Aug 20, 2024

Currently, our metrics are managed by LabelGuardedMetric structs, each corresponding to one time series, i.e. a metric with a specific set of labels. This design allows the metrics to be correctly destructed when an MV is dropped.

In order to turn an actor-level metric into a fragment-level metric, we need to introduce an Arc<LabelGuardedMetric> shared among all the actors in a fragment. In particular, a lock such as Arc<Mutex<LabelGuardedMetric>> should be avoided, because the metric structures provided by the Prometheus client library are already thread-safe.
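
For illustration, here is a minimal sketch of that sharing pattern, not RisingWave's actual code: the prometheus crate's IntCounter stands in for LabelGuardedMetric, and the metric name is made up.

// Minimal sketch: all actors of one fragment share a single thread-safe
// counter through an Arc, so no Mutex is needed. IntCounter stands in for
// LabelGuardedMetric; the metric name is hypothetical.
use std::sync::Arc;
use std::thread;

use prometheus::IntCounter;

fn main() -> prometheus::Result<()> {
    // One time series per fragment instead of one per actor.
    let fragment_rows_in = Arc::new(IntCounter::new(
        "stream_fragment_rows_in",
        "rows received by a fragment (hypothetical metric)",
    )?);

    // Each actor of the fragment holds a clone of the Arc and updates the
    // counter concurrently; the counter is internally atomic.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let metric = Arc::clone(&fragment_rows_in);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    metric.inc();
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(fragment_rows_in.get(), 4_000);
    Ok(())
}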

Another thing worth mentioning is that the avg in a PromQL query such as

f"avg(rate({metric('stream_actor_output_buffer_blocking_duration_ns')}[$__rate_interval])) by (fragment_id, downstream_fragment_id) / 1000000000",

won't work as expected, because stream_actor_output_buffer_blocking_duration_ns will actually become a node-level sum if we simply use the Arc<LabelGuardedMetric> approach. To solve this, perhaps an AVG-specific structure needs to be introduced that calculates the average at collect() time.
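
For concreteness, one possible shape of such an AVG-specific structure could look like the sketch below (hypothetical names, not an existing API in the prometheus crate or in RisingWave): each local actor keeps its own slot, and the mean of the slots is emitted at scrape time.

// Hypothetical sketch of an AVG-specific metric structure.
use std::sync::atomic::{AtomicU64, Ordering};

struct FragmentAvgCounter {
    // One accumulator per actor of the fragment that runs on this node.
    per_actor_ns: Vec<AtomicU64>,
}

impl FragmentAvgCounter {
    fn new(local_actor_count: usize) -> Self {
        Self {
            per_actor_ns: (0..local_actor_count).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    // Hot path: each actor bumps only its own slot, no locking involved.
    fn add(&self, actor_idx: usize, blocking_ns: u64) {
        self.per_actor_ns[actor_idx].fetch_add(blocking_ns, Ordering::Relaxed);
    }

    // Scrape path: would be called from a custom Collector's collect() to emit
    // the mean of the per-actor counters, so that rate() over the emitted
    // value equals the average of the per-actor rates on this node.
    fn value(&self) -> f64 {
        if self.per_actor_ns.is_empty() {
            return 0.0;
        }
        let sum: u64 = self
            .per_actor_ns
            .iter()
            .map(|c| c.load(Ordering::Relaxed))
            .sum();
        sum as f64 / self.per_actor_ns.len() as f64
    }
}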

fuyufjh (Member) commented Aug 22, 2024

Another thing worth mentioning is that the avg in a PromQL query such as

f"avg(rate({metric('stream_actor_output_buffer_blocking_duration_ns')}[$__rate_interval])) by (fragment_id, downstream_fragment_id) / 1000000000",

won't work as expected, because stream_actor_output_buffer_blocking_duration_ns will actually become a node-level sum if we simply use the Arc<LabelGuardedMetric> approach. To solve this, perhaps an AVG-specific structure needs to be introduced that calculates the average at collect() time.

Oh wait, I suddenly realized the approach above doesn't work as expected. For example, assume we have a fragment with parallelism 8:

fragment F:   CN A:  actor-0  actor-1  actor-2
              CN B:  actor-3  actor-4  actor-5
              CN C:  actor-6  actor-7

Here, the avg of these 3 node-level averages is NOT the avg of these 8 actors, because the actors are not evenly distributed. For example, if actor-0 through actor-5 each report a rate of 0 and actor-6 and actor-7 each report 6, the true actor-level average is 12 / 8 = 1.5, while the average of the node-level averages is (0 + 0 + 6) / 3 = 2. The more uneven the distribution, the more severe the problem.


I am now thinking about another approach: splitting the avg into sum/count. The sum can easily be aggregated locally by sharing a Counter. As for the count, which means the parallelism of the fragment, we might need to introduce a new metric and reuse it across all the metrics listed here; see the sketch below.
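
A minimal sketch of the sum/count split, assuming the prometheus crate and made-up metric names (not RisingWave's actual registration code):

// Hypothetical sketch of the sum/count split for one fragment on one node.
use prometheus::{IntCounter, IntGauge, Registry};

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();

    // "sum": one counter per fragment, shared (e.g. via Arc) by all of the
    // fragment's actors on this node and bumped on the hot path.
    let blocking_ns_sum = IntCounter::new(
        "stream_fragment_output_buffer_blocking_duration_ns",
        "total output-buffer blocking time of the fragment's local actors",
    )?;
    registry.register(Box::new(blocking_ns_sum.clone()))?;

    // "count": one gauge per fragment recording how many of its actors run on
    // this node, reusable by every metric that needs the average.
    let local_parallelism = IntGauge::new(
        "stream_fragment_local_parallelism",
        "number of the fragment's actors running on this node",
    )?;
    local_parallelism.set(3); // e.g. 3 of the fragment's 8 actors live here
    registry.register(Box::new(local_parallelism.clone()))?;

    blocking_ns_sum.inc_by(42); // an actor reporting 42 ns of blocking time
    Ok(())
}

On the dashboard side, the average could then be computed as something like sum(rate(<sum metric>[$__rate_interval])) by (fragment_id) / sum(<parallelism metric>) by (fragment_id), assuming the hypothetical names above carry a fragment_id label; this weights every actor equally no matter how the actors are spread across nodes.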

kwannoel (Contributor, Author)

What about having these metrics aggregated at a fragment per worker level, rather than fragment level? Is that easier?

fuyufjh (Member) commented Aug 22, 2024

What about having these metrics aggregated at a fragment per worker level, rather than fragment level? Is that easier?

How can you get other CNs' metrics in the current CN's code? RPC? 🤣

BugenZhao (Member)

we need to introduce an Arc<LabelGuardedMetric> among all the actors in a fragment.

Note that Arc only works within a single process. To avoid dropping the metric unexpectedly, I suppose we still need another label of worker_id for those metrics to differentiate between different worker nodes. Am I understanding correctly?

Regarding the PromQL query, the difference is that we lose the information about how many actors (of the same fragment) there are on each worker node. To avoid biasing the final result, especially when there is any skew in the scheduling of the actors, we need to perform some manual weighted averaging.


I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really a common practice to trade off development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics as a fix for the performance issues.

fuyufjh (Member) commented Aug 22, 2024

I suppose we still need another label of worker_id for those metrics to differentiate between different worker nodes. Am I understanding correctly?

No need, because the collector will add another instance label to the metrics.

I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really a common practice to trade off development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics as a fix for the performance issues.

Agreed. The approach is dirty and ugly, but @arkbriar strongly pushed for it.

BugenZhao (Member)

No need, because the collector will add another instance label to the metrics.

Only in Cloud or k8s deployments, right?

fuyufjh (Member) commented Aug 22, 2024

Only in Cloud or k8s deployments, right?

True, but I believe it's a common practice. Not all systems assign a node_id like RisingWave does, so I suppose they must rely on such an additional label.

arkbriar (Contributor)

I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really a common practice to trade off development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics as a fix for the performance issues.

It is commonly acknowledged that high-cardinality metrics significantly impact the performance and availability of time series databases. If you think it's arbitrary, please provide a solution that keeps the TSDB stable: one of the running compute nodes is emitting a ridiculous 0.6M time series, and I personally am not able to manage that.

fuyufjh (Member) commented Aug 22, 2024

We've had many discussions about this issue. Overall, I vote +1 for it, because our system can indeed support 10k-100k actors per node, while the monitoring stack, especially the time series database, can't hold that many metrics, or can only host them at a much higher cost than RW itself. 😂

Although it's not a perfect solution, I believe it's the best we can do under the circumstances. I'm also trying to minimize how invasive the implementation is; #18108 (comment) tries to offer an approach that works whether the design is enabled or disabled.

fuyufjh (Member) commented Aug 22, 2024

Half a year ago, I raised my concerns and set the requirement aside. Related discussions can be found in the issue above. I now vote +1 for it after serious consideration of the trade-offs.

BugenZhao (Member)

I suppose we still need another label of worker_id for those metrics to differentiate between different worker nodes. Am I understanding correctly?

Update: this is automatically done by Prometheus: https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series
