Tracking: high cardinality metrics #18108

Closed · 1 of 8 tasks · kwannoel opened this issue Aug 19, 2024 · 12 comments

kwannoel (Contributor) commented Aug 19, 2024

The metrics tracked here need to be fine-tuned because their granularity is too high.
Either they need to be changed from actor level to fragment level, or a backup metric at the MV level has to be provided, so that even if actor/fragment-level metrics are too much to retain, we still have MV-level observability.

github-actions bot added this to the release-2.1 milestone on Aug 19, 2024
fuyufjh (Member) commented Aug 20, 2024

Currently, our metrics are managed by LabelGuardedMetric structs, each corresponding to one time series, i.e. a metric with a specific set of labels. This design allows the metrics to be correctly destructed when an MV is dropped.

In order to turn an actor-level metric into a fragment-level metric, we need to introduce an Arc<LabelGuardedMetric> shared among all the actors in a fragment. In particular, a lock such as Arc<Mutex<LabelGuardedMetric>> should be avoided, because the metric structures provided by the Prometheus client library are already thread-safe.
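
For illustration, here is a minimal sketch of that sharing pattern, not RisingWave's actual code: the prometheus crate's IntCounter stands in for LabelGuardedMetric, and the metric name is made up.

// Minimal sketch: all actors of one fragment share a single thread-safe
// counter through an Arc, so no Mutex is needed. IntCounter stands in for
// LabelGuardedMetric; the metric name is hypothetical.
use std::sync::Arc;
use std::thread;

use prometheus::IntCounter;

fn main() -> prometheus::Result<()> {
    // One time series per fragment instead of one per actor.
    let fragment_rows_in = Arc::new(IntCounter::new(
        "stream_fragment_rows_in",
        "rows received by a fragment (hypothetical metric)",
    )?);

    // Each actor of the fragment holds a clone of the Arc and updates the
    // counter concurrently; the counter is internally atomic.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let metric = Arc::clone(&fragment_rows_in);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    metric.inc();
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(fragment_rows_in.get(), 4_000);
    Ok(())
}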

Another thing worth mentioning is that the avg in a PromQL query such as

f"avg(rate({metric('stream_actor_output_buffer_blocking_duration_ns')}[$__rate_interval])) by (fragment_id, downstream_fragment_id) / 1000000000",

won't work as expected, because stream_actor_output_buffer_blocking_duration_ns will actually become a node-level sum if we simply use the Arc<LabelGuardedMetric> approach. To solve this, perhaps an AVG-specific structure needs to be introduced that calculates the average at collect() time.
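
For concreteness, one possible shape of such an AVG-specific structure could look like the sketch below (hypothetical names, not an existing API in the prometheus crate or in RisingWave): each local actor keeps its own slot, and the mean of the slots is emitted at scrape time.

// Hypothetical sketch of an AVG-specific metric structure.
use std::sync::atomic::{AtomicU64, Ordering};

struct FragmentAvgCounter {
    // One accumulator per actor of the fragment that runs on this node.
    per_actor_ns: Vec<AtomicU64>,
}

impl FragmentAvgCounter {
    fn new(local_actor_count: usize) -> Self {
        Self {
            per_actor_ns: (0..local_actor_count).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    // Hot path: each actor bumps only its own slot, no locking involved.
    fn add(&self, actor_idx: usize, blocking_ns: u64) {
        self.per_actor_ns[actor_idx].fetch_add(blocking_ns, Ordering::Relaxed);
    }

    // Scrape path: would be called from a custom Collector's collect() to emit
    // the mean of the per-actor counters, so that rate() over the emitted
    // value equals the average of the per-actor rates on this node.
    fn value(&self) -> f64 {
        if self.per_actor_ns.is_empty() {
            return 0.0;
        }
        let sum: u64 = self
            .per_actor_ns
            .iter()
            .map(|c| c.load(Ordering::Relaxed))
            .sum();
        sum as f64 / self.per_actor_ns.len() as f64
    }
}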

fuyufjh (Member) commented Aug 22, 2024

Another thing worth mentioning is that the avg in a PromQL query such as

f"avg(rate({metric('stream_actor_output_buffer_blocking_duration_ns')}[$__rate_interval])) by (fragment_id, downstream_fragment_id) / 1000000000",

won't work as expected, because stream_actor_output_buffer_blocking_duration_ns will actually become a node-level sum if we simply use the Arc<LabelGuardedMetric> approach. To solve this, perhaps an AVG-specific structure needs to be introduced that calculates the average at collect() time.

Oh wait, I suddenly realized the approach above doesn't work as expected. For example, assume we have a fragment with parallelism 8:

fragment F:   CN A:  actor-0  actor-1  actor-2
              CN B:  actor-3  actor-4  actor-5
              CN C:  actor-6  actor-7

Here, the avg of these 3 node-level averages is NOT the avg of these 8 actors, because the actors are not evenly distributed. For example, if actor-0 through actor-5 each report a rate of 0 and actor-6 and actor-7 each report 6, the true actor-level average is 12 / 8 = 1.5, while the average of the node-level averages is (0 + 0 + 6) / 3 = 2. The more uneven the distribution, the more severe the problem.


I am now thinking about another approach: splitting the avg into sum/count. The sum can easily be aggregated locally by sharing a Counter. As for the count, which means the parallelism of the fragment, we might need to introduce a new metric and reuse it across all the metrics listed here; see the sketch below.
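
A minimal sketch of the sum/count split, assuming the prometheus crate and made-up metric names (not RisingWave's actual registration code):

// Hypothetical sketch of the sum/count split for one fragment on one node.
use prometheus::{IntCounter, IntGauge, Registry};

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();

    // "sum": one counter per fragment, shared (e.g. via Arc) by all of the
    // fragment's actors on this node and bumped on the hot path.
    let blocking_ns_sum = IntCounter::new(
        "stream_fragment_output_buffer_blocking_duration_ns",
        "total output-buffer blocking time of the fragment's local actors",
    )?;
    registry.register(Box::new(blocking_ns_sum.clone()))?;

    // "count": one gauge per fragment recording how many of its actors run on
    // this node, reusable by every metric that needs the average.
    let local_parallelism = IntGauge::new(
        "stream_fragment_local_parallelism",
        "number of the fragment's actors running on this node",
    )?;
    local_parallelism.set(3); // e.g. 3 of the fragment's 8 actors live here
    registry.register(Box::new(local_parallelism.clone()))?;

    blocking_ns_sum.inc_by(42); // an actor reporting 42 ns of blocking time
    Ok(())
}

On the dashboard side, the average could then be computed as something like sum(rate(<sum metric>[$__rate_interval])) by (fragment_id) / sum(<parallelism metric>) by (fragment_id), assuming the hypothetical names above carry a fragment_id label; this weights every actor equally no matter how the actors are spread across nodes.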

kwannoel (Contributor, Author)

What about having these metrics aggregated at a fragment per worker level, rather than fragment level? Is that easier?

fuyufjh (Member) commented Aug 22, 2024

What about having these metrics aggregated at a fragment per worker level, rather than fragment level? Is that easier?

How can you get other CNs' metrics in the current CN's code? RPC? 🤣

BugenZhao (Member)

we need to introduce an Arc<LabelGuardedMetric> among all the actors in a fragment.

Note that Arc only works within a single process. To avoid dropping the metric unexpectedly, I suppose we still need another label of worker_id for those metrics to differentiate between different worker nodes. Am I understanding correctly?

Regarding the PromQL query, the difference is that we lose the information about how many actors (of the same fragment) there are on each worker node. To avoid biasing the final result, especially when there is any skew in the scheduling of the actors, we need to perform some manual weighted averaging.


I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really a common practice to trade off development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics as a fix for the performance issues.

fuyufjh (Member) commented Aug 22, 2024

I suppose we still need another label of worker_id for those metrics to differentiate between different worker nodes. Am I understanding correctly?

No need, because the collector will add another instance label to the metrics.

I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really a common practice to trade off development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics as a fix for the performance issues.

Agreed. The approach is dirty and ugly, but @arkbriar strongly pushed for it.

BugenZhao (Member)

No need, because the collector will add another instance label to the metrics.

Only in Cloud or k8s deployments, right?

fuyufjh (Member) commented Aug 22, 2024

Only in Cloud or k8s deployments, right?

True, but I believe it's a common practice. Not all systems assign a node_id like RisingWave does, so I suppose they must rely on such an additional label.

arkbriar (Contributor)

I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really a common practice to trade off development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics as a fix for the performance issues.

It is commonly acknowledged that high-cardinality metrics significantly impact the performance and availability of time series databases. If you think it's arbitrary, please provide a solution that keeps the TSDB stable: one of the running compute nodes is emitting a ridiculous 0.6M time series, and I personally am not able to manage that.

fuyufjh (Member) commented Aug 22, 2024

We've had many discussions about this issue. Overall, I vote +1 for it, because our system can indeed support 10k-100k actors per node, while the monitoring stack, especially the time series database, can't hold that many metrics, or can only host them at a much higher cost than RW itself. 😂

Although it's not a perfect solution, I believe it's the best we can do under the circumstances. I'm also trying to minimize how invasive the implementation is; #18108 (comment) tries to offer an approach that works whether the design is enabled or disabled.

fuyufjh (Member) commented Aug 22, 2024

Half a year ago, I raised my concerns and set the requirement aside. Related discussions can be found in the issue above. I now vote +1 for it after serious consideration of the trade-offs.

BugenZhao (Member)

I suppose we still need another label of worker_id for those metrics to differentiate between different worker nodes. Am I understanding correctly?

Update: this is automatically done by Prometheus: https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series
