
Incredibly many metrics exported from meta and compute nodes #14821

Closed
arkbriar opened this issue Jan 26, 2024 · 12 comments

@arkbriar
Contributor

Describe the bug

As the title says. Top ones:

  1. From meta (metrics.txt):
31072 actor_info
2994 storage_version_stats
1206 table_info
  2. From compute (metrics.compute.txt):
18229 stream_actor_output_buffer_blocking_duration_ns
11568 block_efficiency_histogram_bucket
10375 stream_actor_input_buffer_blocking_duration_ns
10375 stream_actor_in_record_cnt
10359 stream_actor_out_record_cnt
4698 state_store_sst_store_block_request_counts
4576 stream_join_barrier_align_duration_bucket
3294 stream_executor_row_count
3132 state_store_iter_scan_key_counts
2929 stream_join_matched_join_keys_bucket
2467 stream_memory_usage
2467 lru_evicted_watermark_time_ms
2349 state_store_read_req_positive_but_non_exist_counts
2349 state_store_read_req_check_bloom_filter_counts
2349 state_store_read_req_bloom_filter_positive_counts

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20240125-disable-embedded-tracing

Additional context

No response

@arkbriar arkbriar added the type/bug Something isn't working label Jan 26, 2024
@github-actions github-actions bot added this to the release-1.7 milestone Jan 26, 2024
@fuyufjh fuyufjh modified the milestones: release-1.7, release-1.8 Mar 6, 2024
@fuyufjh fuyufjh self-assigned this Mar 6, 2024
@fuyufjh
Member

fuyufjh commented Mar 7, 2024

31072 actor_info

The number "31072" reminds me of the case where the longevity test runs a lot of actors in parallel. If I remember correctly, the total number of actors is exactly 31072. The workload of the longevity test might be a bit extreme, but that is what it's designed to do.

The other metrics in CN are also large because of the big number of actors and tables.

For now, I don't have any better ideas to reduce the size. Recording metrics at actor level sounds totally reasonable to me.

@fuyufjh
Member

fuyufjh commented Mar 7, 2024

Particularly regarding actor_info: this is a "dummy" metric that stores actor information as labels, and its value is always 1. In the Prometheus data model, this seems to be the only way to export a table. Without it, one would need to access the RisingWave psql endpoint, which might not always be available.
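The pattern described above is the Prometheus "info metric" convention: a gauge pinned at 1 whose labels carry the actual data. A minimal sketch of what such an exporter emits (the metric name matches the report above, but the label set and helper are illustrative, not RisingWave's exact schema):

```python
# Sketch of the "info metric" pattern: a series whose value is always 1
# and whose labels carry the payload (actor -> fragment -> node mapping).
# Label names here are illustrative, not RisingWave's exact schema.

def render_actor_info(actors):
    """Render one Prometheus exposition-format line per actor."""
    lines = ["# TYPE actor_info gauge"]
    for actor_id, fragment_id, node in actors:
        labels = (
            f'actor_id="{actor_id}",'
            f'fragment_id="{fragment_id}",'
            f'compute_node="{node}"'
        )
        lines.append(f"actor_info{{{labels}}} 1")
    return "\n".join(lines)

# One time series per actor: with ~31k actors, this single metric alone
# contributes ~31k series to every scrape, matching the count reported above.
print(render_actor_info([(1, 10, "cn-0"), (2, 10, "cn-1")]))
```

The cost is visible immediately: the series count of this metric grows linearly with the number of actors, regardless of its constant value.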

@fuyufjh fuyufjh removed the type/bug Something isn't working label Mar 7, 2024
@fuyufjh fuyufjh removed this from the release-1.8 milestone Mar 7, 2024
@arkbriar
Contributor Author

arkbriar commented Mar 7, 2024

Recording metrics at actor level sounds totally reasonable to me.

It is, until there are too many actors. The number amplifies when there is more than one node and the actors are scheduled across them.

@xxchan
Member

xxchan commented Mar 8, 2024

Recording metrics at actor level sounds totally reasonable to me.

It is, until there are too many actors. The number amplifies when there is more than one node and the actors are scheduled across them.

Do we have any idea of when it will become a problem? E.g., what's the current pressure on our Prometheus?

From here I see:

A typical Prometheus server can handle on the order of 10 million metrics before you start to see limitations

@fuyufjh
Member

fuyufjh commented Mar 18, 2024

Recording metrics at actor level sounds totally reasonable to me.

It is, until there are too many actors. The number amplifies when there is more than one node and the actors are scheduled across them.

That's true, but at the moment I think the root problem is becoming: why are there so many actors?

It's possible to aggregate the metrics by fragment before collecting them, although I don't think that's best practice. By definition, an actor is the basic unit of execution. You can think of it as a worker thread in a normal application.
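The fragment-level aggregation mentioned above can be sketched as a simple roll-up before export: per-actor counters are summed under their fragment, trading per-actor visibility for far fewer series. The mapping and metric shape below are illustrative, not RisingWave's actual exporter code:

```python
from collections import defaultdict

# Sketch of pre-aggregation: roll actor-level counters up to their fragment
# before exposing them to Prometheus. Data below is made up for illustration.
actor_to_fragment = {1: 10, 2: 10, 3: 11}
actor_out_records = {1: 500, 2: 700, 3: 42}  # per-actor counter samples

fragment_out_records = defaultdict(int)
for actor_id, value in actor_out_records.items():
    fragment_out_records[actor_to_fragment[actor_id]] += value

# Three actor-level series collapse into two fragment-level series.
print(dict(fragment_out_records))  # -> {10: 1200, 11: 42}
```

Since many actors belong to one fragment, the series count drops roughly by the parallelism factor, at the cost of not being able to spot a single misbehaving actor.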

@arkbriar
Contributor Author

A typical Prometheus server can handle on the order of 10 million metrics before you start to see limitations

I'm pretty sure that's not the case. Prometheus' implementation is notorious for its huge memory consumption. You can find plenty of criticism online.

@arkbriar
Contributor Author

That's true, but at the moment I think the root problem is becoming: why are there so many actors?

I'm sure the problem is that we are not supposed to record actor-level metrics with Prometheus, or any other kind of TSDB. Unlike AP systems, they are not made for dealing with high-cardinality data.

Quote from https://prometheus.io/docs/practices/naming/

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

@xxchan
Member

xxchan commented Mar 18, 2024

So do we have any idea how high "high cardinality" is (is 1k or 10k acceptable)? The examples given (such as user IDs and email addresses) are definitely high cardinality, but the number of actors seems a little more debatable. 🤔

@xxchan
Member

xxchan commented Mar 18, 2024

But one user does suffer from performance issues with Prometheus (Grafana slowness and missing metrics).

They have quite a lot of actors:

select worker_id, count(*) from rw_actors a, rw_parallel_units p where a.parallel_unit_id = p.id group by p.worker_id;

worker_id|count|
---------+-----+
    26003|15374|
    26002|11664|
    26004| 7920|

@arkbriar
Contributor Author

arkbriar commented Mar 19, 2024

Is 1k or 10k acceptable?

Acceptable, as long as it doesn't change over time. That is to say, [0, 10000) forever is alright, but a dynamic range [x, x+10000) where x changes over time isn't.

Regarding cardinality: it stands for the number of distinct label values over a considerably long period, which is quite different from other DB systems.
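The point above can be made concrete with a little arithmetic: a TSDB keeps a series for every label value it has ever seen within its retention window, so a shifting range of IDs keeps accumulating series even though the number of live actors stays the same. A hypothetical sketch:

```python
# Why a *moving* range of label values is worse than a fixed one:
# the TSDB retains a series for every label value ever seen in the window.
# The scrape windows below are hypothetical, for illustration only.

def total_series(windows):
    """Count distinct label values seen across all scrape windows."""
    seen = set()
    for ids in windows:
        seen.update(ids)
    return len(seen)

static = [range(0, 10_000)] * 5  # [0, 10000) forever
churned = [range(i * 10_000, (i + 1) * 10_000) for i in range(5)]  # shifting

print(total_series(static))   # -> 10000
print(total_series(churned))  # -> 50000: 5x the series for the same live count
```

This is why actor IDs that get reassigned on every scale-out or recovery are a worse label than, say, a stable fragment ID.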


This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

@fuyufjh
Member

fuyufjh commented Aug 22, 2024

New progress: #18108
