
Incredibly many metrics exported from meta and compute nodes #14821

Closed
arkbriar opened this issue Jan 26, 2024 · 12 comments

@arkbriar
Contributor

Describe the bug

As the title says. Top ones:

  1. From meta (metrics.txt):
31072 actor_info
2994 storage_version_stats
1206 table_info
  2. From compute (metrics.compute.txt):
18229 stream_actor_output_buffer_blocking_duration_ns
11568 block_efficiency_histogram_bucket
10375 stream_actor_input_buffer_blocking_duration_ns
10375 stream_actor_in_record_cnt
10359 stream_actor_out_record_cnt
4698 state_store_sst_store_block_request_counts
4576 stream_join_barrier_align_duration_bucket
3294 stream_executor_row_count
3132 state_store_iter_scan_key_counts
2929 stream_join_matched_join_keys_bucket
2467 stream_memory_usage
2467 lru_evicted_watermark_time_ms
2349 state_store_read_req_positive_but_non_exist_counts
2349 state_store_read_req_check_bloom_filter_counts
2349 state_store_read_req_bloom_filter_positive_counts

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20240125-disable-embedded-tracing

Additional context

No response

@arkbriar arkbriar added the type/bug Something isn't working label Jan 26, 2024
@github-actions github-actions bot added this to the release-1.7 milestone Jan 26, 2024
@fuyufjh fuyufjh modified the milestones: release-1.7, release-1.8 Mar 6, 2024
@fuyufjh fuyufjh self-assigned this Mar 6, 2024
@fuyufjh
Member

fuyufjh commented Mar 7, 2024

31072 actor_info

The number "31072" reminds me of the case where the longevity test runs a lot of actors in parallel. If I remember correctly, the total number of actors is exactly 31072. The workload of the longevity test might be a bit extreme, but that is what it's designed to do.

The other metrics in CN are also large because of the big number of actors and tables.

For now, I don't have any better ideas to reduce the size. Recording metrics at actor level sounds totally reasonable to me.

@fuyufjh
Member

fuyufjh commented Mar 7, 2024

Particularly regarding actor_info: this is a "dummy" metric that stores actor information as labels, and its value is always 1. In the Prometheus data model, this seems to be the only way to export a table. Without it, one would need to access the RisingWave psql endpoint, which might not always be available.
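The pattern described above is the Prometheus "info metric" convention: a gauge pinned at 1 whose labels carry the actual data. A minimal sketch of what such an exporter emits (the metric name matches the report above, but the label set and helper are illustrative, not RisingWave's exact schema):

```python
# Sketch of the "info metric" pattern: a series whose value is always 1
# and whose labels carry the payload (actor -> fragment -> node mapping).
# Label names here are illustrative, not RisingWave's exact schema.

def render_actor_info(actors):
    """Render one Prometheus exposition-format line per actor."""
    lines = ["# TYPE actor_info gauge"]
    for actor_id, fragment_id, node in actors:
        labels = (
            f'actor_id="{actor_id}",'
            f'fragment_id="{fragment_id}",'
            f'compute_node="{node}"'
        )
        lines.append(f"actor_info{{{labels}}} 1")
    return "\n".join(lines)

# One time series per actor: with ~31k actors, this single metric alone
# contributes ~31k series to every scrape, matching the count reported above.
print(render_actor_info([(1, 10, "cn-0"), (2, 10, "cn-1")]))
```

The cost is visible immediately: the series count of this metric grows linearly with the number of actors, regardless of its constant value.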

@fuyufjh fuyufjh removed the type/bug Something isn't working label Mar 7, 2024
@fuyufjh fuyufjh removed this from the release-1.8 milestone Mar 7, 2024
@arkbriar
Contributor Author

arkbriar commented Mar 7, 2024

Recording metrics at actor level sounds totally reasonable to me.

It is, until there are too many actors. The number amplifies when there is more than one node and the actors are scheduled across them.

@xxchan
Member

xxchan commented Mar 8, 2024

Recording metrics at actor level sounds totally reasonable to me.

It is, until there are too many actors. The number amplifies when there is more than one node and the actors are scheduled across them.

Do we have any idea of when it will become a problem? E.g., what's the current pressure on our Prometheus?

From here I see:

A typical Prometheus server can handle on the order of 10 million metrics before you start to see limitations

@fuyufjh
Member

fuyufjh commented Mar 18, 2024

Recording metrics at actor level sounds totally reasonable to me.

It is, until there are too many actors. The number amplifies when there is more than one node and the actors are scheduled across them.

That's true, but at the moment I think the root problem is becoming: why are there so many actors?

It's possible to aggregate the metrics by fragment before collecting them, although I don't think that's best practice. By definition, an actor is the basic unit of execution. You can think of it as a worker thread in a normal application.
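The fragment-level aggregation mentioned above can be sketched as a simple roll-up before export: per-actor counters are summed under their fragment, trading per-actor visibility for far fewer series. The mapping and metric shape below are illustrative, not RisingWave's actual exporter code:

```python
from collections import defaultdict

# Sketch of pre-aggregation: roll actor-level counters up to their fragment
# before exposing them to Prometheus. Data below is made up for illustration.
actor_to_fragment = {1: 10, 2: 10, 3: 11}
actor_out_records = {1: 500, 2: 700, 3: 42}  # per-actor counter samples

fragment_out_records = defaultdict(int)
for actor_id, value in actor_out_records.items():
    fragment_out_records[actor_to_fragment[actor_id]] += value

# Three actor-level series collapse into two fragment-level series.
print(dict(fragment_out_records))  # -> {10: 1200, 11: 42}
```

Since many actors belong to one fragment, the series count drops roughly by the parallelism factor, at the cost of not being able to spot a single misbehaving actor.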

@arkbriar
Contributor Author

A typical Prometheus server can handle on the order of 10 million metrics before you start to see limitations

I'm pretty sure that's not the case. Prometheus' implementation is notorious for its huge memory consumption. You can find plenty of criticism online.

@arkbriar
Contributor Author

That's true, but at the moment I think the root problem is becoming: why are there so many actors?

I'm sure the problem is that we are not supposed to record actor-level metrics with Prometheus, or any other kind of TSDB. Unlike AP systems, they are not made for dealing with high-cardinality data.

Quote from https://prometheus.io/docs/practices/naming/

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

@xxchan
Member

xxchan commented Mar 18, 2024

So do we have any idea how high "high cardinality" is (is 1k or 10k acceptable)? The examples given (such as user IDs and email addresses) are definitely high cardinality, but the number of actors seems a little more debatable. 🤔

@xxchan
Member

xxchan commented Mar 18, 2024

But one user does suffer from performance issues with Prometheus (Grafana slowness and missing metrics).

They have quite a lot of actors:

select worker_id, count(*) from rw_actors a, rw_parallel_units p where a.parallel_unit_id = p.id group by p.worker_id;

worker_id|count|
---------+-----+
    26003|15374|
    26002|11664|
    26004| 7920|

@arkbriar
Contributor Author

arkbriar commented Mar 19, 2024

Is 1k or 10k acceptable?

Acceptable, as long as it doesn't change over time. That is to say, [0, 10000) forever is alright, but a dynamic range [x, x+10000) where x changes over time isn't.

Regarding cardinality: it stands for the number of distinct label values over a considerably long period, which is quite different from other DB systems.
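The point above can be made concrete with a little arithmetic: a TSDB keeps a series for every label value it has ever seen within its retention window, so a shifting range of IDs keeps accumulating series even though the number of live actors stays the same. A hypothetical sketch:

```python
# Why a *moving* range of label values is worse than a fixed one:
# the TSDB retains a series for every label value ever seen in the window.
# The scrape windows below are hypothetical, for illustration only.

def total_series(windows):
    """Count distinct label values seen across all scrape windows."""
    seen = set()
    for ids in windows:
        seen.update(ids)
    return len(seen)

static = [range(0, 10_000)] * 5  # [0, 10000) forever
churned = [range(i * 10_000, (i + 1) * 10_000) for i in range(5)]  # shifting

print(total_series(static))   # -> 10000
print(total_series(churned))  # -> 50000: 5x the series for the same live count
```

This is why actor IDs that get reassigned on every scale-out or recovery are a worse label than, say, a stable fragment ID.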


This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

@fuyufjh
Member

fuyufjh commented Aug 22, 2024

New progress: #18108
