Per actor metrics: should be cleaned when the actor is dropped or moved. #9492

fuyufjh · 2023-04-26T11:37:16Z

Problem

The leaked actor memory not only consumes extra memory but also affects the metrics. As the screenshot shows, Actor 24 has already been dropped, but its metrics still exist.

Cause

This is caused by the design of MetricVec (actually a hashmap of labels -> single metrics) in the Prometheus client library.

For example,

pub agg_cached_keys: GenericGaugeVec<AtomicI64>,

which is actually backed by MetricVec:

/// The underlying implementation for [`GaugeVec`] and [`IntGaugeVec`].
pub type GenericGaugeVec<P> = MetricVec<GaugeVecBuilder<P>>;

When it is called with with_label_values, a new key (label) is created in that hashmap, e.g.

this.metrics
    .agg_cached_keys
    .with_label_values(&[&table_id_str, &actor_id_str])
    .set(vars.agg_group_cache.len() as i64);

But the key is never removed.
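To make the leak concrete, here is a minimal sketch that models the label map with the standard library only. `ToyGaugeVec`, `series_count`, and `run_actor` are hypothetical names for illustration, not the Prometheus client's types; the real `MetricVec` does offer `remove_label_values`, but nothing calls it when an actor goes away.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::{Arc, Mutex};

/// Toy stand-in for a labeled gauge vector: label values -> one gauge.
/// (Hypothetical type; it only mimics the hashmap behavior described above.)
#[derive(Default)]
pub struct ToyGaugeVec {
    children: Mutex<HashMap<Vec<String>, Arc<AtomicI64>>>,
}

impl ToyGaugeVec {
    /// Returns the gauge for these label values, creating the entry on first use.
    pub fn with_label_values(&self, vals: &[&str]) -> Arc<AtomicI64> {
        let key: Vec<String> = vals.iter().map(|v| v.to_string()).collect();
        let mut children = self.children.lock().unwrap();
        let gauge: Arc<AtomicI64> = children.entry(key).or_default().clone();
        gauge
    }

    /// Explicitly removes the entry; returns whether it existed.
    pub fn remove_label_values(&self, vals: &[&str]) -> bool {
        let key: Vec<String> = vals.iter().map(|v| v.to_string()).collect();
        self.children.lock().unwrap().remove(&key).is_some()
    }

    /// Number of label combinations currently exported.
    pub fn series_count(&self) -> usize {
        self.children.lock().unwrap().len()
    }
}

/// Simulates an actor that reports its cache size once and then goes away.
pub fn run_actor(vec: &ToyGaugeVec, table_id: &str, actor_id: &str, cached: i64) {
    vec.with_label_values(&[table_id, actor_id])
        .store(cached, Ordering::Relaxed);
    // The actor ends here, but nothing removes its entry from the map.
}
```

After `run_actor` returns, `series_count()` still includes the entry for that actor; it only disappears if someone calls `remove_label_values` with exactly the same label values.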

Solution

Similar to LockGuard, one solution I can think of is to wrap the MetricVec variables (e.g. agg_cached_keys) in a handle object that removes its own key (label) from the MetricVec's hashmap when it is dropped.
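A rough sketch of that guard idea, again using a std-only toy model rather than the real Prometheus types. `ToyGaugeVec`, `LabelGuard`, and `with_guarded_label_values` are made-up names, and this is only one possible shape for such a wrapper, not the implementation that was eventually merged.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::{Arc, Mutex};

type Children = Mutex<HashMap<Vec<String>, Arc<AtomicI64>>>;

/// Toy stand-in for a labeled gauge vector (hypothetical, std-only).
#[derive(Clone, Default)]
pub struct ToyGaugeVec {
    children: Arc<Children>,
}

/// Handle that owns one label combination and removes it when dropped.
pub struct LabelGuard {
    owner: ToyGaugeVec,
    key: Vec<String>,
    gauge: Arc<AtomicI64>,
}

impl ToyGaugeVec {
    /// Creates (or reuses) the entry and ties its lifetime to the returned guard.
    pub fn with_guarded_label_values(&self, vals: &[&str]) -> LabelGuard {
        let key: Vec<String> = vals.iter().map(|v| v.to_string()).collect();
        let gauge: Arc<AtomicI64> = {
            let mut children = self.children.lock().unwrap();
            children.entry(key.clone()).or_default().clone()
        };
        LabelGuard { owner: self.clone(), key, gauge }
    }

    /// Number of label combinations currently exported.
    pub fn series_count(&self) -> usize {
        self.children.lock().unwrap().len()
    }
}

impl LabelGuard {
    pub fn set(&self, value: i64) {
        self.gauge.store(value, Ordering::Relaxed);
    }

    pub fn get(&self) -> i64 {
        self.gauge.load(Ordering::Relaxed)
    }
}

impl Drop for LabelGuard {
    fn drop(&mut self) {
        // Remove this guard's key so the series stops being exported.
        // A poisoned lock is ignored here to avoid panicking inside drop.
        if let Ok(mut children) = self.owner.children.lock() {
            children.remove(&self.key);
        }
    }
}
```

An executor would keep the `LabelGuard` as a field, so the series disappears when the executor (and its actor) is dropped. One caveat of this simple version: if two guards share the same label values, the first one dropped removes the entry for both, so some form of per-label holder counting would be needed in that case.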

fuyufjh · 2023-05-09T06:25:12Z

Quite annoying when investigating problems... 🥲 Hope this gets fixed.

MrCroxx · 2023-05-10T08:24:01Z

IMO, the streaming actor metrics can imitate how the batch task metrics are cleaned (with a customized Collector). Related issues: #3832 #4844 #5742. What are your opinions? @fuyufjh @ZENOTME

MrCroxx · 2023-05-10T08:32:49Z

Also, StreamingMetrics currently contains metrics at the task level, actor level, executor level, etc., which could be split into multiple metrics for better lifetime management.

fuyufjh · 2023-05-10T08:34:22Z

> IMO, the streaming actor metrics can imitate how the batch task metrics are cleaned (with a customized Collector). Related issues: #3832 #4844 #5742. What are your opinions? @fuyufjh @ZENOTME

Sounds good to me. It is similar to the solution that I imagined before (described in the PR's description), i.e. using something to hold the lifetime of these actor-level metrics.

ZENOTME · 2023-05-10T08:37:54Z

> IMO, the streaming actor metrics can imitate how the batch task metrics are cleaned (with a customized Collector). Related issues: #3832 #4844 #5742. What are your opinions? @fuyufjh @ZENOTME

+1. I think it's the same problem as in batch, and it was solved in #9378.

MrCroxx · 2023-05-10T09:02:08Z

One thing that makes streaming metrics harder to clean is that the labels of the streaming metrics are not the same. 🤣 Let me try to register/collect them with as few modifications as possible.

fuyufjh · 2023-07-14T04:53:23Z

@MrCroxx Any further updates?

MrCroxx · 2023-08-08T11:57:05Z

I was working on the new file cache engine before. Let me get back to this PR in the next few days.

lmatz · 2023-08-18T09:40:32Z

The example above is an aggregation.
We also have a case for join.

fuyufjh · 2023-09-11T06:50:02Z

@MrCroxx any updates? 👀

fuyufjh mentioned this issue Apr 26, 2023

bug: potential metrics leak #6855

Open

7 tasks

github-actions bot added this to the release-0.20 milestone Apr 26, 2023

fuyufjh added the good first issue and help wanted labels Apr 26, 2023

fuyufjh assigned MrCroxx May 9, 2023

fuyufjh removed the good first issue label Aug 8, 2023

fuyufjh modified the milestones: release-1.0, release-1.2 Aug 8, 2023

fuyufjh modified the milestones: release-1.2, release-1.3 Sep 11, 2023

fuyufjh mentioned this issue Oct 17, 2023

feat(metrics): support reference counting in metrics label #12882

Merged

8 tasks

fuyufjh closed this as completed Nov 9, 2023
