bug: potential metrics leak #6855

Open
3 of 7 tasks
hzxa21 opened this issue Dec 12, 2022 · 5 comments
Labels
type/bug Something isn't working

Comments

@hzxa21
Collaborator

hzxa21 commented Dec 12, 2022

Describe the bug

In our system, metric labels are widely used, but I noticed that labeled metrics are not cleaned up when they become unused. There are potential metrics leaks for the following metrics:

To Reproduce

No response

Expected behavior

No response

Additional context

To fix the issue, we can cache the labeled metrics returned by with_label_values and call remove_label_values when the metrics are no longer used. As a side benefit, this avoids the label lookup on every metric update.
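A minimal sketch of this idea with the Rust prometheus crate (the metric and label names here are made up for illustration, not the actual RisingWave metrics):

```rust
use prometheus::{IntCounterVec, Opts};

fn main() -> prometheus::Result<()> {
    // Hypothetical labeled metric.
    let throughput = IntCounterVec::new(
        Opts::new("executor_throughput", "rows emitted per executor"),
        &["actor_id"],
    )?;

    // Cache the labeled child once; later updates skip the label lookup.
    let counter = throughput.with_label_values(&["42"]);
    counter.inc_by(100);

    // When the owner goes away, drop the cached child and remove the label
    // set so the stale series is no longer exported.
    drop(counter);
    throughput.remove_label_values(&["42"])?;
    Ok(())
}
```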

@hzxa21 hzxa21 added the type/bug Something isn't working label Dec 12, 2022
@github-actions github-actions bot added this to the release-0.1.15 milestone Dec 12, 2022
@soundOfDestiny
Contributor

Per compaction group metrics (meta-side metrics are cleaned up): should be cleaned when the group is dropped.

This is now handled by remove_compaction_group_in_sst_stat, which is called in sync_group.

@hzxa21
Collaborator Author

hzxa21 commented Dec 14, 2022

Per compaction group metrics (meta-side metrics are cleaned up): should be cleaned when the group is dropped.

This is now handled by remove_compaction_group_in_sst_stat, which is called in sync_group.

Yes, the meta-side compaction group metrics are already cleaned up correctly, but the compactor-side ones are not.

@hzxa21 hzxa21 modified the milestones: release-0.1.15, release-0.1.16 Dec 19, 2022
@Gun9niR
Contributor

Gun9niR commented Dec 22, 2022

Cleaning up the labels right after a short-lived instance is dropped or finished will remove the corresponding metric value before it is collected. For example, a batch task may take only 1 second, while Prometheus may collect only once every 20 seconds. Therefore, the stale labels can only be removed after they are collected, which means we need to implement our own Collector to be aware of the invocation of collect(). In #5770, I call reset() on all the metrics after they are collected, but this has a race condition: if data is written between collect() and reset(), it is lost, though this happens very rarely.
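For reference, a rough sketch of the reset-after-collect idea with the Rust prometheus crate (not the actual code from #5770); the race window mentioned above sits between collect() and reset():

```rust
use prometheus::core::{Collector, Desc};
use prometheus::proto::MetricFamily;
use prometheus::HistogramVec;

/// Illustrative wrapper: resets the inner vec right after every scrape, so
/// stale label sets disappear once they have been exported.
struct ResetOnCollect {
    inner: HistogramVec,
}

impl Collector for ResetOnCollect {
    fn desc(&self) -> Vec<&Desc> {
        self.inner.desc()
    }

    fn collect(&self) -> Vec<MetricFamily> {
        let families = self.inner.collect();
        // Race window: samples recorded between the line above and the reset
        // below are lost.
        self.inner.reset();
        families
    }
}
```

The wrapper, rather than the inner vec, would be registered with the Registry, so collect() runs once per scrape.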

@wenym1 wenym1 modified the milestones: release-0.1.16, release-0.1.17 Jan 30, 2023
@hzxa21 hzxa21 modified the milestones: release-0.1.17, release-0.1.18 Feb 20, 2023
@fuyufjh fuyufjh removed this from the release-0.18 milestone Mar 22, 2023
@lmatz
Contributor

lmatz commented Mar 22, 2023

To give an example:

[screenshot of a metrics dashboard showing stale label series]

The top entries come from a streaming query that has already been dropped, but the metrics are still kept.

@fuyufjh
Member

fuyufjh commented Apr 25, 2023

We need to call remove_label_values when an actor is dropped.
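A minimal sketch of that cleanup, assuming the actor keeps a handle to its labeled metric vecs (the type and field names are illustrative, not RisingWave's actual ones):

```rust
use prometheus::IntCounterVec;

/// Illustrative per-actor metrics handle.
struct ActorMetrics {
    actor_id: String,
    row_count: IntCounterVec,
}

impl Drop for ActorMetrics {
    fn drop(&mut self) {
        // Remove the actor's label set so the stale series stops being
        // exported; ignore the error if the labels were already removed.
        let _ = self.row_count.remove_label_values(&[self.actor_id.as_str()]);
    }
}
```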
