bug: potential metrics leak #6855

Open
3 of 7 tasks
hzxa21 opened this issue Dec 12, 2022 · 5 comments
Labels
type/bug Something isn't working

Comments

@hzxa21
Collaborator

hzxa21 commented Dec 12, 2022

Describe the bug

In our system, metric labels are widely used, but I noticed that labeled metrics are not cleaned up when they become unused. There are potential metrics leaks for the following metrics:

To Reproduce

No response

Expected behavior

No response

Additional context

To fix the issue, we can cache the labeled metrics returned by with_label_values and call remove_label_values when the metrics are no longer used. As a side benefit, this avoids the label lookup on every metric update.
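A minimal sketch of this idea with the Rust prometheus crate (the metric and label names here are made up for illustration, not the actual RisingWave metrics):

```rust
use prometheus::{IntCounterVec, Opts};

fn main() -> prometheus::Result<()> {
    // Hypothetical labeled metric.
    let throughput = IntCounterVec::new(
        Opts::new("executor_throughput", "rows emitted per executor"),
        &["actor_id"],
    )?;

    // Cache the labeled child once; later updates skip the label lookup.
    let counter = throughput.with_label_values(&["42"]);
    counter.inc_by(100);

    // When the owner goes away, drop the cached child and remove the label
    // set so the stale series is no longer exported.
    drop(counter);
    throughput.remove_label_values(&["42"])?;
    Ok(())
}
```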

@hzxa21 hzxa21 added the type/bug Something isn't working label Dec 12, 2022
@github-actions github-actions bot added this to the release-0.1.15 milestone Dec 12, 2022
@soundOfDestiny
Contributor

Per compaction group metrics (meta-side metrics are cleaned up): should be cleaned when the group is dropped.

This is now handled by remove_compaction_group_in_sst_stat, which is called in sync_group.

@hzxa21
Collaborator Author

hzxa21 commented Dec 14, 2022

Per compaction group metrics (meta-side metrics are cleaned up): should be cleaned when the group is dropped.

This is now handled by remove_compaction_group_in_sst_stat, which is called in sync_group.

Yes, the meta-side compaction group metrics are already cleaned up correctly, but the compactor-side ones are not.

@hzxa21 hzxa21 modified the milestones: release-0.1.15, release-0.1.16 Dec 19, 2022
@Gun9niR
Contributor

Gun9niR commented Dec 22, 2022

Cleaning up the labels right after a short-lived instance is dropped or finished will remove the corresponding metric value before it is collected. For example, a batch task may take only 1 second, while Prometheus may collect only once every 20 seconds. Therefore, the stale labels can only be removed after they are collected, which means we need to implement our own Collector to be aware of the invocation of collect(). In #5770, I call reset() on all the metrics after they are collected, but this has a race condition: if data is written between collect() and reset(), it is lost, though this happens very rarely.
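For reference, a rough sketch of the reset-after-collect idea with the Rust prometheus crate (not the actual code from #5770); the race window mentioned above sits between collect() and reset():

```rust
use prometheus::core::{Collector, Desc};
use prometheus::proto::MetricFamily;
use prometheus::HistogramVec;

/// Illustrative wrapper: resets the inner vec right after every scrape, so
/// stale label sets disappear once they have been exported.
struct ResetOnCollect {
    inner: HistogramVec,
}

impl Collector for ResetOnCollect {
    fn desc(&self) -> Vec<&Desc> {
        self.inner.desc()
    }

    fn collect(&self) -> Vec<MetricFamily> {
        let families = self.inner.collect();
        // Race window: samples recorded between the line above and the reset
        // below are lost.
        self.inner.reset();
        families
    }
}
```

The wrapper, rather than the inner vec, would be registered with the Registry, so collect() runs once per scrape.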

@wenym1 wenym1 modified the milestones: release-0.1.16, release-0.1.17 Jan 30, 2023
@hzxa21 hzxa21 modified the milestones: release-0.1.17, release-0.1.18 Feb 20, 2023
@fuyufjh fuyufjh removed this from the release-0.18 milestone Mar 22, 2023
@lmatz
Contributor

lmatz commented Mar 22, 2023

To give an example:

[screenshot of a metrics dashboard showing stale label series]

The top entries come from a streaming query that has already been dropped, but the metrics are still kept.

@fuyufjh
Member

fuyufjh commented Apr 25, 2023

We need to call remove_label_values when an actor is dropped.
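A minimal sketch of that cleanup, assuming the actor keeps a handle to its labeled metric vecs (the type and field names are illustrative, not RisingWave's actual ones):

```rust
use prometheus::IntCounterVec;

/// Illustrative per-actor metrics handle.
struct ActorMetrics {
    actor_id: String,
    row_count: IntCounterVec,
}

impl Drop for ActorMetrics {
    fn drop(&mut self) {
        // Remove the actor's label set so the stale series stops being
        // exported; ignore the error if the labels were already removed.
        let _ = self.row_count.remove_label_values(&[self.actor_id.as_str()]);
    }
}
```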
