
vmagent loses some metrics because it doesn't push them on shutdown #67

Open
andrii-dovzhenko opened this issue Jan 22, 2024 · 5 comments


andrii-dovzhenko commented Jan 22, 2024

Description

We noticed that some metrics are sporadically not pushed. After some debugging, we found that this only happens when vmagent runs for a short period of time and cannot push all the metrics, because some of them are created between the last scrape and shutdown.
The metrics appear in the input file, but they are never sent to the -remoteWrite.url endpoint.

A possible solution might be to change the code here

metrics/push.go

Lines 236 to 242 in fdfd428

		case <-stopCh:
			if wg != nil {
				wg.Done()
			}
			return
		}
	}

so that the remaining metrics are pushed one last time on shutdown.
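A minimal sketch of the idea, not the actual metrics/push.go code: pushMetrics() is a hypothetical stand-in for whatever helper the real loop calls on each tick, and the surrounding structure is simplified.

```go
package push

import (
	"sync"
	"time"
)

// pushMetrics is a hypothetical stand-in for the function that
// metrics/push.go runs on every tick to push the current metrics.
func pushMetrics() {
	// ... collect metrics and push them to the configured endpoint ...
}

// pushLoop sketches the proposed change: when stopCh fires, push one
// last time so that metrics collected after the final tick are not lost.
func pushLoop(interval time.Duration, stopCh <-chan struct{}, wg *sync.WaitGroup) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			pushMetrics()
		case <-stopCh:
			pushMetrics() // final push on shutdown
			if wg != nil {
				wg.Done()
			}
			return
		}
	}
}
```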

To reproduce

Use vmagent in an environment with a short life cycle.

Version

vmagent-20230313-021802-tags-v1.89.1-0-g388d6ee16
However, the version doesn't really matter, since the same problem exists even in the latest version of vmagent.


Veetaha commented Jan 22, 2024

Maybe there could be some way to signal to vmagent that we are about to shut down, so that it does a final scrape and pushes the metrics?

The use case is a process with a variable lifetime that stores metrics in a file; vmagent scrapes that file periodically and pushes the metrics to a remote write URL. Once that process shuts down we also shut down vmagent, but we expect vmagent to scrape the metrics file one last time and do a final push.

The problem reproduces when the metrics-producing process turns out to be short-lived (e.g. it fails fast but still produces some useful metrics). In that case the vmagent scrape interval is unlikely to coincide with the moment the process wrote its metrics, so vmagent never sees them and never pushes them.


anelson commented Jan 22, 2024

EDITED FOR CLARITY.

@andrii-dovzhenko and I work in the same company. I thought I addressed this issue several months ago by making sure vmagent is sent a SIGINT signal before we shut down the telemetry infrastructure, so that any buffered telemetry is flushed to the configured remote endpoint.

See here:

https://github.com/elastio/elastio/blob/edb2ae7795849fca523a674d2296bb498ff2cf44/docker/elastio-red-stack-base/supervised.sh#L60-L64

(This link is to an internal Elastio repo; sorry for this. Basically it's a script that sends SIGINT to vmagent before we shut down the rest of the telemetry infrastructure.)

The stop signal is configured in the respective vmagent and promtail config files as INT, meaning SIGINT. AFAIK, that signal should force vmagent and promtail to flush their buffers and then exit. These two are stopped explicitly so that nifmet and naflog are still running to receive their flushed results.

It sounds like perhaps this isn't working.

Is it expected behavior that vmagent flushes its buffers and writes all metrics to the configured server in response to a SIGINT signal?
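For context, this is the behavior the shutdown script assumes, and whether vmagent actually implements it is exactly the open question. A self-contained sketch of that assumed flush-on-SIGINT pattern, where push() is a hypothetical stand-in for one push cycle and none of this is vmagent's actual code:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// push is a hypothetical stand-in for one push cycle to the remote endpoint.
func push() { fmt.Println("pushing metrics") }

// main sketches the assumed behavior: on SIGINT (or SIGTERM) the push
// loop is stopped and performs one final push before the process exits.
func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)

	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			push() // regular periodic push
		case <-sigCh:
			push() // final push triggered by the signal, before exiting
			return
		}
	}
}
```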

cristaloleg commented:

Link is 404 (looks like this is a private repo 👀 )


anelson commented Jan 22, 2024

Sorry @cristaloleg, that is indeed a private repo. That comment was directed at @andrii-dovzhenko. I'll reword the comment to make it clearer.

andrii-dovzhenko (Author) commented:

@anelson, the order of stopping the services is correct. The problem is that vmagent does not flush its buffer on SIGINT, as we can see in the code snippet in the issue description, and this is not fixed in vmagent yet. As a result, nifmet does not receive the metrics that were created after the last scrape performed by vmagent.
