A better "valid use case" section for Pushgateway #2442

Open
iNishant opened this issue Mar 14, 2024 · 2 comments

Comments

@iNishant

Currently, the docs at https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway say:

Usually, the only valid use case for the Pushgateway is for capturing the outcome of a service-level batch job. A "service-level" batch job is one which is not semantically related to a specific machine or job instance (for example, a batch job that deletes a number of users for an entire service). Such a job's metrics should not include a machine or instance label to decouple the lifecycle of specific machines or instances from the pushed metrics.
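
To make the "no machine or instance label" point concrete, here is a minimal sketch of what such a push could look like, using only the Python standard library. It relies on the Pushgateway's HTTP API (a `PUT` to `/metrics/job/<job_name>` with a text-format body). The gateway address, job name, and metric names are made up for illustration and do not come from the docs.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway, job, metrics):
    """Build (but do not send) a PUT request for a Pushgateway.

    The grouping key holds only the job name: there is no instance or
    machine label, so the pushed series are not tied to whichever host
    happened to run the batch job.
    """
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The text exposition format requires a trailing newline.
    body = ("\n".join(lines) + "\n").encode("utf-8")
    url = f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",  # PUT replaces every metric in this group
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# All values below are hypothetical placeholders.
request = build_push_request(
    "http://pushgateway.example.org:9091",
    "user_deletion_batch",
    {
        "batch_users_deleted": ("Users deleted by the last run.", 42),
        "batch_last_success_timestamp_seconds": (
            "Unix time of the last successful run.",
            1710400000,
        ),
    },
)

# To actually send it, something like the following would be needed:
# urllib.request.urlopen(request, timeout=5)
```

In a real job, an official client library would normally take care of the formatting. The sketch only shows the shape of the request and the absence of any per-machine label in the grouping key.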

It's hard (at least for me, and maybe for others) to infer from the above paragraph a common use case of the Pushgateway: pushing metrics from a machine as they change, because the machine is configured to be deleted once the job completes and so will not be available for scraping afterwards. Also, the example job, "a batch job that deletes a number of users for an entire service", doesn't feel like the best example, because it's easy to implement this job so that it runs on a machine that is not deleted after the job, where the metrics can be scraped normally (all machine/instance labels can be ignored).

For example, an alternative "valid use case" could be:

Imagine you have an ML model training job for which your system spawns a container. Your system is configured to delete the container after the job completes (to save on cost/resources). Now imagine Prometheus scraping this container: it's possible that some metrics, or their latest values, are never scraped because the container disappears after the job completes.

Do folks feel the same?

@aivachan

aivachan commented Sep 30, 2024

I want to chime in with another use case that doesn't fall into the "batch jobs" category. If you reach a case where it is appropriate for your service to panic (and the reasons for the panic are known with low cardinality), you may want metrics on that, but it's unlikely the service will get scraped before the process exits.

I can't think of another way to achieve this other than the Pushgateway, but please let me know if there's something I'm missing. I agree with the general dissuasion from using the Pushgateway, but maybe this case would be good to throw in the docs as well.

@beorn7
Member

beorn7 commented Oct 2, 2024

In any case where you want some continuous metrics collection, I would not recommend the Pushgateway. "Ensuring that you get a final update of the metrics before exiting" is not what the PGW was made for. For one, it doesn't really scale to a lot of metrics, but more importantly, it has no HA concept and is a rather unreliable way to "drop a metric somewhere". That's more or less "fine" for a simple batch job, where you mostly want to know if it finished, and if so with what result. If something goes wrong on the PGW side, you can alert on the absence or staleness of the metric and can easily fix things. However, once you actually depend on detailed metrics to diagnose a problem more deeply, I would go for better collection mechanisms. Prometheus's scrape approach is very robust and scalable, but naturally isn't well suited for "let me send this last update before exiting" scenarios. If you cannot keep a job running for long enough, I would consider something else (like an actual push approach, which you could even shoehorn into Prometheus via remote-write, although that has other caveats; or generally something more event/logging based).
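
As a rough illustration of "alert on the absence or staleness of the metric", the sketch below mirrors in Python the kind of check an alerting rule could perform. It assumes the `push_time_seconds` metric that the Pushgateway exposes for each pushed group. The function name and thresholds are hypothetical, and in practice this logic would live in a PromQL alerting rule rather than in application code.

```python
import time


def push_status(push_time_seconds, max_age_seconds, now=None):
    """Classify the last push for a group as absent, stale, or fresh.

    Roughly equivalent to PromQL expressions along the lines of
        absent(push_time_seconds{job="user_deletion_batch"})
        time() - push_time_seconds{job="user_deletion_batch"} > 90000
    where the job name and threshold are placeholders.
    """
    if push_time_seconds is None:
        # No sample at all: the job never pushed, or the group was deleted.
        return "absent"
    if now is None:
        now = time.time()
    if now - push_time_seconds > max_age_seconds:
        return "stale"
    return "fresh"
```

For a daily batch job, a `max_age_seconds` slightly above 24 hours would tolerate normal jitter while still catching a missed run.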

3 participants