Support for applications with low traffic / uninitialized counters #1259

Open
fstr opened this issue Aug 30, 2024 · 0 comments
Scenario

  • I have an application with low traffic
  • I can't initialize all the counters in code, either because I have no control over the code or because the counters have such high cardinality that initializing all of them at application start is not feasible

Let's assume that I want an SLO on mymetric_request_count{rpc="myendpoint"}. When the application starts, this counter is not initialized, so it is not exposed on the application's /metrics endpoint. It only appears once the myendpoint RPC has received traffic.

Problem

Because no data is reported, the SLOMetricAbsent alert fires. I can disable the absent alert via a property on the SLO, but the burn rate expressions will still evaluate to no data.
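For reference, a minimal sketch of disabling the absent alert on the SLO. The field name alerting.absent is an assumption about which property is meant here; the Pyrra CRD is the authority on the exact name.

```yaml
# Fragment of a ServiceLevelObjective spec (not a complete manifest).
spec:
  alerting:
    # Assumed field name: turns off the generated SLOMetricAbsent alert.
    # It does not change the burn rate rules, which still return no data.
    absent: false
```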

The SLO definition

---
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: test-slo
  namespace: {{ .Release.Namespace }}
spec:
  target: "99.9"
  window: 4w
  description: Test SLO
  indicator:
    ratio:
      errors:
        metric: mymetric_request_count{rpc="myendpoint", code=~"Internal|UNKNOWN"}
      total:
        metric: mymetric_request_count{rpc="myendpoint"}
      grouping:
        - rpc

Pyrra will generate the following absent rule. Because the uninitialized counter is not reported after an application restart, the absent alert fires (as described, this can be disabled). The for duration is calculated from the window and is currently not configurable.

- alert: SLOMetricAbsent
  expr: absent(mymetric_request_count{rpc="myendpoint"}) == 1
  for: 2m

Pyrra will generate the following recording rule (out of many):

- expr: |
     sum by (rpc) (rate(mymetric_request_count{rpc="myendpoint", code=~"Internal|UNKNOWN"}[5m]))
     /
     sum by (rpc) (rate(mymetric_request_count{rpc="myendpoint"}[5m]))

Known mitigations

  • a) Initialize the counters in the code (which in my scenario I can't)
  • b) Create recording rules for the error and total metrics and pass them into the SLO, since the SLO only accepts vector selectors. These recording rules contain the workaround for the uninitialized counters.

The recording rule expression can look like this. It reports 0 when the application is up but the counter is not yet initialized.

mymetric_request_count{rpc="myendpoint", code=~"Internal|UNKNOWN"} or up{job="myjob"} * 0
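Wrapped into a Prometheus rule file, the expression above might look roughly like the following. This is only a sketch: the group name and the record names are hypothetical, and the total rule is my assumption of how the same fallback would be applied to the total metric.

```yaml
groups:
  - name: myendpoint-slo-inputs  # hypothetical group name
    rules:
      # Error counter, falling back to 0 while the application is up
      # but the counter has not been initialized yet.
      - record: myendpoint:request_count:errors  # hypothetical record name
        expr: |
          mymetric_request_count{rpc="myendpoint", code=~"Internal|UNKNOWN"}
          or
          up{job="myjob"} * 0
      # Total counter with the same fallback (assumed, mirrors the rule above).
      - record: myendpoint:request_count:total  # hypothetical record name
        expr: |
          mymetric_request_count{rpc="myendpoint"}
          or
          up{job="myjob"} * 0
```

The SLO would then reference the recorded series in errors.metric and total.metric instead of the raw counter. One caveat: the fallback series only carries the labels of up (for example job and instance), not rpc or code, so grouping by rpc may need a label_replace around the fallback. That is left out here.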

The recording rule only helps with low-cardinality metrics, where I know all of the label values in advance. With high-cardinality or potentially unknown label values (imagine customer IDs), the or clause cannot be written in advance and would have to be added dynamically.

Request / Question

Are you interested in adding support for an or clause, either customizable or hardcoded, to help with uninitialized Prometheus counters?

This could be:

  • or up{job="myjob"} * 0
  • or vector(0) (will always report 0, even if the application is not running)

This could be enabled via a toggle on the SLO or via fully customizable PromQL (which I know you try to avoid in order to keep things simple).

This pattern is so far only used in the pyrra_availability metric (link)

@fstr fstr changed the title Improve support for applications with low traffic Support for applications with low traffic / uninitialized counters Aug 30, 2024