Increase retention time and add disk full alert back #848

cristinaleonr · 2021-10-25T13:44:55Z

Add the PlatformCluster_PrometheusPersistentDiskTooFull alert back to the prometheus-federation config after it was removed during the migration to k8s-support. Prometheus disk full alerts needed for prometheus-federation cluster #847.
Increase retention time to 120 days.

This change is

stephen-soltesz · 2021-10-25T18:30:46Z

stephen-soltesz

Reviewed 1 of 2 files at r1.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @cristinaleonr)

config/federation/prometheus/alerts.yml, line 1212 at r1 (raw file):

# PlatformCluster_PrometheusPersistentDiskTooFull fires when the persistent
# disk mounted on the Prometheus VM gets too full (less than 5% free).
  - alert: PlatformCluster_PrometheusPersistentDiskTooFull

The alert prefix was originally to identify the origin of the alert (i.e. from the platform cluster instance of prometheus).

In this case, we want this alert to use metrics from the prometheus federation cluster as well as metrics from the data processing cluster.

There are two independent strategies.

(quick) When prometheus-federation collects metrics from other instances of prometheus, the set of metrics must be specified in the configuration i.e. see: https://github.com/m-lab/prometheus-support/blob/master/config/federation/prometheus/prometheus.yml.template#L383 and https://github.com/m-lab/prometheus-support/blob/master/config/federation/prometheus/prometheus.yml.template#L302

Add the metrics used below to the federation request from the data-processing cluster.
(complete) Add an alerts configuration to the data processing instance of prometheus that includes it's own version of this alert. This would be consistent with the architecture where each prometheus instance manages its own alerts.

For this case, I recommend adding a TODO for number 2. And, proceed with 1. As part of the SLI redesign for gardener, that may be a more natural opportunity to move alert processing to the local cluster.

config/federation/prometheus/alerts.yml, line 1214 at r1 (raw file):

  - alert: PlatformCluster_PrometheusPersistentDiskTooFull
    expr: |
      node_filesystem_avail_bytes{cluster="platform-cluster", node="prometheus-platform-cluster", mountpoint="/mnt/local"}

Ideally, the cluster / node labels would be generic. Something to identify the prometheus-federation and data-processing instances.

config/federation/prometheus/alerts.yml, line 1229 at r1 (raw file):

        a running VM. Please refer to the [instructions on how to do this][1].
        [1]: https://github.com/m-lab/k8s-support/blob/master/manage-cluster/PROMETHEUS.md#resizing-the-prometheus-vms-disk
      dashboard: https://grafana.mlab-oti.measurementlab.net/d/sVklmeHik/prometheus-self-monitoring?orgId=1&var-datasource=Platform%20Cluster%20(mlab-oti)

Please update the dashboard link to refer to the relevant instance of prometheus.

nkinkade

Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @cristinaleonr)

config/federation/prometheus/alerts.yml, line 1219 at r1 (raw file):

    labels:
      repo: ops-tracker
      severity: ticket

When I was moving platform cluster alerts out of the prometheus-federation config, I also added a new label named cluster to every alert (for both the platform cluster and prometheus-federation). For consistency, you should add this label:

cluster: prometheus-federation

.... and similarly for whatever alert will handle this case for data-processing.

cristinaleonr

Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @nkinkade and @stephen-soltesz)

config/federation/prometheus/alerts.yml, line 1212 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

The alert prefix was originally to identify the origin of the alert (i.e. from the platform cluster instance of prometheus).

In this case, we want this alert to use metrics from the prometheus federation cluster as well as metrics from the data processing cluster.

There are two independent strategies.

(quick) When prometheus-federation collects metrics from other instances of prometheus, the set of metrics must be specified in the configuration i.e. see: https://github.com/m-lab/prometheus-support/blob/master/config/federation/prometheus/prometheus.yml.template#L383 and https://github.com/m-lab/prometheus-support/blob/master/config/federation/prometheus/prometheus.yml.template#L302

Add the metrics used below to the federation request from the data-processing cluster.

(complete) Add an alerts configuration to the data processing instance of prometheus that includes it's own version of this alert. This would be consistent with the architecture where each prometheus instance manages its own alerts.

For this case, I recommend adding a TODO for number 2. And, proceed with 1. As part of the SLI redesign for gardener, that may be a more natural opportunity to move alert processing to the local cluster.

Okay, I understand now. I thought we just had to bring the old migrated code back.

I removed the alert prefix, added the metrics for the alert under the data-processing cluster configuration, and added a TODO for strategy 2.

config/federation/prometheus/alerts.yml, line 1214 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Ideally, the cluster / node labels would be generic. Something to identify the prometheus-federation and data-processing instances.

Changed the clusters.

config/federation/prometheus/alerts.yml, line 1219 at r1 (raw file):

Previously, nkinkade wrote…

When I was moving platform cluster alerts out of the prometheus-federation config, I also added a new label named cluster to every alert (for both the platform cluster and prometheus-federation). For consistency, you should add this label:

cluster: prometheus-federation

.... and similarly for whatever alert will handle this case for data-processing.

Great, I added the cluster: prometheus-federation label.

Technically, this alert is for both prometheus-federation and data-processing. I can add data-processing here, too, if there is some syntax to add two values for a label.

config/federation/prometheus/alerts.yml, line 1229 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Please update the dashboard link to refer to the relevant instance of prometheus.

I changed the dashboard link to use the default datasource (now prometheus-federation), since it will only allow one datasource selection at a time. Alternatively, we can also change the dashboard to allow multiple datasources at the same time (e.g., prometheus-federation and data-processing).

stephen-soltesz

Reviewed 2 of 2 files at r2.
Reviewable status: complete! 1 of 1 approvals obtained (waiting on @nkinkade)

config/federation/prometheus/alerts.yml, line 1219 at r1 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

Great, I added the cluster: prometheus-federation label.

Technically, this alert is for both prometheus-federation and data-processing. I can add data-processing here, too, if there is some syntax to add two values for a label.

Until strategy 2 is complete, the origin if the alert will be correct with the current label. So this lgtm.

config/federation/prometheus/alerts.yml, line 1229 at r1 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

I changed the dashboard link to use the default datasource (now prometheus-federation), since it will only allow one datasource selection at a time. Alternatively, we can also change the dashboard to allow multiple datasources at the same time (e.g., prometheus-federation and data-processing).

Ah, okay that's probably good. I was thinking of https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#templating -- but I don't think we have the right label in the alert to refer to and place in the dashboard url... that's an aspiration for the future.

Increase retention time and add disk full alerts back

070bce7

cristinaleonr requested a review from stephen-soltesz October 25, 2021 14:01

stephen-soltesz requested changes Oct 25, 2021

View reviewed changes

nkinkade reviewed Oct 25, 2021

View reviewed changes

cristinaleonr and others added 2 commits October 26, 2021 17:57

Use prometheus-federation and data-processing clusters for alert

25b1bf0

Merge branch 'master' into sandbox-cristinaleon-retention-time

36c14e0

cristinaleonr commented Oct 26, 2021

View reviewed changes

cristinaleonr changed the title ~~Increase retention time and add disk full alerts back~~ Increase retention time and add disk full alert back Oct 26, 2021

stephen-soltesz approved these changes Oct 26, 2021

View reviewed changes

cristinaleonr merged commit 7005fec into master Oct 26, 2021

cristinaleonr deleted the sandbox-cristinaleon-retention-time branch October 26, 2021 19:11

cristinaleonr added the on-call label Aug 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase retention time and add disk full alert back #848

Increase retention time and add disk full alert back #848

cristinaleonr commented Oct 25, 2021 •

edited

Loading

stephen-soltesz commented Oct 25, 2021

stephen-soltesz left a comment

nkinkade left a comment

cristinaleonr left a comment

stephen-soltesz left a comment

Increase retention time and add disk full alert back #848

Increase retention time and add disk full alert back #848

Conversation

cristinaleonr commented Oct 25, 2021 • edited Loading

stephen-soltesz commented Oct 25, 2021

stephen-soltesz left a comment

Choose a reason for hiding this comment

nkinkade left a comment

Choose a reason for hiding this comment

cristinaleonr left a comment

Choose a reason for hiding this comment

stephen-soltesz left a comment

Choose a reason for hiding this comment

cristinaleonr commented Oct 25, 2021 •

edited

Loading