Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase retention time and add disk full alert back #848

Merged
merged 3 commits into from
Oct 26, 2021

Conversation

cristinaleonr
Copy link
Contributor

@cristinaleonr cristinaleonr commented Oct 25, 2021

  1. Add the PlatformCluster_PrometheusPersistentDiskTooFull alert back to the prometheus-federation config after it was removed during the migration to k8s-support. Prometheus disk full alerts needed for prometheus-federation cluster #847.
  2. Increase retention time to 120 days.

This change is Reviewable

@stephen-soltesz
Copy link
Contributor

FYI: @nkinkade

Copy link
Contributor

@stephen-soltesz stephen-soltesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r1.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @cristinaleonr)


config/federation/prometheus/alerts.yml, line 1212 at r1 (raw file):

# PlatformCluster_PrometheusPersistentDiskTooFull fires when the persistent
# disk mounted on the Prometheus VM gets too full (less than 5% free).
  - alert: PlatformCluster_PrometheusPersistentDiskTooFull

The alert prefix was originally to identify the origin of the alert (i.e. from the platform cluster instance of prometheus).

In this case, we want this alert to use metrics from the prometheus federation cluster as well as metrics from the data processing cluster.

There are two independent strategies.

  1. (quick) When prometheus-federation collects metrics from other instances of prometheus, the set of metrics must be specified in the configuration i.e. see: https://github.com/m-lab/prometheus-support/blob/master/config/federation/prometheus/prometheus.yml.template#L383 and https://github.com/m-lab/prometheus-support/blob/master/config/federation/prometheus/prometheus.yml.template#L302

    Add the metrics used below to the federation request from the data-processing cluster.

  2. (complete) Add an alerts configuration to the data processing instance of prometheus that includes it's own version of this alert. This would be consistent with the architecture where each prometheus instance manages its own alerts.

For this case, I recommend adding a TODO for number 2. And, proceed with 1. As part of the SLI redesign for gardener, that may be a more natural opportunity to move alert processing to the local cluster.


config/federation/prometheus/alerts.yml, line 1214 at r1 (raw file):

  - alert: PlatformCluster_PrometheusPersistentDiskTooFull
    expr: |
      node_filesystem_avail_bytes{cluster="platform-cluster", node="prometheus-platform-cluster", mountpoint="/mnt/local"}

Ideally, the cluster / node labels would be generic. Something to identify the prometheus-federation and data-processing instances.


config/federation/prometheus/alerts.yml, line 1229 at r1 (raw file):

        a running VM. Please refer to the [instructions on how to do this][1].
        [1]: https://github.com/m-lab/k8s-support/blob/master/manage-cluster/PROMETHEUS.md#resizing-the-prometheus-vms-disk
      dashboard: https://grafana.mlab-oti.measurementlab.net/d/sVklmeHik/prometheus-self-monitoring?orgId=1&var-datasource=Platform%20Cluster%20(mlab-oti)

Please update the dashboard link to refer to the relevant instance of prometheus.

Copy link
Contributor

@nkinkade nkinkade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @cristinaleonr)


config/federation/prometheus/alerts.yml, line 1219 at r1 (raw file):

    labels:
      repo: ops-tracker
      severity: ticket

When I was moving platform cluster alerts out of the prometheus-federation config, I also added a new label named cluster to every alert (for both the platform cluster and prometheus-federation). For consistency, you should add this label:

cluster: prometheus-federation

.... and similarly for whatever alert will handle this case for data-processing.

Copy link
Contributor Author

@cristinaleonr cristinaleonr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @nkinkade and @stephen-soltesz)


config/federation/prometheus/alerts.yml, line 1212 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

The alert prefix was originally to identify the origin of the alert (i.e. from the platform cluster instance of prometheus).

In this case, we want this alert to use metrics from the prometheus federation cluster as well as metrics from the data processing cluster.

There are two independent strategies.

  1. (quick) When prometheus-federation collects metrics from other instances of prometheus, the set of metrics must be specified in the configuration i.e. see: https://github.com/m-lab/prometheus-support/blob/master/config/federation/prometheus/prometheus.yml.template#L383 and https://github.com/m-lab/prometheus-support/blob/master/config/federation/prometheus/prometheus.yml.template#L302

    Add the metrics used below to the federation request from the data-processing cluster.

  2. (complete) Add an alerts configuration to the data processing instance of prometheus that includes it's own version of this alert. This would be consistent with the architecture where each prometheus instance manages its own alerts.

For this case, I recommend adding a TODO for number 2. And, proceed with 1. As part of the SLI redesign for gardener, that may be a more natural opportunity to move alert processing to the local cluster.

Okay, I understand now. I thought we just had to bring the old migrated code back.

I removed the alert prefix, added the metrics for the alert under the data-processing cluster configuration, and added a TODO for strategy 2.


config/federation/prometheus/alerts.yml, line 1214 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Ideally, the cluster / node labels would be generic. Something to identify the prometheus-federation and data-processing instances.

Changed the clusters.


config/federation/prometheus/alerts.yml, line 1219 at r1 (raw file):

Previously, nkinkade wrote…

When I was moving platform cluster alerts out of the prometheus-federation config, I also added a new label named cluster to every alert (for both the platform cluster and prometheus-federation). For consistency, you should add this label:

cluster: prometheus-federation

.... and similarly for whatever alert will handle this case for data-processing.

Great, I added the cluster: prometheus-federation label.

Technically, this alert is for both prometheus-federation and data-processing. I can add data-processing here, too, if there is some syntax to add two values for a label.


config/federation/prometheus/alerts.yml, line 1229 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Please update the dashboard link to refer to the relevant instance of prometheus.

I changed the dashboard link to use the default datasource (now prometheus-federation), since it will only allow one datasource selection at a time. Alternatively, we can also change the dashboard to allow multiple datasources at the same time (e.g., prometheus-federation and data-processing).

@cristinaleonr cristinaleonr changed the title Increase retention time and add disk full alerts back Increase retention time and add disk full alert back Oct 26, 2021
Copy link
Contributor

@stephen-soltesz stephen-soltesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r2.
Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkinkade)


config/federation/prometheus/alerts.yml, line 1219 at r1 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

Great, I added the cluster: prometheus-federation label.

Technically, this alert is for both prometheus-federation and data-processing. I can add data-processing here, too, if there is some syntax to add two values for a label.

Until strategy 2 is complete, the origin if the alert will be correct with the current label. So this lgtm.


config/federation/prometheus/alerts.yml, line 1229 at r1 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

I changed the dashboard link to use the default datasource (now prometheus-federation), since it will only allow one datasource selection at a time. Alternatively, we can also change the dashboard to allow multiple datasources at the same time (e.g., prometheus-federation and data-processing).

Ah, okay that's probably good. I was thinking of https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#templating -- but I don't think we have the right label in the alert to refer to and place in the dashboard url... that's an aspiration for the future.

@cristinaleonr cristinaleonr merged commit 7005fec into master Oct 26, 2021
@cristinaleonr cristinaleonr deleted the sandbox-cristinaleon-retention-time branch October 26, 2021 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants