-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase retention time and add disk full alert back #848
Conversation
FYI: @nkinkade |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r1.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @cristinaleonr)
config/federation/prometheus/alerts.yml, line 1212 at r1 (raw file):
# PlatformCluster_PrometheusPersistentDiskTooFull fires when the persistent # disk mounted on the Prometheus VM gets too full (less than 5% free). - alert: PlatformCluster_PrometheusPersistentDiskTooFull
The alert prefix was originally to identify the origin of the alert (i.e. from the platform cluster instance of prometheus).
In this case, we want this alert to use metrics from the prometheus federation cluster as well as metrics from the data processing cluster.
There are two independent strategies.
-
(quick) When prometheus-federation collects metrics from other instances of prometheus, the set of metrics must be specified in the configuration i.e. see: https://github.com/m-lab/prometheus-support/blob/master/config/federation/prometheus/prometheus.yml.template#L383 and https://github.com/m-lab/prometheus-support/blob/master/config/federation/prometheus/prometheus.yml.template#L302
Add the metrics used below to the federation request from the data-processing cluster.
-
(complete) Add an alerts configuration to the data processing instance of prometheus that includes it's own version of this alert. This would be consistent with the architecture where each prometheus instance manages its own alerts.
For this case, I recommend adding a TODO for number 2. And, proceed with 1. As part of the SLI redesign for gardener, that may be a more natural opportunity to move alert processing to the local cluster.
config/federation/prometheus/alerts.yml, line 1214 at r1 (raw file):
- alert: PlatformCluster_PrometheusPersistentDiskTooFull expr: | node_filesystem_avail_bytes{cluster="platform-cluster", node="prometheus-platform-cluster", mountpoint="/mnt/local"}
Ideally, the cluster / node labels would be generic. Something to identify the prometheus-federation and data-processing instances.
config/federation/prometheus/alerts.yml, line 1229 at r1 (raw file):
a running VM. Please refer to the [instructions on how to do this][1]. [1]: https://github.com/m-lab/k8s-support/blob/master/manage-cluster/PROMETHEUS.md#resizing-the-prometheus-vms-disk dashboard: https://grafana.mlab-oti.measurementlab.net/d/sVklmeHik/prometheus-self-monitoring?orgId=1&var-datasource=Platform%20Cluster%20(mlab-oti)
Please update the dashboard link to refer to the relevant instance of prometheus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @cristinaleonr)
config/federation/prometheus/alerts.yml, line 1219 at r1 (raw file):
labels: repo: ops-tracker severity: ticket
When I was moving platform cluster alerts out of the prometheus-federation config, I also added a new label named cluster
to every alert (for both the platform cluster and prometheus-federation). For consistency, you should add this label:
cluster: prometheus-federation
.... and similarly for whatever alert will handle this case for data-processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @nkinkade and @stephen-soltesz)
config/federation/prometheus/alerts.yml, line 1212 at r1 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
The alert prefix was originally to identify the origin of the alert (i.e. from the platform cluster instance of prometheus).
In this case, we want this alert to use metrics from the prometheus federation cluster as well as metrics from the data processing cluster.
There are two independent strategies.
(quick) When prometheus-federation collects metrics from other instances of prometheus, the set of metrics must be specified in the configuration i.e. see: https://github.com/m-lab/prometheus-support/blob/master/config/federation/prometheus/prometheus.yml.template#L383 and https://github.com/m-lab/prometheus-support/blob/master/config/federation/prometheus/prometheus.yml.template#L302
Add the metrics used below to the federation request from the data-processing cluster.
(complete) Add an alerts configuration to the data processing instance of prometheus that includes it's own version of this alert. This would be consistent with the architecture where each prometheus instance manages its own alerts.
For this case, I recommend adding a TODO for number 2. And, proceed with 1. As part of the SLI redesign for gardener, that may be a more natural opportunity to move alert processing to the local cluster.
Okay, I understand now. I thought we just had to bring the old migrated code back.
I removed the alert prefix, added the metrics for the alert under the data-processing cluster configuration, and added a TODO for strategy 2.
config/federation/prometheus/alerts.yml, line 1214 at r1 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
Ideally, the cluster / node labels would be generic. Something to identify the prometheus-federation and data-processing instances.
Changed the clusters.
config/federation/prometheus/alerts.yml, line 1219 at r1 (raw file):
Previously, nkinkade wrote…
When I was moving platform cluster alerts out of the prometheus-federation config, I also added a new label named
cluster
to every alert (for both the platform cluster and prometheus-federation). For consistency, you should add this label:
cluster: prometheus-federation
.... and similarly for whatever alert will handle this case for data-processing.
Great, I added the cluster: prometheus-federation
label.
Technically, this alert is for both prometheus-federation and data-processing. I can add data-processing here, too, if there is some syntax to add two values for a label.
config/federation/prometheus/alerts.yml, line 1229 at r1 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
Please update the dashboard link to refer to the relevant instance of prometheus.
I changed the dashboard link to use the default datasource (now prometheus-federation), since it will only allow one datasource selection at a time. Alternatively, we can also change the dashboard to allow multiple datasources at the same time (e.g., prometheus-federation and data-processing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r2.
Reviewable status: complete! 1 of 1 approvals obtained (waiting on @nkinkade)
config/federation/prometheus/alerts.yml, line 1219 at r1 (raw file):
Previously, cristinaleonr (Cristina Leon) wrote…
Great, I added the
cluster: prometheus-federation
label.Technically, this alert is for both prometheus-federation and data-processing. I can add data-processing here, too, if there is some syntax to add two values for a label.
Until strategy 2 is complete, the origin if the alert will be correct with the current label. So this lgtm.
config/federation/prometheus/alerts.yml, line 1229 at r1 (raw file):
Previously, cristinaleonr (Cristina Leon) wrote…
I changed the dashboard link to use the default datasource (now prometheus-federation), since it will only allow one datasource selection at a time. Alternatively, we can also change the dashboard to allow multiple datasources at the same time (e.g., prometheus-federation and data-processing).
Ah, okay that's probably good. I was thinking of https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#templating -- but I don't think we have the right label in the alert to refer to and place in the dashboard url... that's an aspiration for the future.
This change is