Add documentation for the monitoring stack #198

Merged
merged 5 commits into from
Oct 21, 2024
4 changes: 4 additions & 0 deletions docs/src/installation/monitoring-stack.svg
97 changes: 96 additions & 1 deletion docs/src/installation/monitoring.md
@@ -1,3 +1,98 @@
# Monitoring the metal-stack

We are currently working on providing the sources of our monitoring deployment for public usage. Please come back later.
## Overview

![Monitoring Stack](monitoring-stack.svg)

## Logging

Logs are collected by
[Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) and pushed
to a [Loki](https://grafana.com/docs/loki/latest/) instance running in the
control plane. Loki is deployed in
[monolithic mode](https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/)
with the `filesystem` storage type. You can find all logging-related
configuration parameters for the control plane in the control plane's
[logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md)
role.

In the partitions, Promtail is deployed inside a systemd-managed Docker
container. Configuration parameters can be found in the partition's
[promtail](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/promtail/README.md)
role. The hosts that Promtail collects logs from are configured via the
`prometheus_promtail_targets` variable.

## Monitoring

For monitoring we deploy the
[kube-prometheus-stack](https://github.com/prometheus-operator/kube-prometheus)
and a [Thanos](https://thanos.io/tip/thanos/getting-started.md/) instance in the
control plane. Metrics for the control plane are supplied by

- `metal-metrics-exporter`
- `rethinkdb-exporter`
- `event-exporter`
- `gardener-metrics-exporter`

To query and visualize logs, metrics and alerts, we deploy several Grafana
dashboards to the control plane:

- `grafana-dashboard-alertmanager`
- `grafana-dashboard-machine-capacity`
- `grafana-dashboard-metal-api`
- `grafana-dashboard-rethinkdb`
- `grafana-dashboard-sonic-exporter`

and also some Gardener-related dashboards:

- `grafana-dashboard-gardener-overview`
- `grafana-dashboard-shoot-cluster`
- `grafana-dashboard-shoot-customizations`
- `grafana-dashboard-shoot-details`
- `grafana-dashboard-shoot-states`

The following `ServiceMonitors` are also deployed:

- `gardener-metrics-exporter`
- `ipam-db`
- `masterdata-api`
- `masterdata-db`
- `metal-api`
- `metal-db`
- `rethinkdb-exporter`
- `metal-metrics-exporter`
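
As an illustration of what such a resource looks like, here is a minimal `ServiceMonitor` sketch using the Prometheus Operator's `monitoring.coreos.com/v1` API. It is not taken from the monitoring role: the namespaces, labels and port name are assumptions chosen for the example, so check the role for the manifests that are actually deployed.

```yaml
# Hypothetical ServiceMonitor sketch; namespaces, labels and the port name
# are assumptions for illustration, not values from the monitoring role.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: metal-api
  namespace: monitoring              # assumed namespace of the monitoring stack
  labels:
    release: kube-prometheus-stack   # assumed label the Prometheus instance selects on
spec:
  namespaceSelector:
    matchNames:
      - metal-control-plane          # assumed namespace of the target Service
  selector:
    matchLabels:
      app: metal-api                 # assumed label on the target Service
  endpoints:
    - port: metrics                  # assumed named port on the Service
      path: /metrics
      interval: 30s
```

The Prometheus Operator turns each `ServiceMonitor` into a scrape configuration for every `Service` matched by `selector` in the namespaces listed under `namespaceSelector`.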

All monitoring-related configuration parameters for the control plane can be
found in the control plane's
[monitoring](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/monitoring/README.md)
role.

Partition metrics are supplied by

- `node-exporter`
- `blackbox-exporter`
- `ipmi-exporter`
- `sonic-exporter`
- `metal-core`
- `frr-exporter`

and scraped by Prometheus. For each of these exporters, the target hosts can be
defined by

- `prometheus_node_exporter_targets`
- `prometheus_blackbox_exporter_targets`
- `prometheus_frr_exporter_targets`
- `prometheus_sonic_exporter_targets`
- `prometheus_metal_core_targets`

## Alerting

In addition to Grafana, alerts can optionally be sent to a
[Slack](https://slack.com/) channel. For this to work, both a valid
`monitoring_slack_api_url` and a `monitoring_slack_notification_channel` must be
specified. For further configuration parameters, refer to the
[monitoring](https://github.com/metal-stack/metal-roles/tree/master/control-plane/roles/monitoring)
role. Alerting rules are defined in the
[rules](https://github.com/metal-stack/metal-roles/tree/master/partition/roles/monitoring/prometheus/files/rules)
directory of the partition's Prometheus role.
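
A minimal sketch of the two Slack variables named above, as they might be set in the Ansible variables for the control plane. The values are placeholders, and the assumption that both are plain strings should be checked against the monitoring role's README.

```yaml
# Placeholder values; replace them with your own webhook URL and channel.
# Both variables are assumed to be plain strings.
monitoring_slack_api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
monitoring_slack_notification_channel: "#metal-stack-alerts"
```

The webhook URL grants permission to post to the channel, so it is usually treated as a secret and kept out of plain-text inventory files.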