Skip to content

Commit

Permalink
doc: OVN monitoring introduction (rackerlabs#392)
Browse files Browse the repository at this point in the history
JIRA:OSPC-552
  • Loading branch information
awfabian-rs authored Aug 8, 2024
1 parent 800ac84 commit 3135efc
Show file tree
Hide file tree
Showing 2 changed files with 186 additions and 0 deletions.
185 changes: 185 additions & 0 deletions docs/ovn-monitoring-introduction.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,185 @@
# Background

- This page describes OVN monitoring in _Genestack_.
- Most OVN monitoring in _Genestack_ comes from _Kube-OVN_'s model
- As the _Kube-OVN_ documentation indicates, this includes both:
- control plane information
- data plane network quality information
- _Kube-OVN_ has
[instrumention](https://prometheus.io/docs/practices/instrumentation/)
for _Promethus_
- so _Genestack_ documentation directs installing the k8s _ServiceMonitors_
so that _Promethus_ can discover these metrics.

## Links

- [_Genestack_ documentation on installing _Kube-OVN_ monitoring](./prometheus-kube-ovn.md)
- As mentioned above, this simply installs the _ServiceMonitors_ so that
_Prometheus_ in _Genestack_ can discover the metrics exported by _Kube-OVN_.
- you can see the _ServiceMonitors_ installed
[here](https://github.com/rackerlabs/genestack/tree/main/base-kustomize/prometheus-ovn)
- in particular, it has _ServiceMonitors_ for components:
- _kube-ovn-cni_
- _kube-ovn-controller_
- _kube-ovn-monitor_
- _kube-ovn-pinger_

You can see a architectural descriptions of these components
[here](https://kubeovn.github.io/docs/stable/en/reference/architecture/#core-controller-and-agent)

- [_Kube-OVN User Guide's "Monitor and Dashboard"_ section](https://kubeovn.github.io/docs/stable/en/guide/prometheus-grafana/)
- the information runs a bit sparse in the User Guide; note the reference
manual link (in the User Guide itself, and next link here below) for more
detailed information on the provided metrics.
- [_Kube-OVN Reference Manual "Metrics"_](https://kubeovn.github.io/docs/stable/en/reference/metrics/)
- This describes the monitoring metrics provided by _Kube-OVN_

# Metrics

## Viewing the metrics

In a full _Genestack_ installation, you can view Prometheus metrics:

- by using _Prometheus_' UI
- by querying _Prometheus_' HTTPS API
- by using _Grafana_
- In particular, _Kube-OVN_ provides pre-defined Grafana dashboards
installed in _Genestack_.

Going in-depth on these would go beyond the scope of this document, but sections
below provide some brief coverage.

### Prometheus' data model

_Prometheus_' data model and design-for-scale tends to make interactive
[_PromQL_](https://prometheus.io/docs/prometheus/latest/querying/basics/)
queries cumbersome. In general usage, you will find that _Prometheus_ data works
better for feeding into other tools, like the _Alertmanager_ for alerting, and
_Grafana_ for visualization.

### Prometheus UI

A full _Genestack_ installation includes the _Prometheus UI_. The _Prometheus_
UI prominently displays a search bar that takes _PromQL_ expressions.

You can easily see the available _Kube-OVN_ metrics by opening the Metrics
Explorer (click the globe icon) and typing `kube_ovn_`.

While this has some limited utility for getting a low-level view of individual
metrics, you will generally find it more useful to look at the Grafana
dashboards as described below.

As mentioned above, the _Kube-OVN_ documentation details the collected metrics
[here](https://kubeovn.github.io/docs/stable/en/reference/metrics)

### Prometheus API

You will probably need a strong understanding of the Prometheus data model and
_PromQL_ to use the _Prometheus_ API directly, and will likely find little use
for using the API interactively.

However, where you have a working `kubectl`, you can do something like the
following to use `curl` on the Prometheus API with minor adaptations for your
installation:

```
# Run kubectl proxy in the background
kubectl proxy &
# You will probably find the -g/--globoff option to curl useful to stop curl
# itself from interpreting the characters {} [] in URLs.
#
# Additionally, these characters technically require escaping in URLs, so you
# might want to use --data-urlencode
curl -sS -gG \
http://localhost:8001/api/v1/namespaces/prometheus/services/prometheus-operated:9090/proxy/api/v1/query \
--data-urlencode 'query=kube_ovn_ovn_status' | jq .
```

### Grafana

As mentioned previously, _Kube-OVN_ provides various dashboards with
information on both the control plane and the data plane network quality.

These dashboards contain a lot of information, so in many cases, you will likely
use them for troubleshooting by expanding to a large timeframe to see when
irregularities may have started occurring.

You can see the documentation from _Kube-OVN_ on these dashboards
[here](https://kubeovn.github.io/docs/stable/en/guide/prometheus-grafana/)

Some additional details on each of these dashboards follows.

#### Controller dashboard

In a typical _Genestack_ installation, you should see 3 controllers up here.
The dashboard displays the number of up controllers prominently.

The graphs show information about the `kube-ovn-controller` pods in the
`kube-system` namespace, mostly identified by their ClusterIP on the k8s
service network. You can typically see them along with the ClusterIPs that
identify them individually on the dashboard like:

```
kubectl -n kube-system get pod -l app=kube-ovn-controller -o wide
```

#### Kube-OVN-CNI dashboard

In this case, CNI refers to k8s' _container network interface_ which allows
k8s to use various plugins (such as _Kube-OVN_) for cluster networking, as
described [here](https://kubeovn.github.io/docs/stable/en/reference/architecture/#kube-ovn-cni)

Like the Controller dashboard, it displays the number of pods up prominently.
It should have 1 pod for each k8s node, so it should match the count of your
k8s nodes:

```
# tail to skip header lines
kubectl get node | tail -n +2 | wc -l
```

These metrics belong to the 'control plane' metrics, and this dashboard will
probably work well by using a large timeframe to find anomalous behavior as
previously described.

#### OVN dashboard

This dashboard displays some metrics from the OVN, like the number of logical
switches and logical ports. It shows a chassis count that should match the
number of nodes in your cluster, and a flag (or technically a count, but you
will see "1") for "OVN DB Up".

This dashboard looks useful as previously described for looking for anomalous
behavior that emerged across a timeframe.

#### OVS dashboard

This dashboard displays information from OVS from each k8s node.

It sometimes uses names, but sometimes pod IPs.

OVS activity across nodes might not necessarily have a strong correlation, so
on this dashboard, you might take particular note that you can click the node
of interest on the legend for each graph, or choose a particular instance for
all of the graphs at the top.

You may need to collate the ClusterIPs with a node, which you can do with
something like:

```
kubectl get node -n kube-system -o wide | grep 10.10.10.10
```

which will display the name of the node with the pinger pod.
[kube-ovn-pinger](https://kubeovn.github.io/docs/stable/en/reference/architecture/#kube-ovn-pinger)
collects OVS status information, so while you see the `pinger` pod when checking
the IP, the information from the dashboard may *come from* the pinger pod, but
*pertains to* OVS, *not* the pinger pod.

#### Pinger dashboard

This dashboard prominently displays the OVS up count, OVN-Controller up count,
and API server accessible count. These should all have the same number, equal to
your number of nodes.
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -215,6 +215,7 @@ nav:
- OVN troubleshooting: ovn-troubleshooting.md
- OVN traffic flow introduction: ovn-traffic-flow-intro.md
- OVN Database Backup: infrastructure-ovn-db-backup.md
- OVN Monitoring introduction: ovn-monitoring-introduction.md
- MariaDB Operations: infrastructure-mariadb-ops.md
- Third Party Tools:
- OSIE: extra-osie.md
Expand Down

0 comments on commit 3135efc

Please sign in to comment.