Docs: Adding observability documentation #592

Merged
merged 1 commit into from
Nov 26, 2024
Docs: Adding observability documentation
the2hill committed Nov 25, 2024
commit be42fcaa33c98dc85cc1e50780330aae521b86ea
Binary file added docs/assets/images/project-lookup-example.png
11 changes: 7 additions & 4 deletions docs/monitoring-info.md
Original file line number Diff line number Diff line change
@@ -25,7 +25,6 @@ To read more about Prometheus from the official source take a look at [Prometheu

To install Prometheus within the Genestack workflow we make use of the helm charts found in the kube-prometheus-stack.


## The Kube Prometheus Stack

Genestack takes full advantage of [Helm](https://helm.sh/) and [Kustomize manifests](https://kustomize.io/) to build a production grade Kubernetes and OpenStack Cloud.
@@ -159,16 +158,20 @@ Currently, in Genestack the textfile-collector is used to collect kernel-taint s
This is currently the complete list of exporters and monitoring callouts deployed within the Genestack workflow. That said, Genestack is constantly evolving and the list may grow or change entirely as we look to further improve our systems!
With all these metrics available, we need a way to visualize them to get a better picture of our systems and their health. We'll discuss that next!


## Visualization

In Genestack we deploy [Grafana](https://grafana.com/) as our default visualization tool. Grafana is open-source tooling which aligns well with the Genestack ethos while providing seamless visualization of the metrics generated by our systems.
Grafana also plays well with Prometheus and [Loki](infrastructure-loki.md), the default logging tooling deployed in Genestack's workflow, with various datasource plugins making integration a breeze.
Installing [Grafana](grafana.md) within Genestack is fairly straightforward; just follow the [Grafana Deployment Doc](grafana.md).

As things stand now, the Grafana deployment does not deploy dashboards as part of the default deployment instructions. However, there are dashboards available found in the [etc directory](https://github.com/rackerlabs/genestack/tree/main/etc/grafana-dashboards) of the Genestack repo that can be installed manually.
The dashboards available cover just about every exporter/metric noted here and then some. Some of the dashboards may not be complete or may not provide the desired view. Please feel free to adjust them as needed and submit a PR to [Genestack repo](https://github.com/rackerlabs/genestack) if they may help others!
The installation also takes care of installing the primary [datasources](https://grafana.com/docs/grafana/latest/datasources/) which are Grafana plugins used for querying specific datasets.
For example in Genestack we install the [Prometheus and Loki datasources](https://github.com/rackerlabs/genestack/blob/8b90d22b19795acd364afb05f08617c326f6c8f6/base-helm-configs/grafana/datasources.yaml#L4) as part of Genestack's default workflow.
You can manually add additional datasources by following the [add datasource](https://grafana.com/docs/grafana/latest/datasources/#add-a-data-source) documentation.
More information about the primary datasources can be found in the [Prometheus datasource](https://grafana.com/docs/grafana/latest/datasources/prometheus/) and [Loki datasource](https://grafana.com/docs/grafana/latest/datasources/loki/) documentation.

As things stand now, the Grafana deployment does not deploy dashboards as part of the default deployment instructions. However, dashboards are available in the [etc directory](https://github.com/rackerlabs/genestack/tree/main/etc/grafana-dashboards) of the Genestack repo and can be installed manually by importing them into Grafana.
View the [importing dashboards](https://grafana.com/docs/grafana/latest/dashboards/build-dashboards/import-dashboards/) documentation for more information.
The dashboards available cover just about every exporter/metric noted here and then some. Some of the dashboards may not be complete or may not provide the desired view. Please feel free to adjust them as needed and submit a PR to [Genestack repo](https://github.com/rackerlabs/genestack) if they may help others!
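
Importing can also be scripted rather than done through the UI. The following is a minimal sketch, assuming Grafana's HTTP API endpoint `POST /api/dashboards/db` is used for the import; the dashboard contents and the `build_import_payload` helper are hypothetical and only illustrate the shape of the request body, not how Genestack itself installs dashboards.

```python
import json


def build_import_payload(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap an exported dashboard in the JSON body sent to POST /api/dashboards/db."""
    prepared = dict(dashboard)  # shallow copy so the caller's dict is untouched
    # A null id asks Grafana to create the dashboard instead of updating by numeric id.
    prepared["id"] = None
    return {"dashboard": prepared, "overwrite": overwrite}


# Hypothetical, trimmed-down dashboard; the real files live in etc/grafana-dashboards.
example_dashboard = {
    "id": 42,
    "uid": "project-lookup",
    "title": "Project Lookup",
    "panels": [],
}

payload = build_import_payload(example_dashboard)
print(json.dumps(payload, indent=2))
```

The printed JSON would be sent as the request body with an API token or service account credentials. Authentication and the Grafana URL are left out here because they depend on your deployment.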

## Next Steps

105 changes: 105 additions & 0 deletions docs/observability-info.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
# Genestack Observability

Genestack is made up of a vast array of components working away to provide a Kubernetes and OpenStack cloud infrastructure
to serve our needs. Here we'll discuss in a bit more detail how we observe and visualize our Genestack operations.

## Overview

In this document we'll dive a bit deeper into Genestack observability by exploring the tooling deployed as part of the Genestack workflow that helps us monitor, alert, log and visualize metrics of our Genestack environment.

Observability is often described as the ability to gather data about complex systems via monitoring, logging and performance metrics to better understand the state of the system as a whole.
In modern systems, especially cloud computing, where many components and services are distributed across clusters and even regions, observability plays a crucial role in maintaining the performance, reliability and even security of your systems.
With a robust observability platform, complex systems become manageable, and stakeholders gain the tools needed to forecast potential issues before they arise, discover and resolve the root cause of problems that do arise, and better analyze the health and growth of their environments.

Observability components used in Genestack that we'll discuss a bit further are as follows:

* Fluentbit and Loki
    * Log collection and aggregation
* Prometheus
    * Systems monitoring
* Alert Manager
    * Alert aggregator and notification router
* Grafana
    * Visualization platform

## Logging

Logging is key to better understanding the health and performance of your systems. Logging gives insights into system events, errors and even security concerns.
Logging in Genestack, as part of its default workflow, is handled by [Fluentbit](https://fluentbit.io/) and [Loki](https://grafana.com/oss/loki/), which take care of collection, processing and aggregation of Genestack's system and service logs.
You can view the [Fluentbit](https://github.com/rackerlabs/genestack/tree/main/base-helm-configs/fluentbit) and [Loki](infrastructure-loki.md) installation docs to get an idea of how we're deploying it in the Genestack infrastructure.
You can view their source code at [Fluentbit Github](https://github.com/fluent/fluent-bit) and [Loki Github](https://github.com/grafana/loki/tree/main).

Fluentbit is deployed as the log collector, processor and forwarder. It is configured and deployed across the system to gather logs from Kubernetes pods and the various OpenStack services.
Once the logs are collected and processed, Fluentbit forwards them to Loki for consumption. Loki then takes care of long-term storage as well as log accessibility.
Loki can tag and label the logs for easy lookup using Loki [LogQL](https://grafana.com/docs/loki/latest/query/).

The logs can be queried via a [logcli](https://grafana.com/docs/loki/latest/query/logcli/) command line tool for lookups or as a [Loki Datasource](https://grafana.com/docs/grafana/latest/datasources/loki/) in Grafana.

An example from our [project lookup](https://github.com/rackerlabs/genestack/blob/main/etc/grafana-dashboards/project_lookup.json) dashboard, which queries the logs for a specific service and project_id, looks like this:

!!! example "Example LogQL lookup query"

    ```shell
    {application="$service"} | logfmt | json | line_format "{{ .kubernetes_host}} {{.kubernetes_pod_name}} {{.log}}" |= `$project_id`
    ```

We can do something similar using the [logcli](https://grafana.com/docs/loki/latest/query/logcli/).

!!! example "Example logcli lookup query"

    ```shell
    logcli-parallel --since=15m '{application=~"nova|placement"} |~ `<my-project-id-here>`' | jq -r '.log'
    ```

You can view more information about logging in Genestack at the [Logging Overview](genestack-logging.md) documentation page.
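
The same kind of lookup can be issued programmatically. The following is a minimal sketch, assuming Loki's HTTP API endpoint `/loki/api/v1/query_range` is reachable. The `LOKI_URL` value, the helper names and the project id are placeholders rather than Genestack defaults. It only builds the LogQL query and request URL and does not contact a server.

```python
from urllib.parse import urlencode

# Assumed Loki address; substitute the gateway or service URL of your deployment.
LOKI_URL = "http://loki.example.local:3100"


def build_project_query(services: list[str], project_id: str) -> str:
    """Build a LogQL query for lines from the given services that contain project_id."""
    selector = "|".join(services)
    return f'{{application=~"{selector}"}} |= `{project_id}`'


def build_query_range_url(query: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a query_range URL; start and end are Unix timestamps in nanoseconds."""
    params = {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"


query = build_project_query(["nova", "placement"], "my-project-id")
# A 15 minute window, mirroring --since=15m in the logcli example.
url = build_query_range_url(query, 1732500000000000000, 1732500900000000000)
print(query)
print(url)
```

Keeping query construction separate from the HTTP call makes it easy to reuse the same LogQL string in Grafana, logcli or a script.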

## Monitoring and Alerting with Prometheus

Monitoring and alerting are two crucial components for observability within the Genestack infrastructure.
By default, in Genestack we make use of Prometheus, an open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.
Alongside it we use the AlertManager, a tool that provides alert aggregation, grouping, deduplication and notification routing. Both are conveniently packaged together in the [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) for easy installation and configuration.
Prometheus and its related components fit the Genestack open-source ethos and integrate easily with Kubernetes and OpenStack systems and services. With straightforward installation, service discovery and configuration, Prometheus is a top-tier choice for the Genestack platform.

The below diagram shows how all these monitoring and alerting components tie together:
![Prometheus Architecture](assets/images/prometheus-architecture.png)

We have covered Prometheus, Prometheus alerting and the AlertManager in greater detail in the [Monitoring](monitoring-info.md) and [Alerting](alerting-info.md) documentation.
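
To give a feel for the query side, here is a minimal sketch that assumes Prometheus's HTTP API endpoint `/api/v1/query` and the standard `up` metric, which is 1 when a scrape succeeds and 0 when it fails. The `PROMETHEUS_URL` value, the helper names and the sample response are illustrative assumptions, not output from a Genestack environment.

```python
import json
from urllib.parse import urlencode

# Assumed Prometheus address; adjust to the service exposed in your cluster.
PROMETHEUS_URL = "http://prometheus.example.local:9090"


def build_instant_query_url(promql: str) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    params = urlencode({"query": promql})
    return f"{PROMETHEUS_URL}/api/v1/query?{params}"


def down_targets(response_body: str) -> list[str]:
    """Return the instance label of every series whose sample value is 0."""
    body = json.loads(response_body)
    if body.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    return [
        series["metric"].get("instance", "unknown")
        for series in body["data"]["result"]
        if float(series["value"][1]) == 0.0
    ]


# Hand-written sample in the shape of an instant-vector response, not real output.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "node-a:9100"}, "value": [1732500000, "1"]},
            {"metric": {"__name__": "up", "instance": "node-b:9100"}, "value": [1732500000, "0"]},
        ],
    },
})

query_url = build_instant_query_url("up")
print(query_url)
print(down_targets(sample))
```

In practice an alerting rule on `up == 0` would cover this case; the sketch only shows how the query language and the response format fit together.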

## Visualization

Now that we have the logging, monitoring, metrics and alerting portions of our observability platform mapped out, we need a way to visualize all the data they provide.
For that we use [Grafana](https://grafana.com/) as our default visualization platform in Genestack. Grafana is an open-source, feature-rich and highly pluggable visualization system that aligns well with Genestack.
Prometheus, Alertmanager and even Loki can easily plug right in and integrate with Grafana so that we can build out the visualization layer of our observability platform.

As noted in the [Prometheus Alerting](alerting-info.md) documentation we can configure alerts via Prometheus configurations and alert on any metric collected.
It's also possible to set up alerting through Grafana, see Grafana's [alerting docs](https://grafana.com/docs/grafana/latest/alerting/) for more details.

This comes in handy in the context of Loki and logs. Grafana with the [Loki datasource](https://grafana.com/docs/grafana/latest/datasources/loki/) allows us to configure alerts based on logging queries and the information returned.
One example in Genestack would be the [OVN Claimstorm alerts](ovn-alert-claim-storm.md). Below we can see an example of how this is configured.
![OVN claim storm alert](assets/images/loki-alerting-rules-example.png)

As noted above, we can also use Loki and Grafana to display logs for our services. The query below, taken from our [project lookup](https://github.com/rackerlabs/genestack/blob/main/etc/grafana-dashboards/project_lookup.json) dashboard, looks up the logs for a specific service and project_id, and the image shows the resulting view.

!!! example "Example LogQL lookup query"

    ```shell
    {application="$service"} | logfmt | json | line_format "{{ .kubernetes_host}} {{.kubernetes_pod_name}} {{.log}}" |= `$project_id`
    ```

![project lookup example](assets/images/project-lookup-example.png)

For additional information view the [Grafana](monitoring-info.md#visualization) portion of the [Monitoring Info](monitoring-info.md) documentation.

## Datadog

Fluentbit, Loki and Grafana make for a powerful combination of log collection, aggregation and visualization. While these tools and their related components are the default choice in a Genestack deployment, there are other solutions that may better suit your needs.
Genestack offers examples and basic configurations to deploy Grafana, Loki, Prometheus and so on in a self-hosted and self-maintained manner, which requires effort and cost to host and maintain the observability platform and to store the logs.
This may not always be desirable, and in such cases something like [Datadog](https://www.datadoghq.com/) may be preferred to alleviate some of the burden of hosting these solutions yourself.

[Datadog](https://www.datadoghq.com/) offers many of the features we've discussed in this documentation and much more via agents that you install and configure within your systems.

Datadog can work as a replacement or with our existing tools to form a hybrid approach for our observability platform.
There are plugins and agents that give you the flexibility you may desire. For example, the [Fluentbit plugin](https://docs.datadoghq.com/integrations/fluentbit/) can act as the log collection service instead of the [Datadog logging agents](https://docs.datadoghq.com/containers/kubernetes/log/?tab=datadogoperator).
You can also use the [Prometheus Datadog plugins](https://docs.datadoghq.com/integrations/guide/prometheus-host-collection/) to provide additional metrics as you see fit.

An example of installing Datadog in [Rackspace Flex](https://www.rackspace.com/resources/rackspace-openstack-flex) can be found in the [Running Datadog on OpenStack Flex](https://blog.rackspacecloud.com/blog/2024/11/12/running_datadog_on_openstack-flex/#deploying-datadog-on-our-openstack-flex-server) blog post.
Integrating Datadog into your Genestack installation is just as simple and can be accomplished by installing the agents that fit your goals.
View the [Datadog Kubernetes](https://docs.datadoghq.com/containers/kubernetes/installation/?tab=datadogoperator) installation instructions for more information.

While Genestack provides a relatively comprehensive set of tooling and instructions for a production grade Kubernetes and OpenStack deployment, Datadog may be the right solution if you want a less hands-on approach to your observability platform.
9 changes: 5 additions & 4 deletions mkdocs.yml
Original file line number Diff line number Diff line change
@@ -235,7 +235,6 @@ nav:
- Gnocchi: metering-gnocchi.md
- Billing Tenants: metering-billing.md
- Chargebacks: metering-chargebacks.md
- Logging Overview: genestack-logging.md
- Infrastructure:
- Kubernetes:
- Etcd Backup: etcd-backup.md
@@ -250,9 +249,11 @@ nav:
- Claim Storm alert: ovn-alert-claim-storm.md
- MariaDB:
- Operations: infrastructure-mariadb-ops.md
- Monitoring and Alerting:
- Monitoring Information: monitoring-info.md
- Alerting Information: alerting-info.md
- Observability:
- Observability Overview: observability-info.md
- Monitoring Overview: monitoring-info.md
- Alerting Overview: alerting-info.md
- Logging Overview: genestack-logging.md
- Third Party Tools:
- OSIE: extra-osie.md
- OpenStack: