diff --git a/docs/collecting-application-metrics.md b/docs/collecting-application-metrics.md
index 865f5741fc..12f163067c 100644
--- a/docs/collecting-application-metrics.md
+++ b/docs/collecting-application-metrics.md
@@ -9,7 +9,7 @@ There are two major sections:

 ## Scraping metrics

-This section describes how to scrape metrics from your applications. Several scenarios has been covered:
+This section describes how to scrape metrics from your applications. The following scenarios are covered:

 - [Application metrics are exposed (one endpoint scenario)](#application-metrics-are-exposed-one-endpoint-scenario)
 - [Application metrics are exposed (multiple enpoints scenario)](#application-metrics-are-exposed-multiple-enpoints-scenario)
@@ -48,14 +48,6 @@ sumologic:
         endpoints:
           - port: ""
             path:
-            relabelings:
-              ## Sets _sumo_forward_ label to true
-              - sourceLabels: [__name__]
-                separator: ;
-                regex: (.*)
-                targetLabel: _sumo_forward_
-                replacement: "true"
-                action: replace
         namespaceSelector:
           matchNames:
             -
@@ -67,35 +59,8 @@ sumologic:

 **Note** For advanced serviceMonitor configuration, please look at the [Prometheus documentation][prometheus_service_monitors]

-> **Note** If you not set `_sumo_forward_` label you will have to configure `additionalRemoteWrite`:
->
-> ```yaml
-> kube-prometheus-stack:
->   prometheus:
->     prometheusSpec:
->       additionalRemoteWrite:
->         ## This is required to keep default configuration. It's copy of values.yaml content
->         - url: http://$(METADATA_METRICS_SVC).$(NAMESPACE).svc.cluster.local.:9888/prometheus.metrics.applications.custom
->           remoteTimeout: 5s
->           writeRelabelConfigs:
->             - action: keep
->               regex: ^true$
->               sourceLabels: [_sumo_forward_]
->             - action: labeldrop
->               regex: _sumo_forward_
->         ## This is your custom remoteWrite configuration
->         - url: http://$(METADATA_METRICS_SVC).$(NAMESPACE).svc.cluster.local.:9888/prometheus.metrics.
->           writeRelabelConfigs:
->             - action: keep
->               regex: ||...
->               sourceLabels: [__name__]
-> ```
->
-> We recommend using a regex validator, for example [https://regex101.com/]
-
 [prometheus_service_monitors]: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.ServiceMonitor
-[https://regex101.com/]: https://regex101.com/

 #### Example

@@ -169,24 +134,8 @@ sumologic:
         endpoints:
           - port: some-port
             path: /metrics
-            relabelings:
-              ## Sets _sumo_forward_ label to true
-              - sourceLabels: [__name__]
-                separator: ;
-                regex: (.*)
-                targetLabel: _sumo_forward_
-                replacement: "true"
-                action: replace
           - port: another-port
             path: /custom-endpoint
-            relabelings:
-              ## Sets _sumo_forward_ label to true
-              - sourceLabels: [__name__]
-                separator: ;
-                regex: (.*)
-                targetLabel: _sumo_forward_
-                replacement: "true"
-                action: replace
         namespaceSelector:
           matchNames:
             - my-custom-app-namespace
@@ -197,8 +146,8 @@ sumologic:

 ### Application metrics are not exposed

-In case you want to scrape metrics from application which do not expose them, you can use telegraf operator. It will scrape metrics
-according to configuration and expose them on port `9273` so Prometheus will be able to scrape them.
+In case you want to scrape metrics from an application which does not expose a Prometheus endpoint, you can use telegraf operator. It will
+scrape metrics according to its configuration and expose them on port `9273` so that Prometheus can scrape them.

 For example to expose metrics from nginx Pod, you can use the following annotations:

@@ -214,10 +163,10 @@ annotations:

 `sumologic-prometheus` defines the way telegraf operator will expose the metrics. They are going to be exposed in prometheus format on port
 `9273` and `/metrics` path.

-**NOTE** If you apply annotations on Pod which is subject of other object, e.g. DaemonSet, it won't take affect. In such case, the
+**NOTE** If you apply annotations to a Pod which is owned by another object, e.g. a DaemonSet, they won't take effect. In such a case, the
 annotation should be added to Pod specification in DeamonSet template.

-After restart, the Pod should have additional `telegraf` container.
+After restart, the Pod should have an additional `telegraf` container.

 To scrape and forward exposed metrics to Sumo Logic, please follow one of the following scenarios:

@@ -369,7 +318,6 @@ If you do not see your metrics in Sumo Logic, please check the following stages:
   - [Pod is visible in Prometheus targets](#pod-is-visible-in-prometheus-targets)
   - [There is no target for serviceMonitor](#there-is-no-target-for-servicemonitor)
   - [Pod is not visible in target for custom serviceMonitor](#pod-is-not-visible-in-target-for-custom-servicemonitor)
-  - [Check if Prometheus knows how to send metrics to Sumo Logic](#check-if-prometheus-knows-how-to-send-metrics-to-sumo-logic)

 ### Check if metrics are in Prometheus

@@ -514,34 +462,3 @@ $ kubectl -n "${NAMESPACE}" describe prometheus

 If you don't see Pod you are expecting to see for your serviceMonitor, but serviceMonitor is in the Prometheus targets, please verify if
 `selector` and `namespaceSelector` in `additionalServiceMonitors` configuration are matching your Pod's namespace and labels.
-
-### Check if Prometheus knows how to send metrics to Sumo Logic
-
-If metrics are visible in Prometheus, but you cannot see them in Sumo Logic, please check if Prometheus knows how to send it to Sumo Logic
-Metatada StatefulSet.
-
-Go to the [http://localhost:8000/config](http://localhost:8000/config) and verify if your metric definition is added to any `remote_write`
-section. It most likely will be covered by:
-
-```yaml
-- url: http://collection-sumologic-remote-write-proxy.sumologic.svc.cluster.local.:9888/prometheus.metrics.applications.custom
-  remote_timeout: 5s
-  write_relabel_configs:
-    - source_labels: [_sumo_forward_]
-      separator: ;
-      regex: ^true$
-      replacement: $1
-      action: keep
-    - separator: ;
-      regex: _sumo_forward_
-      replacement: $1
-      action: labeldrop
-```
-
-If there is no `remote_write` for your metric definition, you can add one using `additionalRemoteWrite` what has been described in
-[Application metrics are exposed (multiple enpoints scenario)](#application-metrics-are-exposed-multiple-enpoints-scenario) section.
-
-However if you can see `remote_write` which matches your metrics and metrics are in Prometheus, we recommend to look at the Prometheus,
-Prometheus Operator and OpenTelemetry Metrics Collector Pod logs.
-
-If the issue won't be solved, please create an issue or contact with our Customer Support.
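As a reference for that last check, the serviceMonitor's `selector` and `namespaceSelector` have to mirror the labels and namespace of the Pods you expect to be scraped. A minimal sketch of such an entry is shown below; the `my-custom-app` name, label, namespace and port are placeholders rather than values taken from this chart:

```yaml
sumologic:
  metrics:
    serviceMonitors:
      - name: my-custom-app
        endpoints:
          - port: some-port
            path: /metrics
        selector:
          matchLabels:
            app: my-custom-app ## must match the labels set on the Pods
        namespaceSelector:
          matchNames:
            - my-custom-app-namespace ## must match the namespace the Pods run in
```

If either the labels or the namespace do not match, Prometheus will not create any targets for that serviceMonitor.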
diff --git a/docs/collecting-kubernetes-metrics.md b/docs/collecting-kubernetes-metrics.md
index 4c2f63ebee..0e61be56ad 100644
--- a/docs/collecting-kubernetes-metrics.md
+++ b/docs/collecting-kubernetes-metrics.md
@@ -11,11 +11,8 @@ By default, we collect selected metrics from the following Kubernetes components
 - `Kube State Metrics` configured with `kube-prometheus-stack.kube-state-metrics.prometheus.monitor`
 - `Prometheus Node Exporter` configured with `kube-prometheus-stack.prometheus-node-exporter.prometheus.monitor`

-If you want to forward additional metric from one of these services, you need to make two configuration changes:
-
-- edit corresponding Service Monitor configuration. Service Monitor tells Prometheus which metrics it should take from the service
-- ensure that the new metric is forwarded to metadata Pod, by adding new (or editing existing) Remote Write to
-  `kube-prometheus-stack.prometheus.prometheusSpec.additionalRemoteWrite`
+If you want to forward an additional metric from one of these services, you need to edit the corresponding Service Monitor definition. The
+Service Monitor tells Prometheus which metrics it should scrape from the service.

 ## Example

@@ -23,8 +20,7 @@ Let's consider the following example:

 In addition to all metrics we send by default from CAdvisor you also want to forward `container_blkio_device_usage_total`.

-You need to modify `kube-prometheus-stack.kubelet.serviceMonitor.cAdvisorMetricRelabelings` to include `container_blkio_device_usage_total`
-in regex, and also to add `container_blkio_device_usage_total` to `kube-prometheus-stack.prometheus.prometheusSpec.additionalRemoteWrite`.
+You need to modify `kube-prometheus-stack.kubelet.serviceMonitor.cAdvisorMetricRelabelings` to include `container_blkio_device_usage_total`.

 ```yaml
 kube-prometheus-stack:
@@ -42,24 +38,6 @@ kube-prometheus-stack:
           regex: POD
         - action: labeldrop
           regex: (id|name)
-  prometheus:
-    prometheusSpec:
-      additionalRemoteWrite:
-        ## This is required to keep default configuration. It's copy of values.yaml content
-        - url: http://$(METADATA_METRICS_SVC).$(NAMESPACE).svc.cluster.local.:9888/prometheus.metrics.applications.custom
-          remoteTimeout: 5s
-          writeRelabelConfigs:
-            - action: keep
-              regex: ^true$
-              sourceLabels: [_sumo_forward_]
-            - action: labeldrop
-              regex: _sumo_forward_
-        ## This is your custom remoteWrite configuration
-        - url: http://$(METADATA_METRICS_SVC).$(NAMESPACE).svc.cluster.local.:9888/prometheus.metrics.custom_kubernetes_metrics
-          writeRelabelConfigs:
-            - action: keep
-              regex: container_blkio_device_usage_total
-              sourceLabels: [__name__]
 ```

 **Note:** You can use the method described in
diff --git a/docs/prometheus.md b/docs/prometheus.md
index c0ccbf825b..9dfd1ac3ba 100644
--- a/docs/prometheus.md
+++ b/docs/prometheus.md
@@ -2,29 +2,28 @@

 Prometheus is crucial part of the metrics pipeline. It is also a complicated and powerful tool. In Kubernetes specifically, it's also often
 managed by Prometheus Operator and a set of custom resources. It's possible that you already have some part of the K8s Prometheus stack
-already installed, and would like to make use of it. This document describes how to deal with all the possible cases.
+installed, and would like to make use of it. This document describes how to deal with all the possible cases.

 **NOTE:** In this document we assume that `${NAMESPACE}` represents namespace in which the Sumo Logic Kubernetes Collection is going to be
 installed.

-- [No Prometheus in the cluster](#no-prometheus-in-the-cluster)
-- [Prometheus Operator in the cluster](#prometheus-operator-in-the-cluster)
-  - [Custom Resource Definition compatibility](#custom-resource-definition-compatibility)
-  - [Installing Sumo Logic Prometheus Operator side by side with existing Operator](#installing-sumo-logic-prometheus-operator-side-by-side-with-existing-operator)
-  - [Set Sumo Logic Prometheus Operator to observe installation namespace](#set-sumo-logic-prometheus-operator-to-observe-installation-namespace)
-  - [Using existing Operator to create Sumo Logic Prometheus instance](#using-existing-operator-to-create-sumo-logic-prometheus-instance)
-  - [Disable Sumo Logic Prometheus Operator](#disable-sumo-logic-prometheus-operator)
-  - [Prepare Sumo Logic Configuration to work with existing Operator](#prepare-sumo-logic-configuration-to-work-with-existing-operator)
-  - [Using existing Kube Prometheus Stack](#using-existing-kube-prometheus-stack)
-  - [Build Prometheus Configuration](#build-prometheus-configuration)
-- [Using a load balancing proxy for Prometheus remote write](#using-a-load-balancing-proxy-for-prometheus-remote-write)
-- [Horizontal Scaling (Sharding)](#horizontal-scaling-sharding)
-- [Troubleshooting](#troubleshooting)
-  - [UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com"](#upgrade-failed-failed-to-create-resource-internal-error-occurred-failed-calling-webhook-prometheusrulemutatemonitoringcoreoscom)
-  - [Error: unable to build kubernetes objects from release manifest: error validating "": error validating data: ValidationError(Prometheus.spec)](#error-unable-to-build-kubernetes-objects-from-release-manifest-error-validating--error-validating-data-validationerrorprometheusspec)
-
+- [Prometheus](#prometheus)
+  - [No Prometheus in the cluster](#no-prometheus-in-the-cluster)
+  - [Prometheus Operator in the cluster](#prometheus-operator-in-the-cluster)
+    - [Custom Resource Definition compatibility](#custom-resource-definition-compatibility)
+    - [Installing Sumo Logic Prometheus Operator side by side with existing Operator](#installing-sumo-logic-prometheus-operator-side-by-side-with-existing-operator)
+    - [Set Sumo Logic Prometheus Operator to observe installation namespace](#set-sumo-logic-prometheus-operator-to-observe-installation-namespace)
+    - [Using existing Operator to create Sumo Logic Prometheus instance](#using-existing-operator-to-create-sumo-logic-prometheus-instance)
+    - [Disable Sumo Logic Prometheus Operator](#disable-sumo-logic-prometheus-operator)
+    - [Prepare Sumo Logic Configuration to work with existing Operator](#prepare-sumo-logic-configuration-to-work-with-existing-operator)
+    - [Using existing Kube Prometheus Stack](#using-existing-kube-prometheus-stack)
+    - [Build Prometheus Configuration](#build-prometheus-configuration)
+  - [Horizontal Scaling (Sharding)](#horizontal-scaling-sharding)
+  - [Troubleshooting](#troubleshooting)
+    - [UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com"](#upgrade-failed-failed-to-create-resource-internal-error-occurred-failed-calling-webhook-prometheusrulemutatemonitoringcoreoscom)
+    - [Error: unable to build kubernetes objects from release manifest: error validating "": error validating data: ValidationError(Prometheus.spec)](#error-unable-to-build-kubernetes-objects-from-release-manifest-error-validating--error-validating-data-validationerrorprometheusspec)

 ## No Prometheus in the cluster

@@ -235,21 +234,19 @@ are correctly added to your Kube Prometheus Stack configuration:

 - ServiceMonitors configuration:

-  - `sumologic.metrics.ServiceMonitors` to `prometheus.additionalServiceMonitors`
+  - `sumologic.metrics.ServiceMonitors` and `sumologic.metrics.additionalServiceMonitors` to `prometheus.additionalServiceMonitors`

 - RemoteWrite configuration:

   - `kube-prometheus-stack.prometheus.prometheusSpec.remoteWrite` to `prometheus.prometheusSpec.remoteWrite` or
     `prometheus.prometheusSpec.additionalRemoteWrite`
-  - `kube-prometheus-stack.prometheus.prometheusSpec.additionalRemoteWrite` to `prometheus.prometheusSpec.remoteWrite` or
-    `prometheus.prometheusSpec.additionalRemoteWrite`

 **Note:** `kube-prometheus-stack.prometheus.prometheusSpec.remoteWrite` and
 `kube-prometheus-stack.prometheus.prometheusSpec.additionalRemoteWrite` are being use to generate list of endpoints in Metadata Pod, so
 ensure that:

 - they are always in sync with the current configuration and endpoints starts with.
-  - url is always starting with `http://$(METADATA_METRICS_SVC).$(NAMESPACE).svc.cluster.local.:9888`
+  - url always starts with `http://$(METADATA_METRICS_SVC).$(NAMESPACE).svc.cluster.local.:9888`

 Alternatively, you can list endpoints in `metadata.metrics.config.additionalEndpoints`:

@@ -258,8 +255,7 @@ are correctly added to your Kube Prometheus Stack configuration:
   metrics:
     config:
       additionalEndpoints:
-        - /prometheus.metrics.state
-        - /prometheus.metrics.controller-manager
+        - /prometheus.metrics
         # - ...
 ```

@@ -285,7 +281,7 @@ are correctly added to your Kube Prometheus Stack configuration:
         value: ${METADATA}
   ```

-  where `${METADATA}` is content of `metadataMetrics` key from `sumologic-configmap` Config Map within `${NAMESPACE}`:
+  where `${METADATA}` is the content of the `metadataMetrics` key from the `sumologic-configmap` ConfigMap within `${NAMESPACE}`:

   ```yaml
   apiVersion: v1
@@ -338,20 +334,10 @@ prometheus:
       # values copied from kube-prometheus-stack.prometheus.prometheusSpec.containers
     additionalRemoteWrite:
       # values copied from kube-prometheus-stack.prometheus.prometheusSpec.remoteWrite
-      # values copied from kube-prometheus-stack.prometheus.prometheusSpec.additionalRemoteWrite
 ```

 Prometheus configuration is ready. Apply the changes on the cluster.

-## Using a load balancing proxy for Prometheus remote write
-
-In environments with a high volume of metrics (problems may start appearing around 30k samples per second), the above mitigations may not be
-sufficient. It is possible to remedy the problem by sharding Prometheus itself, but that can be complicated to set up and require manual
-intervention to scale.
-
-A simpler alternative is to put a HTTP load balancer between Prometheus and the metrics metadata Service. This is enabled in `values.yaml`
-via the `sumologic.metrics.remoteWriteProxy.enabled` key.
-
 ## Horizontal Scaling (Sharding)

 Horizontal scaling, also known as sharding, is supported by setting up a configuration parameter which allows running several prometheus
diff --git a/docs/troubleshoot-collection.md b/docs/troubleshoot-collection.md
index 167e157151..066fdcea0f 100644
--- a/docs/troubleshoot-collection.md
+++ b/docs/troubleshoot-collection.md
@@ -2,33 +2,34 @@

-- [Troubleshooting Installation](#troubleshooting-installation)
-- [Namespace configuration](#namespace-configuration)
-- [Collecting logs](#collecting-logs)
-  - [Check log throttling](#check-log-throttling)
-  - [Check ingest budget limits](#check-ingest-budget-limits)
-  - [Check if collection pods are in a healthy state](#check-if-collection-pods-are-in-a-healthy-state)
-  - [Prometheus Logs](#prometheus-logs)
-  - [OpenTelemetry Logs Collector is being CPU throttled](#opentelemetry-logs-collector-is-being-cpu-throttled)
-- [Collecting metrics](#collecting-metrics)
-  - [Check the `/metrics` endpoint](#check-the-metrics-endpoint)
-  - [Check the `/metrics` endpoint for Kubernetes services](#check-the-metrics-endpoint-for-kubernetes-services)
-  - [Check the Prometheus UI](#check-the-prometheus-ui)
-  - [Check Prometheus Remote Storage](#check-prometheus-remote-storage)
-- [Common Issues](#common-issues)
-  - [Missing metrics - cannot see cluster in Explore](#missing-metrics---cannot-see-cluster-in-explore)
-  - [Pod stuck in `ContainerCreating` state](#pod-stuck-in-containercreating-state)
-  - [Missing `kubelet` metrics](#missing-kubelet-metrics)
-    - [1. Enable the `authenticationTokenWebhook` flag in the cluster](#1-enable-the-authenticationtokenwebhook-flag-in-the-cluster)
-    - [2. Disable the `kubelet.serviceMonitor.https` flag in Kube Prometheus Stack](#2-disable-the-kubeletservicemonitorhttps-flag-in-kube-prometheus-stack)
-  - [Missing `kube-controller-manager` or `kube-scheduler` metrics](#missing-kube-controller-manager-or-kube-scheduler-metrics)
-  - [Prometheus stuck in `Terminating` state after running `helm del collection`](#prometheus-stuck-in-terminating-state-after-running-helm-del-collection)
-  - [Rancher](#rancher)
-  - [Falco and Google Kubernetes Engine (GKE)](#falco-and-google-kubernetes-engine-gke)
-  - [Falco and OpenShift](#falco-and-openshift)
-  - [Out of memory (OOM) failures for Prometheus Pod](#out-of-memory-oom-failures-for-prometheus-pod)
-  - [Prometheus: server returned HTTP status 404 Not Found: 404 page not found](#prometheus-server-returned-http-status-404-not-found-404-page-not-found)
-  - [OpenTelemetry: dial tcp: lookup collection-sumologic-metadata-logs.sumologic.svc.cluster.local.: device or resource busy](#opentelemetry-dial-tcp-lookup-collection-sumologic-metadata-logssumologicsvcclusterlocal-device-or-resource-busy)
+- [Troubleshooting Collection](#troubleshooting-collection)
+  - [Troubleshooting Installation](#troubleshooting-installation)
+  - [Namespace configuration](#namespace-configuration)
+  - [Collecting logs](#collecting-logs)
+    - [Check log throttling](#check-log-throttling)
+    - [Check ingest budget limits](#check-ingest-budget-limits)
+    - [Check if collection pods are in a healthy state](#check-if-collection-pods-are-in-a-healthy-state)
+    - [Prometheus Logs](#prometheus-logs)
+    - [OpenTelemetry Logs Collector is being CPU throttled](#opentelemetry-logs-collector-is-being-cpu-throttled)
+  - [Collecting metrics](#collecting-metrics)
+    - [Check the `/metrics` endpoint](#check-the-metrics-endpoint)
+    - [Check the `/metrics` endpoint for Kubernetes services](#check-the-metrics-endpoint-for-kubernetes-services)
+    - [Check the Prometheus UI](#check-the-prometheus-ui)
+    - [Check Prometheus Remote Storage](#check-prometheus-remote-storage)
+  - [Common Issues](#common-issues)
+    - [Missing metrics - cannot see cluster in Explore](#missing-metrics---cannot-see-cluster-in-explore)
+    - [Pod stuck in `ContainerCreating` state](#pod-stuck-in-containercreating-state)
+    - [Missing `kubelet` metrics](#missing-kubelet-metrics)
+      - [1. Enable the `authenticationTokenWebhook` flag in the cluster](#1-enable-the-authenticationtokenwebhook-flag-in-the-cluster)
+      - [2. Disable the `kubelet.serviceMonitor.https` flag in Kube Prometheus Stack](#2-disable-the-kubeletservicemonitorhttps-flag-in-kube-prometheus-stack)
+    - [Missing `kube-controller-manager` or `kube-scheduler` metrics](#missing-kube-controller-manager-or-kube-scheduler-metrics)
+    - [Prometheus stuck in `Terminating` state after running `helm del collection`](#prometheus-stuck-in-terminating-state-after-running-helm-del-collection)
+    - [Rancher](#rancher)
+    - [Falco and Google Kubernetes Engine (GKE)](#falco-and-google-kubernetes-engine-gke)
+    - [Falco and OpenShift](#falco-and-openshift)
+    - [Out of memory (OOM) failures for Prometheus Pod](#out-of-memory-oom-failures-for-prometheus-pod)
+    - [Prometheus: server returned HTTP status 404 Not Found: 404 page not found](#prometheus-server-returned-http-status-404-not-found-404-page-not-found)
+    - [OpenTelemetry: dial tcp: lookup collection-sumologic-metadata-logs.sumologic.svc.cluster.local.: device or resource busy](#opentelemetry-dial-tcp-lookup-collection-sumologic-metadata-logssumologicsvcclusterlocal-device-or-resource-busy)

@@ -428,11 +429,11 @@ kube-prometheus-stack:
   prometheus:
     prometheusSpec:
       remoteWrite:
-        - url: http://$(METADATA_METRICS_SVC).$(NAMESPACE).svc.cluster.local.:9888/prometheus.metrics.state
+        - url: http://$(METADATA_METRICS_SVC).$(NAMESPACE).svc.cluster.local.:9888/prometheus.metrics
         ...
 ```

-Alternatively you can add `/prometheus.metrics.kubelet` to `metadata.metrics.config.additionalEndpoints`
+Alternatively, you can add `/prometheus.metrics` to `metadata.metrics.config.additionalEndpoints`.

 Please see the following example:
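A minimal sketch of that alternative, reusing the `additionalEndpoints` structure shown for `docs/prometheus.md` above, could look like this; the `/prometheus.metrics` path is only an illustration and should stay in sync with the path used in the corresponding `remoteWrite` URL (everything after the `:9888` port):

```yaml
metadata:
  metrics:
    config:
      additionalEndpoints:
        ## must match the path portion of the remoteWrite url above
        - /prometheus.metrics
```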