diff --git a/proposals/2024-04-10-native-support-for-info-metrics-metadata.md b/proposals/2024-04-10-native-support-for-info-metrics-metadata.md
index 1ab9363..8258ba8 100644
--- a/proposals/2024-04-10-native-support-for-info-metrics-metadata.md
+++ b/proposals/2024-04-10-native-support-for-info-metrics-metadata.md
@@ -1,4 +1,4 @@
-## Native Support for Info Metrics Metadata
+# Add 1st class feature to PromQL for handling info type metrics
 
 * **Owners:**
   * Arve Knudsen [@aknuds1](https://github.com/aknuds1) [arve.knudsen@grafana.com](mailto:arve.knudsen@grafana.com)
@@ -11,39 +11,68 @@
 * **Other docs or links:**
   * [Proper support for OTEL resource attributes](https://docs.google.com/document/d/1FgHxOzCQ1Rom-PjHXsgujK8x5Xx3GTiwyG__U3Gd9Tw/edit#heading=h.unv3m5m27vuc)
   * [Special treatment of info metrics in Prometheus](https://docs.google.com/document/d/1ebhGNLs3uhdeprJCullM-ywA9iMRDg_mmnuFAQCloqY/edit#heading=h.2rmzk7oo6tu8)
+  * [Scenarios scratch pad](https://docs.google.com/document/d/1nV6N3pDfvZhmG2658huNbFSkz2rsM6SpkHabp9VVpw0/edit#heading=h.luf3yapzr29e)
 
-> This proposal collects the requirements and implementation proposals for enhancing Prometheus with native support for info metrics metadata.
+> This proposal collects the requirements and implementation proposals for adding a 1st class feature to PromQL for handling info type metrics.
 
 ## Why
 
-Currently Prometheus "forgets" which are the identifying labels of info metrics upon ingestion, even though this information is present in at least the OpenMetrics protobuf exposition format (the OpenMetrics text exposition format unfortunately lacks this capability).
-The fact that Prometheus lacks a notion of which are info metrics' identifying labels leads to certain problems:
-
+Currently, enriching Prometheus query results with corresponding labels from info metrics is challenging.
+More specifically, it requires writing advanced PromQL to join with the info metric in question.
+Take as an example querying HTTP request rates per K8s cluster and status code, while having to join with the `target_info` metric to obtain the `k8s_cluster_name` label:
+
+```promql
+sum by (k8s_cluster_name, http_status_code) (
+  rate(http_server_request_duration_seconds_count[2m])
+  * on (job, instance) group_left (k8s_cluster_name)
+  target_info
+)
+```
+
+The `target_info` metric is in fact the motivation for this proposal, as it's how Prometheus encodes OpenTelemetry (OTel for short) [resource attributes](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md).
+As a result, it's a very important info metric for those using Prometheus as an OTel backend.
+OTel resource attributes model metadata about the environment producing metrics received by the backend (e.g. Prometheus), and Prometheus persists them as labels of `target_info`.
+Typically, OTel users want to include some of these attributes (as `target_info` labels) in their query results, to correlate them with entities of theirs (e.g. K8s pods).
+
+Based on user demand, it would be preferable if Prometheus were to have better UX for enriching query results with info metrics labels, especially with OTel in mind.
+There are other problems with Prometheus' current method of including info metric labels in queries, beyond just the technical barrier:
 * Explicit knowledge of each info metric's identifying labels must be embedded in join queries for when you wish to enrich queries with data (non-identifying) labels from info metrics.
-* Complex join queries must be written in order to enrich time series with corresponding labels from info metrics.
-  This is particularly problematic in the OpenTelemetry (AKA OTel) context, since users depend on (joining with) the `target_info` info metric in order to add relevant [resource attributes](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md) back to their Prometheus metrics.
+  A certain pair of OTel resource attributes (`service.name` and `service.instance.id`) are currently assumed to be the identifying pair and mapped to `target_info`'s `job` and `instance` labels respectively, but this may become a dynamic property of the OTel model. `service.instance.id` is also in fact optional, so the `instance` label may be empty.
 * If an info metric's data (non-identifying) labels change (a situation that should become more frequent with OTel in the future, as the model will probably start allowing for non-identifying resource attribute mutations), join queries against the info metric (e.g. `target_info`) will temporarily fail due to resolving the join keys to two different metrics, until the old metric is marked stale (by default after five minutes).
 
-Especially in order to provide the best possible OTel experience, the info metric (`target_info` in the case of OTel) staleness problem needs to be solved, so users won't experience temporarily failing join queries while trying to include OTel resource attributes.
-Also, it would be much better if we could provide a simpler query experience where the user doesn't have to know how to write PromQL joins (a fairly complex matter), in order to include e.g. OTel resource attributes.
-Another possible positive outcome might be dedicated support in the Grafana UI for visualizing the resource attributes of each OTel metric.
+If Prometheus could persist info metrics' identifying labels (e.g. `job` and `instance` for `target_info`), human knowledge of the correct identifying labels may become unnecessary when "joining" with info metrics. Information about info metric identifying labels is present in at least the OpenMetrics protobuf exposition format (the OpenMetrics text exposition format unfortunately lacks this capability). It can also easily be deduced when ingesting OTLP (OTel Protocol).
+Intrinsic knowledge of info metrics' identifying labels could also help in solving the temporary conflict between old and new versions of info metrics, when data (non-identifying) labels change.
+Another possible positive outcome might be dedicated support in UIs (e.g. Grafana) for visualizing the resource attributes of OTel metrics.
 
 ### Pitfalls of the current solution
 
 Prometheus currently persists info metrics as if they were normal float samples.
 This means that knowledge of info metrics' identifying labels are lost, and you have to base yourself on convention when querying on them (for example that `target_info` should have `job` and `instance` as identifying labels).
+There's also no particular support for enriching query results with info metric labels in PromQL.
 The consequence is that you need relatively expert level PromQL knowledge to include info metric labels in your query results; as OTel grows in popularity, this becomes more and more of a problem as users will want to include certain labels from `target_info` (corresponding to OTel resource attributes).
-Without persisted info metric metadata, one can't build more user friendly abstractions (e.g. a PromQL function) for including OTel resource attributes (or other info metric labels) in query results. Neither can you build dedicated UI for OTel resource attributes (or other info metric labels).
+Without persisted info metric metadata, one can't build more user friendly abstractions (e.g. a PromQL function) for including OTel resource attributes (or other info metric labels) in query results.
+Neither can you build dedicated UI for OTel resource attributes (or other info metric labels).
 
 ## Goals
 
 Goals and use cases for the solution as proposed in [How](#how):
 
-* Persist info metrics with known identifying labels as a new info metric sample type.
-* Store for each info metric sample (of the new type) which are the identifying labels.
-* Store in the TSDB immediately that the previous version of an info metric is stale, when its data labels change.
+* Persist info metrics with labels categorized as either identifying or non-identifying.
+* Track when info metrics' set of identifying labels changes. This shouldn't be a frequent occurrence, but it should be handled.
+* Automatically treat the old version of an info metric as stale for query result enriching purposes, when its data labels change (producing a new time series, but with same identity).
 * Add TSDB API for, given a certain time series and a certain timestamp, getting data labels, potentially filtered by certain matchers, from info metrics with identifying labels in common with the time series in question.
-* Simplify inclusion of info metric labels in PromQL.
+* Simplify enriching of query results with info metric labels in PromQL, e.g. via a new function.
+
+Using the `info` function, we can simplify the previously given PromQL join example as follows:
+
+```
+sum by (k8s_cluster_name, http_status_code) (
+  info(
+    rate(http_server_request_duration_seconds_count[2m]),
+    {k8s_cluster_name=~".+"}
+  )
+)
+```
 
 ### Audience
 
@@ -53,12 +82,13 @@ Prometheus maintainers.
 
 ## How
 
-* A new info metric sample type will be introduced, where the sample value is the info metric's identifying labels.
-* The head and block indexes will be augmented with indexes of info metrics.
-* A method will be added to the TSDB API for matching info metric data labels to a time series, given a certain timestamp and potentially data label matchers - the method will use the aforementioned head and block info metric indexes.
-* Thanks to the head and block info metric indexes, the info metric staleness problem should be solved, since one can pick the latest version of the info metric for overlapping time ranges.
-* We propose simplifying the inclusion of info metric labels in PromQL through a new `info` function (TODO: describe).
+* Introduce a new info metric sample type, to track the info metric's identifying label set over time (in case it changes).
+* Augment the head and block indexes with indexes of info metrics, for easy finding of info metrics matching time series.
+* Add a method to the TSDB API for matching info metric data labels to a time series, given a certain timestamp and potentially data label matchers - the method will use the aforementioned head and block info metric indexes.
+* Simplify the inclusion of info metric labels in PromQL through a new `info` function: `info(v instant-vector[, ls label-selector])`.
+  This function will be UI for the aforementioned TSDB API.
 
+TODO:
 * Make it concise and **simple**; put diagrams; be concrete, avoid using “really”, “amazing” and “great” (:
 * How you will test and verify?
 * How you will migrate users, without downtime. How we solve incompatibilities?