Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Add type and unit metadata labels #39

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
156 changes: 156 additions & 0 deletions proposals/2024-09-25_metadata-labels.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,156 @@
## Type and Unit Metadata Labels

* **Owners:**
* David Ashpole [@dashpole](https://github.com/dashpole)

* **Implementation Status:** `Not implemented`

* **Related Issues and PRs:**
* https://github.com/open-telemetry/opentelemetry-specification/issues/2497

* **Other docs or links:**
* Survey Results: https://opentelemetry.io/blog/2024/prometheus-compatibility-survey/
* Slack thread: https://cloud-native.slack.com/archives/C01AUBA4PFE/p1726399373207819
* Doc with Options: https://docs.google.com/document/d/1t4ARkyOoI4lLNdKb0ixbUz7k7Mv_eCiq7sRKHAGZ9vg
* Prometheus PoC: https://github.com/prometheus/prometheus/compare/main...dashpole:prometheus:type_and_unit_labels

This document proposes adding the metric type and unit as labels on metrics.

## Why

Per [dev-summit consensus](https://docs.google.com/document/d/1uurQCi5iVufhYHGlBZ8mJMK_freDFKPG0iYBQqJ9fvA/edit#bookmark=id.q6upqm7itl24), we would like to avoid adding type and unit suffixes to metric names when translating from OpenTelemetry to Prometheus. These suffixes are currently required by the [compatibility specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/compatibility/prometheus_and_openmetrics.md). However, OpenTelemetry currently considers the metric type and unit identifying, whereas Prometheus does not. Simply removing suffixes might result in "collisions" between distinct OpenTelemetry metrics which have the same name, but different types (less commonly) or units.
dashpole marked this conversation as resolved.
Show resolved Hide resolved

### Pitfalls of the current solution

Roughly half of OpenTelemetry users surveyed preferred keeping metric names unmodified when translating to Prometheus. With our goal of being the default choice to store OpenTelemetry metrics, we should find a way to preserve metric names without introducing potential issues for users.

## Goals

Goals and use cases for the solution as proposed in [How](#how):

* [Required] OpenTelemetry users can query for the original names of their metrics.
* [Required] Users can filter by the metric type or unit in PromQL queries.
* [Required] PromQL returns a warning when querying across metrics with different units or types.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And PromQL could finally warn reliably about "inappropriate" operations (apply rate to a gauge, delta to a counter, …). (Recently, Prometheus has added an info-level annotation for cases where rate is applied to a metric that doesn't end on _total etc., but that's error prone (gauges could have a name ending on _total, and not all counters end on _total).)

By making unit and type a label (and thus part of the series' identity) it becomes implicitly harder to accidentally query across metrics with different unit or type anyway (there aren't any "mixed series" anymore). Maybe that is something we could work into the goals as preventing the mismatch from happening in the first place is a much more attractive selling point than just stating that we can "return a warning".

Copy link
Member

@ArthurSens ArthurSens Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that's a happy side effect of the proposal, but not a requirement :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning about inappropriate operations sgtm. Added to goals.

In the context of preventing collisions from happening in the first place, i've been thinking about what best practices should be. If users write queries and always include unit and type filters in queries, it is impossible for mismatch to happen in the first place. It also makes queries more readable, as you can easily tell the type and unit of the metric when reading a query. But it does seem quite verbose, and I'm pessimistic that users would actually do this if we tell them they should.

* [Nice to have] OpenTelemetry users can grep for the original names in the text exposition.

### Audience

This document is for Prometheus server maintainers, PromQL maintainers, and anyone interested in furthering compatibility between Prometheus and OpenTelemetry.

## Non-Goals

* Prometheus will not attempt to auto-convert between units when there is a conflict.
* Prometheus will not attempt to auto-convert between types (e.g. native histogram vs float series).
* Prometheus client libraries will not allow mixing metrics with the same name, but different type or unit in the exposition format. See potential future extensions.

## User Experience

When querying for a metric, users can filter for a type or unit by specifying a filter on the `__unit__` or `__type__` labels, which use the reserved `__` prefix to ensure they do not collide with user-provided labels.

For example:

* `my_metric{}` returns all series with any type or unit, including `__type__` and `__unit__` labels.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the API I assume?

* `my_metric{__unit__="seconds", __type__="counter"}` returns only series with the specified type and unit.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both API and UI I assume?


When a query for a metric returns multiple metrics with a different `__type__` or `__unit__` label, but the same `__name__`, users see a warning in the UI.

Users don't see the `__type__` or `__unit__` labels in the Prometheus UI next to other labels, but the unit displayed next to the value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe the type next to the metric name and labels, but as a colored badge or something (cf. how it's done in the PGW, which utilizes the full exposition data model so it even has the help string, which is of course out of scope here).

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so this is UI, but what we envision in PromQL response? Should it give those labels back in JSON? If yes, in what order?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes back to TSDB API too. I assume TSDB API provide those labels as normal labels (worth to clarify?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the alternative explains it, should we mention that both API and PromQL contains those labels as-is?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the TSDB API. Can you link to it?


Users see no difference to exposition formats.
dashpole marked this conversation as resolved.
Show resolved Hide resolved

## How

### PromQL Changes

When a query for a metric returns multiple metrics with a different `__type__` or `__unit__` label, but the same `__name__`, a warning annotation will be returned with the PromQL response.
dashpole marked this conversation as resolved.
Show resolved Hide resolved

### Prometheus UI Changes

When displaying a metric's labels in the table or in the graph views, the UI will hide labels starting with `__` (double underscore) by default, similar to the current handling of `__name__`. A "Show System Labels" check-box will be added, which shows hidden labels when checked.

### Prometheus Server Ingestion

When receiving OTLP or PRW 2.0, or when scraping the text, OM, or proto formats, the type and unit of the metric are added as the `__type__` and `__unit__` labels, if not already present.
ArthurSens marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For exposition formats, I guess it means this would allow:

TYPE metric counter
UNIT metric seconds
HELP ...
metric{__type__="gauge"}

Ideally I do opposite to what's written, so TYPE and UNIT wins here, or this could be blocked even by spec. Essentially I wonder about efficiency of parsers here.

For ingestion protocols, at least it feels metadata should override it (not what is written now) so it's easier for everyone.

Any further concerns around perf of parsing here for this @bboreham?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about what would be on the /federate endpoint if the Prometheus server scraped conflicting metrics from applications. Our choices are:

  1. [current] Only #UNIT metadata, and no __unit__ labels.
  2. Only __unit__ labels, and no #UNIT metadata.
  3. [proposal] Mixed __unit__ labels, and a #UNIT with one of the units the metric has.
  • 1 will result in collisions, but maybe that is acceptable for the /federate endpoint. It isn't a regression from what we have today.
  • 2 works, but would be breaking for existing users that are consuming #UNIT.
  • 3 is confusing because of the mixing, but I wanted to leave that open as an option.

I'm OK with disallowing mixing __unit__ and #UNIT, since we could relax that in the future if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed offline, and:

  1. would epic to follow 1 and allow multiple the same named metric for different types and units in OM, let's try proposing that, it would help massively with exporters and federate
  2. type and unit from metadata is primary (in case of mix)
  3. Should we allow type and unit at all in exposition, SHOULD NOT maybe?

Copy link
Member

@ArthurSens ArthurSens Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too fond of adding __unit__ and __type__ as labels in the exposition format. I think they should stay as a single line at the top of the metric family like it is today.

The Unit/Type will be the same for all time series in a metric family. If they become labels in the exposition format, it will just be repeated information for all time series. It increases the payload, and we waste CPU time parsing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have "Users see no difference to exposition formats." stated above, but if you put the type and especially the unit in the # UNIT part then you need them in the metric name as well, defeating the purpose. You're other option is to put them into the series on each line.

Looking at /metrics is for humans and follows what you see is what you get in terms of showing what you can query - implying that if you can query by __type__ or the API/UI returns the __type__ then you should put them on each line.

And the implication of that is that there needs to be at least one more efficient exposition format that is not so verbose for machines to read.

So I'm not sure this would work without either putting the new labels on every line or modifying OpenMetrics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course the above argument doesn't stand if we follow @beorn7 's advice and not show __type__, __unit__ on the query API (/query , /query_range) , although I would love to have it on the metadata queries endpoint like (/series) for Grafana to be able to know what's a float series and what's a native histogram.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should show __type__ and __unit__ in the query APIs. I'm just saying, in the zeroth step, we should remove those labels in PromQL operations (like the name is removed).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with showing __type__ and __unit__ in query APIs, just like we show __name__.

We have "Users see no difference to exposition formats." stated above, but if you put the type and especially the unit in the # UNIT part then you need them in the metric name as well, defeating the purpose. You're other option is to put them into the series on each line.

That's a really good catch! This is something we'll probably need to tackle with OM 2.0. Is this what you had in mind with this issue @bwplotka prometheus/OpenMetrics#280?


PRW 1.0 is omitted because metadata is sent separately from timeseries, making it infeasible to add the labels at ingestion time.

### Implementation Plan

#### Milestone 1: Feature flag for adding labels

Add a feature flag: `--enable-feature=metadata-labels`. When enabled `__type__` and `__unit__` labels are added when receiving OTLP or PRW 2.0, or when scraping the text, OM, or proto formats.
dashpole marked this conversation as resolved.
Show resolved Hide resolved

#### Milestone 2: UI and PromQL changes

During this stage, implement the UI and PromQL changes above. Iterate based on feedback. Changes should have no effect unless the metadata-labels feature flag is enabled.

#### Milestone 3: Add NoNameChanges option for OTLP translation

Add an option, `NoNameChanges` for the OTLP translation strategy. When enabled, it disables UTF-8 sanitization and the addition of suffixes. In the documentation for this option, recommend that `--enable-feature=metadata-labels` is enabled.

## Alternatives

### “Real” Type and Unit suffixes in **name**
dashpole marked this conversation as resolved.
Show resolved Hide resolved

Similar to suffixes of _<unit> and _<type/total> but make it an explicit suffix using a delimiter not currently permitted in metric names. Specifying suffixes is optional when querying for a metric. When the type or unit suffix is omitted from a query, it would (design TBD) return results which include any type or unit suffix which exists for that name.

NOTE: Dot for units and slash for type is just one example, there might be better operators/characters to use.

Writing queries that include the type and unit would be recommended as a best-practice by the community.

For example:

* Querying for foo/histogram would return results that include both foo.seconds/histogram and foo.milliseconds/histogram.
* Querying for foo.seconds would return results that include both foo.seconds/histogram and foo.seconds/counter.
* Querying for http_server_duration would return results that include both foo.seconds/histogram and foo.milliseconds/counter.
* Querying for an OpenTelemetry metric, such as http.server.duration, with suffixes would require querying for ”http.server.duration”.seconds/histogram. Note that suffixes are outside of quotes.

This solution is not chosen because:

* Requires PromQL changes (intrusive), touches on “dot” operator ideas.
* Adding suffixes outside of quotes looks strange: {“http.server.duration”.seconds/histogram}
* Rolling this out would be breaking for existing Prometheus users: E.g. {foo_seconds} becomes {foo.seconds/histogram}. Could this be part of OM 2.0?
* Mitigation: users just stay with {foo_seconds.seconds/histogram}
* Users might be surprised by, or dislike the additional suffixes and delimiters in the metric name results
* Mitigation: Opt-in for query engines?

### "Hide" __type__ and __unit__ labels in PromQL, instead of UI
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how safe this is. If PromQL does not hide the return, new labels will suddenly appear in users' queries.

I'm also not sure about the potential impact other than dashboards/alerts having new labels, could that be a problem?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm definitely open to this alternative. I like having the option of showing __type__ and __unit__ in the UI, but if it has to wait for Prom 4.0 because it is breaking, that seems like a significant downside.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it could be also merge of both, where PromQL only returns those labels based on something e.g. on conflict or when type or unit is selectors.

The downside is that we now have to really deeply design groupings and joins.. should they use those labels or not. With the current proposal the answer is trivial and straightforward. This helps with creating alerts and recording rules etc

I think I would be keen to try the current approach e.g. behind feature flag and see who will use it and how to tell more. cc @juliusv @roidelapluie WDYT from UX point of view for non-UI things?


Existing UIs don't handle the `__type__` and `__unit__` labels. To mitigate this, PromQL could omit the `__type__` and `__unit__` labels from the query response. Doing this would avoid requiring UIs to update to handle the new labels.

This solution is not chosen because:

* It deviates from the current handling of the `__name__` label.
* We expect metadata, like type and unit to be useful to display in the UI, and want to enable these use-cases.
* It should be a small amount of effort to hide these labels.

## Potential Future Extensions

### __type__ and __unit__ from client libraries

One obvious extension of this proposal would be for Prometheus clients to start sending `__type__` and `__unit__` labels with the exposition format. This would:

* Allow mixing metrics with the same name, but different types and units in the same endpoint (an explicit non-goal of this proposal).
* Allow sending unit metadata in the text exposition format, which doesn't support `# UNIT` metadata.
dashpole marked this conversation as resolved.
Show resolved Hide resolved
* Counter-point: It would be easier to add `# UNIT` metadata to the text exposition format.

This is excluded from this proposal because:

* OpenTelemetry users can use [Views](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view) to resolve collisions. That should be "good enough".
* Doing this would require some changes to client libraries, which is significantly more work, and is harder to experiment with.
* There can be conflicts between `# TYPE` and `# UNIT` metadata for a metric, and `__type__` and `__unit__` labels. This adds complexity to understanding the exposition format, and requires establishing rules for dealing with conflicts.
ArthurSens marked this conversation as resolved.
Show resolved Hide resolved

### More metadata labels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I like the idea to model OTel's notion of "identifying metadata" as labels in Prometheus (to be distinguished from "non-identifying metadata", which is closer to Prometheus's original metadata idea).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately OTel resource attributes are all considered identifying today (correct me if I'm wrong David).

There is work in progress to replace resource attributes with a new thing called Entity. Entity will be able to distinguish, using otel lingo, Identifying from Descriptive attributes, but this will take some years to get popular in the industry

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTel resource attributes are all considered identifying today, but we don't treat them all as identifying when translating from OTel -> Prometheus. Instead, we just treat service.name + service.instance.id + service.namespace as identifying. That is correct if the OTLP originates from an OTel-instrumented application, but not if it wasn't (e.g. comes from an otel collector receiver).

Anyways, what I had in mind was more things that users don't think of as "labels", like schema url, or maybe scope name + version. I think resource attributes make sense as "real" labels, rather than metadata.


OpenTelemetry has lots of other metadata that may be a good fit for this "metadata label" pattern. For example, OpenTelemetry's scope name and version, or the schema URL are technically identifying for a time series, but are unlikely to be something that we want to display prominently in the UI.

If the pattern of adding `__type__` and `__unit__` works well for this metadata, we could consider making the pattern more generic:

* UIs should hide _all_ labels starting with `__` by default, not just `__name__`, `__type__`, and `__unit__`.
* Introduce a mechanism to allow OTel libraries to provide additional metadata labels. However, this has the potential to introduce collisions, since `__` has been reserved thus far. Maybe a specific allowlist (e.g. `__otel`-prefixed labels) could work.

## Action Plan

* [ ] Feature flag for adding labels
* [ ] UI and PromQL changes
* [ ] Add NoNameChanges option for OTLP translation
Loading