metrics: add build duration metric #2185

jsternberg · 2024-01-11T16:21:26Z

This adds a build duration metric with attributes related to the build
id, build ref, driver name, and the resulting status of the build.

This also modifies some aspects of how the resource is configured by
including service.instance.id. This id is used to uniquely identify
the CLI invocation for use in downstream aggregators. This allows
downstream aggregators to know that the metric for an instance has not
reset and that it is a unique invocation for future aggregation. Some
work will have to be done in the aggregator to prevent the storage of
this instance id as it can cause issues with cardinality limitations,
but it's necessary for the initial reporter to include this.

The temporality selector is still the same, but has been removed from
the otlp exporter options since that one doesn't seem to do anything. I
also recently learned the temporality didn't do what I thought it did.
I thought it would allow for multiple different invocations to be
treated as the same instance for the purposes of aggregation. It does
not work this way as the metric sdk considers each of these a gap reset.
That's the reason the instance id was added. This makes the difference
between cumulative and delta mostly cosmetic, but delta temporality has
some benefits for downstream consumers for aggregation so it's likely
good to keep.

The cli count has been removed as it doesn't contain anything new and
wasn't a useful metric. The naming of metrics has also been changed from
using docker as a prefix and instead relying on the instrumentation
name.

tonistiigi · 2024-01-25T05:48:49Z

internal/metrics/util.go

+type RecordFunc func(ctx context.Context, opts ...metric.RecordOption)
+
+// Measure is a utility for measuring a time duration with certain attributes.
+func Measure(ctx context.Context, name, desc string, opts ...metric.RecordOption) RecordFunc {


"Measure" is too generic in here. Smth like StartTimer or MeasureDuration

I didn't end up finding this function useful on future changes to include more metrics so I've removed this function entirely.

tonistiigi · 2024-01-25T05:52:31Z

internal/metrics/util.go

+
+// Measure is a utility for measuring a time duration with certain attributes.
+func Measure(ctx context.Context, name, desc string, opts ...metric.RecordOption) RecordFunc {
+	histogram, err := Meter(ctx).Int64Histogram(name,


I do not understand how it makes sense to call this histogram. A histogram stores distribution of values. This can only ever be called with one value through the lifetime of the process.

This is mostly just how these things get represented in OTEL. The metric that's being stored is a histogram, but the histogram only contains one point. If this histogram was aggregated with other histograms, you would get a more traditional histogram.

Changed this to a counter anyway to reduce confusion. We can go back on this while we sort stuff out but it shouldn't matter.

tonistiigi · 2024-01-25T05:53:17Z

internal/env/env.go

@@ -0,0 +1,15 @@
+package env


Sorry, no internal packages in open source.

tonistiigi · 2024-01-25T05:57:45Z

internal/metrics/metrics.go

 // The primary difference between this metric reader and a more typical
 // usage is that metric reporting only happens once when ReportFunc
 // is invoked.
-func MeterProvider(cli command.Cli) (metric.MeterProvider, ReportFunc, error) {
+func Initialize(ctx context.Context, cli command.Cli) (context.Context, ReportFunc, error) {


Previous name was better. Name the function of what it returns, if needed use New.

Changed this package a bunch and moved most of the functionality to otel/sdk/metric. This is mostly to match with the structure of otel's libraries. The otel sdk is mostly meant to configure the more generic otel interfaces so we can wrap the otel sdk usage within there for convenience.

tonistiigi · 2024-01-25T06:03:21Z

internal/env/env.go

@@ -0,0 +1,15 @@
+package env
+


If you need a separate package you can use confutil call this exp

tonistiigi · 2024-01-25T06:12:58Z

build/build.go

@@ -1523,3 +1540,47 @@ func ReadSourcePolicy() (*spb.Policy, error) {

 	return &pol, nil
 }
+
+// backendAttribute is a utility to retrieve the backend attribute from a resolvedNode.
+func backendAttribute(dp *resolvedNode) attribute.KeyValue {


This is driver

tonistiigi · 2024-01-25T06:16:34Z

build/build.go

+
+	buildID := ""
+	if vcs != "" || target != "" || context != "" || filename != "" {
+		h := xxh3.New()


We use sha256 everywhere. This is not a place where performance considerations would require new dependency.

Updated to use sha256.

tonistiigi · 2024-01-25T06:20:42Z

build/build.go

+			h.WriteString(s)
+			h.Write([]byte{0})
+		}
+		buildID = hex.EncodeToString(h.Sum(nil))


I think we need to add something unique to node into this mix. Eg. see tryNodeIdentifier. Otherwise, this metric can leak data with a simple plaintext attack.

I think we need to be able to correlate the same parameters between different machines. I believe the node identifier is unique between machines. Would there be a different salt parameter we could use for this purpose that would still allow us to match build identities across machines?

tonistiigi · 2024-01-25T06:23:06Z

build/build.go

+
+// buildIDAttribute is a utility to retrieve the build id attribute from the solve options.
+// This value should be consistent between builds.
+func buildIDAttribute(so *client.SolveOpt) attribute.KeyValue {


I think this is more like "Identity" not "Identifier", so should not be called "ID".

tonistiigi · 2024-01-25T06:29:12Z

build/build.go

@@ -686,6 +690,14 @@ func BuildWithResultHandler(ctx context.Context, nodes []builder.Node, opt map[s
 						return err
 					}

+					record := metrics.Measure(ctx, "build.duration", "Measures the total build duration.",


I think it doesn't make sense track this as "buildkit solve request" duration, per build ref. For buildx side, I would think the duration is for how long buildx command ran. What about all the other error returns that make buildx fail? There is a return in between these current start-end calls as well. We can discuss more, but I think we are interested in commands, not requests.

My thought on this was that this method would also cover invocations of bake. I'll experiment with changing the location so it covers only buildx.

I tried moving the location but it becomes difficult to get the parameters we need the further you move up the stack and it ends up relying on duplication between the runControllerBuild and runBasicBuild sections if I move it.

If you have a better location where this might fit, I can try to move it. I've now linked it with a progress writer because future metrics are going to read from the progress stream more directly so I think this location makes sense in that context.

This adds a build duration metric with attributes related to the build id, build ref, driver name, and the resulting status of the build. This also modifies some aspects of how the resource is configured by including `service.instance.id`. This id is used to uniquely identify the CLI invocation for use in downstream aggregators. This allows downstream aggregators to know that the metric for an instance has not reset and that it is a unique invocation for future aggregation. Some work will have to be done in the aggregator to prevent the storage of this instance id as it can cause issues with cardinality limitations, but it's necessary for the initial reporter to include this. The temporality selector is still the same, but has been removed from the otlp exporter options since that one doesn't seem to do anything. I also recently learned the temporality didn't do what I thought it did. I thought it would allow for multiple different invocations to be treated as the same instance for the purposes of aggregation. It does not work this way as the metric sdk considers each of these a gap reset. That's the reason the instance id was added. This makes the difference between cumulative and delta mostly cosmetic, but delta temporality has some benefits for downstream consumers for aggregation so it's likely good to keep. The cli count has been removed as it doesn't contain anything new and wasn't a useful metric. The naming of metrics has also been changed from using `docker` as a prefix and instead relying on the instrumentation name. Signed-off-by: Jonathan A. Sternberg <[email protected]>

jsternberg · 2024-01-30T22:04:20Z

I've split out part of this into #2224 and also did a command-level version of this type of metric in #2225.

I'll rebase this one and keep it updated in case this is the metric we want. It's also fine if we merged all of the above.

jsternberg · 2024-01-30T22:31:05Z

Converting this to a draft while I work on updating the PR.

jsternberg · 2024-02-01T18:11:52Z

Just going to close this and maybe keep the branch local. I think we're opting not to go this direction right now. The code this is based on has also changed a bit for some other PRs related to this and it would probably be best to integrate this logic with those rather than try to rebase.

jsternberg force-pushed the build-duration-metric branch 2 times, most recently from 7ec0de2 to 64af9d4 Compare January 11, 2024 19:30

jsternberg marked this pull request as draft January 11, 2024 22:45

jsternberg force-pushed the build-duration-metric branch from 64af9d4 to 6a79645 Compare January 11, 2024 23:00

crazy-max added this to the v0.13.0 milestone Jan 12, 2024

jsternberg force-pushed the build-duration-metric branch from 6a79645 to a1e7a1d Compare January 22, 2024 19:34

jsternberg requested a review from tonistiigi January 22, 2024 19:35

jsternberg marked this pull request as ready for review January 22, 2024 19:35

jsternberg force-pushed the build-duration-metric branch from a1e7a1d to 561b4d8 Compare January 24, 2024 20:18

tonistiigi reviewed Jan 25, 2024

View reviewed changes

jsternberg force-pushed the build-duration-metric branch 2 times, most recently from c8ac9a6 to 3764dfa Compare January 29, 2024 16:18

jsternberg force-pushed the build-duration-metric branch from 3764dfa to ca644b5 Compare January 29, 2024 22:08

jsternberg marked this pull request as draft January 30, 2024 22:30

jsternberg closed this Feb 1, 2024

tonistiigi mentioned this pull request Feb 2, 2024

metrics: add build command duration metric #2225

Merged

crazy-max added the area/metrics label Jun 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

metrics: add build duration metric #2185

metrics: add build duration metric #2185

jsternberg commented Jan 11, 2024 •

edited

Loading

tonistiigi Jan 25, 2024

jsternberg Jan 26, 2024

tonistiigi Jan 25, 2024

jsternberg Jan 25, 2024

jsternberg Jan 26, 2024

tonistiigi Jan 25, 2024

jsternberg Jan 26, 2024

tonistiigi Jan 25, 2024

jsternberg Jan 26, 2024

tonistiigi Jan 25, 2024

jsternberg Jan 26, 2024

tonistiigi Jan 25, 2024

jsternberg Jan 26, 2024

tonistiigi Jan 25, 2024

jsternberg Jan 26, 2024

tonistiigi Jan 25, 2024

jsternberg Jan 26, 2024

tonistiigi Jan 25, 2024

jsternberg Jan 26, 2024

tonistiigi Jan 25, 2024

jsternberg Jan 26, 2024

jsternberg Jan 26, 2024

jsternberg commented Jan 30, 2024

jsternberg commented Jan 30, 2024

jsternberg commented Feb 1, 2024

metrics: add build duration metric #2185

metrics: add build duration metric #2185

Conversation

jsternberg commented Jan 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jsternberg commented Jan 30, 2024

jsternberg commented Jan 30, 2024

jsternberg commented Feb 1, 2024

jsternberg commented Jan 11, 2024 •

edited

Loading