Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: add build duration metric #2185

Closed
wants to merge 1 commit into from

Conversation

jsternberg
Copy link
Collaborator

@jsternberg jsternberg commented Jan 11, 2024

This adds a build duration metric with attributes related to the build
id, build ref, driver name, and the resulting status of the build.

This also modifies some aspects of how the resource is configured by
including service.instance.id. This id is used to uniquely identify
the CLI invocation for use in downstream aggregators. This allows
downstream aggregators to know that the metric for an instance has not
reset and that it is a unique invocation for future aggregation. Some
work will have to be done in the aggregator to prevent the storage of
this instance id as it can cause issues with cardinality limitations,
but it's necessary for the initial reporter to include this.

The temporality selector is still the same, but has been removed from
the otlp exporter options since that one doesn't seem to do anything. I
also recently learned the temporality didn't do what I thought it did.
I thought it would allow for multiple different invocations to be
treated as the same instance for the purposes of aggregation. It does
not work this way as the metric sdk considers each of these a gap reset.
That's the reason the instance id was added. This makes the difference
between cumulative and delta mostly cosmetic, but delta temporality has
some benefits for downstream consumers for aggregation so it's likely
good to keep.

The cli count has been removed as it doesn't contain anything new and
wasn't a useful metric. The naming of metrics has also been changed from
using docker as a prefix and instead relying on the instrumentation
name.

@jsternberg jsternberg force-pushed the build-duration-metric branch 2 times, most recently from 7ec0de2 to 64af9d4 Compare January 11, 2024 19:30
@jsternberg jsternberg marked this pull request as draft January 11, 2024 22:45
@jsternberg jsternberg force-pushed the build-duration-metric branch from 64af9d4 to 6a79645 Compare January 11, 2024 23:00
@crazy-max crazy-max added this to the v0.13.0 milestone Jan 12, 2024
@jsternberg jsternberg force-pushed the build-duration-metric branch from 6a79645 to a1e7a1d Compare January 22, 2024 19:34
@jsternberg jsternberg requested a review from tonistiigi January 22, 2024 19:35
@jsternberg jsternberg marked this pull request as ready for review January 22, 2024 19:35
@jsternberg jsternberg force-pushed the build-duration-metric branch from a1e7a1d to 561b4d8 Compare January 24, 2024 20:18
type RecordFunc func(ctx context.Context, opts ...metric.RecordOption)

// Measure is a utility for measuring a time duration with certain attributes.
func Measure(ctx context.Context, name, desc string, opts ...metric.RecordOption) RecordFunc {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Measure" is too generic in here. Smth like StartTimer or MeasureDuration

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't end up finding this function useful on future changes to include more metrics so I've removed this function entirely.


// Measure is a utility for measuring a time duration with certain attributes.
func Measure(ctx context.Context, name, desc string, opts ...metric.RecordOption) RecordFunc {
histogram, err := Meter(ctx).Int64Histogram(name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand how it makes sense to call this histogram. A histogram stores distribution of values. This can only ever be called with one value through the lifetime of the process.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly just how these things get represented in OTEL. The metric that's being stored is a histogram, but the histogram only contains one point. If this histogram was aggregated with other histograms, you would get a more traditional histogram.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to a counter anyway to reduce confusion. We can go back on this while we sort stuff out but it shouldn't matter.

@@ -0,0 +1,15 @@
package env
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, no internal packages in open source.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

// The primary difference between this metric reader and a more typical
// usage is that metric reporting only happens once when ReportFunc
// is invoked.
func MeterProvider(cli command.Cli) (metric.MeterProvider, ReportFunc, error) {
func Initialize(ctx context.Context, cli command.Cli) (context.Context, ReportFunc, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous name was better. Name the function of what it returns, if needed use New.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this package a bunch and moved most of the functionality to otel/sdk/metric. This is mostly to match with the structure of otel's libraries. The otel sdk is mostly meant to configure the more generic otel interfaces so we can wrap the otel sdk usage within there for convenience.

@@ -0,0 +1,15 @@
package env

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need a separate package you can use confutil call this exp

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved.

build/build.go Outdated
@@ -1523,3 +1540,47 @@ func ReadSourcePolicy() (*spb.Policy, error) {

return &pol, nil
}

// backendAttribute is a utility to retrieve the backend attribute from a resolvedNode.
func backendAttribute(dp *resolvedNode) attribute.KeyValue {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is driver

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

build/build.go Outdated

buildID := ""
if vcs != "" || target != "" || context != "" || filename != "" {
h := xxh3.New()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use sha256 everywhere. This is not a place where performance considerations would require new dependency.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use sha256.

h.WriteString(s)
h.Write([]byte{0})
}
buildID = hex.EncodeToString(h.Sum(nil))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to add something unique to node into this mix. Eg. see tryNodeIdentifier. Otherwise, this metric can leak data with a simple plaintext attack.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to be able to correlate the same parameters between different machines. I believe the node identifier is unique between machines. Would there be a different salt parameter we could use for this purpose that would still allow us to match build identities across machines?

build/build.go Outdated

// buildIDAttribute is a utility to retrieve the build id attribute from the solve options.
// This value should be consistent between builds.
func buildIDAttribute(so *client.SolveOpt) attribute.KeyValue {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is more like "Identity" not "Identifier", so should not be called "ID".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

build/build.go Outdated
@@ -686,6 +690,14 @@ func BuildWithResultHandler(ctx context.Context, nodes []builder.Node, opt map[s
return err
}

record := metrics.Measure(ctx, "build.duration", "Measures the total build duration.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it doesn't make sense track this as "buildkit solve request" duration, per build ref. For buildx side, I would think the duration is for how long buildx command ran. What about all the other error returns that make buildx fail? There is a return in between these current start-end calls as well. We can discuss more, but I think we are interested in commands, not requests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought on this was that this method would also cover invocations of bake. I'll experiment with changing the location so it covers only buildx.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried moving the location but it becomes difficult to get the parameters we need the further you move up the stack and it ends up relying on duplication between the runControllerBuild and runBasicBuild sections if I move it.

If you have a better location where this might fit, I can try to move it. I've now linked it with a progress writer because future metrics are going to read from the progress stream more directly so I think this location makes sense in that context.

@jsternberg jsternberg force-pushed the build-duration-metric branch 2 times, most recently from c8ac9a6 to 3764dfa Compare January 29, 2024 16:18
This adds a build duration metric with attributes related to the build
id, build ref, driver name, and the resulting status of the build.

This also modifies some aspects of how the resource is configured by
including `service.instance.id`. This id is used to uniquely identify
the CLI invocation for use in downstream aggregators. This allows
downstream aggregators to know that the metric for an instance has not
reset and that it is a unique invocation for future aggregation. Some
work will have to be done in the aggregator to prevent the storage of
this instance id as it can cause issues with cardinality limitations,
but it's necessary for the initial reporter to include this.

The temporality selector is still the same, but has been removed from
the otlp exporter options since that one doesn't seem to do anything.  I
also recently learned the temporality didn't do what I thought it did.
I thought it would allow for multiple different invocations to be
treated as the same instance for the purposes of aggregation. It does
not work this way as the metric sdk considers each of these a gap reset.
That's the reason the instance id was added. This makes the difference
between cumulative and delta mostly cosmetic, but delta temporality has
some benefits for downstream consumers for aggregation so it's likely
good to keep.

The cli count has been removed as it doesn't contain anything new and
wasn't a useful metric. The naming of metrics has also been changed from
using `docker` as a prefix and instead relying on the instrumentation
name.

Signed-off-by: Jonathan A. Sternberg <[email protected]>
@jsternberg jsternberg force-pushed the build-duration-metric branch from 3764dfa to ca644b5 Compare January 29, 2024 22:08
@jsternberg
Copy link
Collaborator Author

I've split out part of this into #2224 and also did a command-level version of this type of metric in #2225.

I'll rebase this one and keep it updated in case this is the metric we want. It's also fine if we merged all of the above.

@jsternberg jsternberg marked this pull request as draft January 30, 2024 22:30
@jsternberg
Copy link
Collaborator Author

Converting this to a draft while I work on updating the PR.

@jsternberg
Copy link
Collaborator Author

Just going to close this and maybe keep the branch local. I think we're opting not to go this direction right now. The code this is based on has also changed a bit for some other PRs related to this and it would probably be best to integrate this logic with those rather than try to rebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants