-
Notifications
You must be signed in to change notification settings - Fork 489
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics: add build duration metric #2185
Conversation
7ec0de2
to
64af9d4
Compare
64af9d4
to
6a79645
Compare
6a79645
to
a1e7a1d
Compare
a1e7a1d
to
561b4d8
Compare
internal/metrics/util.go
Outdated
type RecordFunc func(ctx context.Context, opts ...metric.RecordOption) | ||
|
||
// Measure is a utility for measuring a time duration with certain attributes. | ||
func Measure(ctx context.Context, name, desc string, opts ...metric.RecordOption) RecordFunc { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Measure" is too generic in here. Smth like StartTimer
or MeasureDuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't end up finding this function useful on future changes to include more metrics so I've removed this function entirely.
internal/metrics/util.go
Outdated
|
||
// Measure is a utility for measuring a time duration with certain attributes. | ||
func Measure(ctx context.Context, name, desc string, opts ...metric.RecordOption) RecordFunc { | ||
histogram, err := Meter(ctx).Int64Histogram(name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand how it makes sense to call this histogram. A histogram stores distribution of values. This can only ever be called with one value through the lifetime of the process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly just how these things get represented in OTEL. The metric that's being stored is a histogram, but the histogram only contains one point. If this histogram was aggregated with other histograms, you would get a more traditional histogram.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to a counter anyway to reduce confusion. We can go back on this while we sort stuff out but it shouldn't matter.
internal/env/env.go
Outdated
@@ -0,0 +1,15 @@ | |||
package env |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, no internal packages in open source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
internal/metrics/metrics.go
Outdated
// The primary difference between this metric reader and a more typical | ||
// usage is that metric reporting only happens once when ReportFunc | ||
// is invoked. | ||
func MeterProvider(cli command.Cli) (metric.MeterProvider, ReportFunc, error) { | ||
func Initialize(ctx context.Context, cli command.Cli) (context.Context, ReportFunc, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous name was better. Name the function of what it returns, if needed use New
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this package a bunch and moved most of the functionality to otel/sdk/metric
. This is mostly to match with the structure of otel's libraries. The otel sdk is mostly meant to configure the more generic otel interfaces so we can wrap the otel sdk usage within there for convenience.
internal/env/env.go
Outdated
@@ -0,0 +1,15 @@ | |||
package env | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you need a separate package you can use confutil
call this exp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved.
build/build.go
Outdated
@@ -1523,3 +1540,47 @@ func ReadSourcePolicy() (*spb.Policy, error) { | |||
|
|||
return &pol, nil | |||
} | |||
|
|||
// backendAttribute is a utility to retrieve the backend attribute from a resolvedNode. | |||
func backendAttribute(dp *resolvedNode) attribute.KeyValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is driver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
build/build.go
Outdated
|
||
buildID := "" | ||
if vcs != "" || target != "" || context != "" || filename != "" { | ||
h := xxh3.New() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use sha256 everywhere. This is not a place where performance considerations would require new dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to use sha256.
h.WriteString(s) | ||
h.Write([]byte{0}) | ||
} | ||
buildID = hex.EncodeToString(h.Sum(nil)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to add something unique to node into this mix. Eg. see tryNodeIdentifier
. Otherwise, this metric can leak data with a simple plaintext attack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to be able to correlate the same parameters between different machines. I believe the node identifier is unique between machines. Would there be a different salt parameter we could use for this purpose that would still allow us to match build identities across machines?
build/build.go
Outdated
|
||
// buildIDAttribute is a utility to retrieve the build id attribute from the solve options. | ||
// This value should be consistent between builds. | ||
func buildIDAttribute(so *client.SolveOpt) attribute.KeyValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is more like "Identity" not "Identifier", so should not be called "ID".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
build/build.go
Outdated
@@ -686,6 +690,14 @@ func BuildWithResultHandler(ctx context.Context, nodes []builder.Node, opt map[s | |||
return err | |||
} | |||
|
|||
record := metrics.Measure(ctx, "build.duration", "Measures the total build duration.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it doesn't make sense track this as "buildkit solve request" duration, per build ref. For buildx side, I would think the duration is for how long buildx command ran. What about all the other error returns that make buildx fail? There is a return in between these current start-end calls as well. We can discuss more, but I think we are interested in commands, not requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought on this was that this method would also cover invocations of bake
. I'll experiment with changing the location so it covers only buildx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried moving the location but it becomes difficult to get the parameters we need the further you move up the stack and it ends up relying on duplication between the runControllerBuild
and runBasicBuild
sections if I move it.
If you have a better location where this might fit, I can try to move it. I've now linked it with a progress writer because future metrics are going to read from the progress stream more directly so I think this location makes sense in that context.
c8ac9a6
to
3764dfa
Compare
This adds a build duration metric with attributes related to the build id, build ref, driver name, and the resulting status of the build. This also modifies some aspects of how the resource is configured by including `service.instance.id`. This id is used to uniquely identify the CLI invocation for use in downstream aggregators. This allows downstream aggregators to know that the metric for an instance has not reset and that it is a unique invocation for future aggregation. Some work will have to be done in the aggregator to prevent the storage of this instance id as it can cause issues with cardinality limitations, but it's necessary for the initial reporter to include this. The temporality selector is still the same, but has been removed from the otlp exporter options since that one doesn't seem to do anything. I also recently learned the temporality didn't do what I thought it did. I thought it would allow for multiple different invocations to be treated as the same instance for the purposes of aggregation. It does not work this way as the metric sdk considers each of these a gap reset. That's the reason the instance id was added. This makes the difference between cumulative and delta mostly cosmetic, but delta temporality has some benefits for downstream consumers for aggregation so it's likely good to keep. The cli count has been removed as it doesn't contain anything new and wasn't a useful metric. The naming of metrics has also been changed from using `docker` as a prefix and instead relying on the instrumentation name. Signed-off-by: Jonathan A. Sternberg <[email protected]>
3764dfa
to
ca644b5
Compare
Converting this to a draft while I work on updating the PR. |
Just going to close this and maybe keep the branch local. I think we're opting not to go this direction right now. The code this is based on has also changed a bit for some other PRs related to this and it would probably be best to integrate this logic with those rather than try to rebase. |
This adds a build duration metric with attributes related to the build
id, build ref, driver name, and the resulting status of the build.
This also modifies some aspects of how the resource is configured by
including
service.instance.id
. This id is used to uniquely identifythe CLI invocation for use in downstream aggregators. This allows
downstream aggregators to know that the metric for an instance has not
reset and that it is a unique invocation for future aggregation. Some
work will have to be done in the aggregator to prevent the storage of
this instance id as it can cause issues with cardinality limitations,
but it's necessary for the initial reporter to include this.
The temporality selector is still the same, but has been removed from
the otlp exporter options since that one doesn't seem to do anything. I
also recently learned the temporality didn't do what I thought it did.
I thought it would allow for multiple different invocations to be
treated as the same instance for the purposes of aggregation. It does
not work this way as the metric sdk considers each of these a gap reset.
That's the reason the instance id was added. This makes the difference
between cumulative and delta mostly cosmetic, but delta temporality has
some benefits for downstream consumers for aggregation so it's likely
good to keep.
The cli count has been removed as it doesn't contain anything new and
wasn't a useful metric. The naming of metrics has also been changed from
using
docker
as a prefix and instead relying on the instrumentationname.