Export ActionSet Metrics for Kanister #1603

aish-where-ya · 2022-08-19T17:43:42Z

Change Overview

Created a skeleton for a pipeline to export metrics.
A sample Backup ActionSet Created metric is exposed. This is a counter that will increment every time an ActionSet of Backup type is created.
Exposed Metrics are available at /metrics.
Added Metric type definitions for future metrics to be exposed.
Added functionality to initialize gauges for these additional metrics.

Pull request type

Please check the type of change your PR introduces:

Issues

fixes Export ActionSet Metrics in Kanister #1602

Test Plan

💪 Manual
⚡ Unit test
💚 E2E

github-actions · 2022-08-19T17:43:54Z

Thanks for submitting this pull request 🎉. The team will review it soon and get back to you.

If you haven't already, please take a moment to review our project contributing guideline and Code of Conduct document.

ewhamilton · 2022-08-23T01:47:57Z

pkg/metrics/events.go

+
+// Event describes an individual event.
+//
+// Note: The type and labels are private in order to force the use of the


I would like to see the portion of comment block here at top of this file to explain the relationship between MetricType and constructors.

ewhamilton · 2022-08-23T02:01:44Z

pkg/controller/controller.go

@@ -209,6 +213,15 @@ func (c *Controller) onAddActionSet(as *crv1alpha1.ActionSet) error {
 	if err := validate.ActionSet(as); err != nil {
 		return err
 	}
+	// ActionSet created


This code might be better placed within c.handleActionSet (after determining status is pending, before updating status to be running).

miquella

Just a few thoughts, not a complete review

miquella · 2022-08-23T14:27:36Z

pkg/metrics/metrics.go

+}
+
+// Initialize a Prometheus GaugeVec for one metric and register it
+// nolint:all   // Function is for expanding on metrics to introduce Gauges


Disabling all linters for this entire function seems rather dangerous, doesn't it?

Aside: is this function being used anywhere?

Yes it is dangerous. The function is not being used anywhere and I feel it would be safe to remove it, Can be re-written easily.

miquella · 2022-08-23T14:31:17Z

pkg/metrics/metrics.go

+	if counterVecs[e.eventType].With(e.labels) == nil {
+		return fmt.Errorf("%s Labels for %s Event Type not found", e.labels, e.eventType)
+	}


I don't think (*CounterVec).With can fail in this way, can it?

At very least, we should probably capture the returned counter and increment it below (rather than calling With a second time to have it recreate the counter).

ewhamilton · 2022-08-23T21:43:12Z

pkg/metrics/metrics.go

+	"github.com/prometheus/client_golang/prometheus"
+)
+
+var counterVecs = make(map[MetricType]*prometheus.CounterVec)


Rather than a global, it would be more idiomatic Go to have a NewCounterVecMap function that returns a (probably opaque) *struct that contains the allocation.

That pointer would be stored in Controller. And then InitAllCounterVecs and IncrementCounterVec would be methods on *CounterVecMap.

Otherwise we have a singleton global that can be harder to test and assumes that there is only ever 1 such map in an image.

ewhamilton

It looks good. I would like to see the change from global to dynamically allocated struct before approving.

If @pavannd1 wants to approve as is I won't be upset.

ihcsim · 2022-08-23T22:22:29Z

pkg/metrics/events.go

+var MetricTypeOpts = map[MetricType]MetricTypeOpt{
+	SampleCountType: {
+		EventFunc:  NewSampleCount,
+		Help:       "Sample counter to remove later",


Is this still needed?

Removed. Done.

ihcsim · 2022-08-23T22:31:35Z

pkg/metrics/events.go

+	ActionSetBackupCreatedType   MetricType = "actionset_backup_created_count"
+	ActionSetBackupCompletedType MetricType = "actionset_backup_completed_count"
+
+	ActionSetRestoreCreatedType   MetricType = "actionset_restore_created_count"
+	ActionSetRestoreCompletedType MetricType = "actionset_restore_completed_count"
+
+	ActionSetTotalCreatedType   MetricType = "actionset_total_created_count"
+	ActionSetTotalCompletedType MetricType = "actionset_total_completed_count"


The metrics name should be prefixed with kanister_ per Prometheus doc.

Instead of the _created_count suffix, we should use _created_total. _count is used for histograms and summaries types. See here.

It's also better to instrument for failures_total than completed_count, to make calculation of failure rates easier. See slide 9 & 10 of this slides and the Prometheus doc.

I don't think we need separate backup and restore metrics. They can be represented as labels (with function names as values) in the kanister_actionset_completed_total and kanister_actionset_failures_total metrics.

Small note: …_failures_total doesn't stand on its own if the intention is to calculate failure rate. A separate completed count would need to be maintained, wouldn't it?

It appears that the way the original …_completed_count was implemented would still accomplish the same goal since it includes a state that indicates whether it failed or not. This should allow the rate to be calculated as rate(…_completed_count{state="failure"}[5m]) / rate(…_completed_count[5m]).

I assume _created_count is what will be used with _failures_total to calculate failure rates. The state label works too, but then why have a separate created metric? I interpret completed as succeeded.

Yeah, I'm not sure if/what value the separate created metric provides.

But it did seem weird to me to calculate a rate based upon created vs. completed + failed (especially since there's a strange temporal dissonance there…). It seemed more logical to me to compare the number of completed vs. completed + failed.

Is completed + failed the same as "created", which is the same as "total"? In my mind, they are. As a start, I want to focus on the error rates i.e., rate(actionset_failures_total[5m]) / rate(actionset_total[5m]). It's more commonly used than non-error rates from a system alert perspective. The rate(actionset_total) is also great for detecting saturation and spikes.

At the code level, ideally, there are two places that call Inc(); one for requests_total when the actionset is initiated, and one for failures_total inside the caller's if err != nil.

How would the state label work on a counter metric, as the actionset goes through different phases throughout its lifecycle? Will there be one data point for each state?

ihcsim · 2022-08-23T22:47:31Z

pkg/metrics/events.go

+	},
+}
+
+// Event describes an individual event.


I am not sure we are gaining much with this new Event type. If I understand the intent here, it feels like a sample is closer to what you are trying to describe. It represents a specific instance value of the metric, at a given time. To me, events represent action, incident and occurrence recognized by the system such as those defined in this K8s list.

What do you think of merging Event, MetricType and MetricTypeOpt into a Metric and Sample structs? E.g.,

type Metric struct { Name string Help string LabelNames []string } type Sample struct { Name string Labels prometheus.Labels }

Here is a diff with suggestions on what the code can look like with the two types:

diff --git a/pkg/controller/controller.go b/pkg/controller/controller.go index ff5a0a10..94e65aad 100644 --- a/pkg/controller/controller.go +++ b/pkg/controller/controller.go @@ -62,7 +62,7 @@ type Controller struct { config *rest.Config crClient versioned.Interface clientset kubernetes.Interface - counterVecs map[metrics.MetricType]*prometheus.CounterVec + counterVecs map[string]*prometheus.CounterVec dynClient dynamic.Interface osClient osversioned.Interface recorder record.EventRecorder @@ -374,7 +374,8 @@ func (c *Controller) handleActionSet(as *crv1alpha1.ActionSet) (err error) { // These labels will be passed to a function in the metrics package // Also, a list of events to update metrics should be created and incremented. for _, a := range as.Spec.Actions { - if err := metrics.IncrementCounterVec(metrics.NewActionSetTotalCreated(a.Name, as.GetNamespace())); err != nil { + s := metrics.NewSampleActionSetTotal(a.Name, as.GetNamespace()) + if err := metrics.IncrementCounterVec(s); err != nil { log.Error().WithError(err).Print("Metrics Incrementation failed") } } diff --git a/pkg/metrics/events.go b/pkg/metrics/events.go index 81b341d1..65f026c6 100644 --- a/pkg/metrics/events.go +++ b/pkg/metrics/events.go @@ -1,5 +1,7 @@ package metrics +import "github.com/prometheus/client_golang/prometheus" + // This file contains wrapper functions that will map a Prometheus metric names to its // label field, help field and an associated Event. @@ -8,18 +10,37 @@ package metrics type MetricType string const ( - SampleCountType MetricType = "sample_count" - - ActionSetBackupCreatedType MetricType = "actionset_backup_created_count" - ActionSetBackupCompletedType MetricType = "actionset_backup_completed_count" + ActionSetBackupCreatedType = "actionset_backup_created_count" + ActionSetBackupCompletedType = "actionset_backup_completed_count" - ActionSetRestoreCreatedType MetricType = "actionset_restore_created_count" - ActionSetRestoreCompletedType MetricType = "actionset_restore_completed_count" + ActionSetRestoreCreatedType = "actionset_restore_created_count" + ActionSetRestoreCompletedType = "actionset_restore_completed_count" - ActionSetTotalCreatedType MetricType = "actionset_total_created_count" - ActionSetTotalCompletedType MetricType = "actionset_total_completed_count" + ActionSetTotalCreatedType = "actionset_total_created_count" + ActionSetTotalCompletedType = "actionset_total_completed_count" ) +type Metric struct { + Name string + Help string + LabelNames []string +} + +type Sample struct { + Name string + Labels prometheus.Labels +} + +func NewSampleActionSetTotal(action string, namespace string) *Sample { + return &Sample{ + Name: ActionSetTotalCompletedType, + Labels: prometheus.Labels{ + "actionType": action, + "namespace": namespace, + }, + } +} + // MetricTypeOpt is a struct for a Prometheus metric. // Help and LabelNames are passed directly to the Prometheus predefined functions. // EventFunc holds the constructor of the linked Event of a given MetricType. @@ -32,40 +53,32 @@ type MetricTypeOpt struct { // Mapping a Prometheus MetricType to the metric MetricTypeOpt struct. // Basically, a metric name is mapped to its associated Help and LabelName fields. // The linked event function (EventFunc) is also mapped to this metric name as a part of MetricTypeOpt. -var MetricTypeOpts = map[MetricType]MetricTypeOpt{ - SampleCountType: { - EventFunc: NewSampleCount, - Help: "Sample counter to remove later", - LabelNames: []string{"sample"}, - }, - +var Counters = map[string]*Metric{ ActionSetBackupCreatedType: { - EventFunc: NewActionSetBackupCreated, - Help: "The count of backup ActionSets created", + Name: ActionSetBackupCreatedType, + Help: "The count of backup ActionSets created", }, ActionSetBackupCompletedType: { - EventFunc: NewActionSetBackupCompleted, + Name: ActionSetBackupCompletedType, Help: "The count of backup ActionSets completed", LabelNames: []string{"state"}, }, - ActionSetRestoreCreatedType: { - EventFunc: NewActionSetRestoreCreated, - Help: "The count of restore ActionSets created", + Name: ActionSetRestoreCreatedType, + Help: "The count of restore ActionSets created", }, ActionSetRestoreCompletedType: { - EventFunc: NewActionSetRestoreCompleted, + Name: ActionSetRestoreCompletedType, Help: "The count of restore ActionSets completed", LabelNames: []string{"state"}, }, - ActionSetTotalCreatedType: { - EventFunc: NewActionSetTotalCreated, + Name: ActionSetTotalCreatedType, Help: "The count of total ActionSets created", LabelNames: []string{"actionType", "namespace"}, }, ActionSetTotalCompletedType: { - EventFunc: NewActionSetTotalCompleted, + Name: ActionSetTotalCompletedType, Help: "The count of total ActionSets completed", LabelNames: []string{"actionType", "namespace", "state"}, }, @@ -98,15 +111,6 @@ func (e *Event) Labels() map[string]string { return labels } -func NewSampleCount(sample string) Event { - return Event{ - eventType: SampleCountType, - labels: map[<details> <summary>Click me</summary>string]string{ - "sample": sample, - }, - } -} - func NewActionSetBackupCreated() Event { return Event{ eventType: ActionSetBackupCreatedType, diff --git a/pkg/metrics/metrics.go b/pkg/metrics/metrics.go index 17a0bb55..f76c60f9 100644 --- a/pkg/metrics/metrics.go +++ b/pkg/metrics/metrics.go @@ -8,25 +8,19 @@ import ( "github.com/prometheus/client_golang/prometheus" ) -var counterVecs = make(map[MetricType]*prometheus.CounterVec) +var counterVecs = map[string]*prometheus.CounterVec{} // Initialize a Prometheus CounterVec for one metric and register it -func initCounterVec(r prometheus.Registerer, t MetricType) (*prometheus.CounterVec, error) { - metricTypeOpts, ok := MetricTypeOpts[t] - - if !ok { - panic(fmt.Sprintf("Event type %s is not defined", t)) - } - +func initCounterVec(r prometheus.Registerer, m *Metric) (*prometheus.CounterVec, error) { opts := prometheus.CounterOpts{ - Name: string(t), - Help: metricTypeOpts.Help, + Name: string(m.Name), + Help: m.Help, } - counterVec := prometheus.NewCounterVec(opts, metricTypeOpts.LabelNames) + counterVec := prometheus.NewCounterVec(opts, m.LabelNames) err := r.Register(counterVec) if err != nil { - return nil, fmt.Errorf("%s not registered: %s ", t, err) + return nil, fmt.Errorf("%s not registered: %s ", m.Name, err) } var alreadyRegisteredErr prometheus.AlreadyRegisteredError if errors.As(err, &alreadyRegisteredErr) { @@ -39,26 +33,26 @@ func initCounterVec(r prometheus.Registerer, t MetricType) (*prometheus.CounterV } // Initialize all the Counter Vecs and save it in a map -func InitAllCounterVecs(r prometheus.Registerer) map[MetricType]*prometheus.CounterVec { - for metricType := range MetricTypeOpts { - cv, err := initCounterVec(r, metricType) +func InitAllCounterVecs(r prometheus.Registerer) map[string]*prometheus.CounterVec { + for _, counter := range Counters { + cv, err := initCounterVec(r, counter) if err != nil { log.WithError(err).Print("Failed to register metric %s") return nil } - counterVecs[metricType] = cv + counterVecs[counter.Name] = cv } return counterVecs } // Increment a Counter Vec metric -func IncrementCounterVec(e Event) error { - if counterVecs[e.eventType] == nil { - return fmt.Errorf("%s Event Type not found", e.eventType) +func IncrementCounterVec(s *Sample) error { + if counterVecs[s.Name] == nil { + return fmt.Errorf("%s Event Type not found", s) } - cv := counterVecs[e.eventType].With(e.labels) + cv := counterVecs[s.Name].With(s.Labels) if cv == nil { - return fmt.Errorf("%s Labels for %s Event Type not found", e.labels, e.eventType) + return fmt.Errorf("%s Labels for %s Event Type not found", s.Labels, s.Name) } cv.Inc() return nil

LMKWYT.

It looks like much of this was modeled after the counter metrics I added in K10, so while I can't speak much to this code, I can offer some insights into the K10 design.

There were two primary things considered in K10's design:

How can we leverage the type system to prevent the Prometheus library panicking unexpectedly?

How can we model the metrics to conceptually match what's going on in the system?

The Prometheus library has an annoying propensity to panic if a CounterVec is requested with a different set of labels than it was instantiated with. I ran into this several times while prototyping the solution, so I decided to employ the type system to help protect against this.

This meant that the developer shouldn't be able to freely associate a metric name with labels (because that was the case where we kept running into trouble and causing panics).

So the metric type had a private metric name and set of labels that were instantiated via a constructor. This meant that we only needed to ensure that the constructors worked properly, which we did through a reflection based unit test.

As for conceptual models for the metrics, I elected to use "event" over "sample" because inherently the transitionary event between a previous state and a new state (in other words, it's representing the point at which an increase happens).

This is distinct from what I think of as a "sample": the measure of the overall state at an instantaneous point in time (not the transition between two states, just the measure of the state).

To put a different way, I thought of a "sample" more as reporting the current value (i.e. a gauge), whereas "events" were the accumulation of individual changes (i.e. a counter).

So if we're going to eschew the type protections, we should recognize what the consequences might be and we should decide if the terminology aligns with conceptual models (personally, "sample" doesn't seem to make as much sense to me in this context).

Before we toss away the type system prevention of Prometheus panics, please note that the Event structure facilitates writing a unit test that validates that the labels from each constructor match the labels that are used to initialize the CounterVec. Or so I've been told :).

I don't think I am advocating that we tossed away the type system. In my diff, there is still a constructor to constrain the labels that can be passed in. E.g.,

func NewSampleActionSetTotal(action string, namespace string) *Sample { // .... }

It isn't obvious to me how my diff is hindering the type system protection you are referring to. My intention is still this:

What do you think of merging Event, MetricType and MetricTypeOpt into a Metric and Sample structs?

Re: conceptual model, I picked "sample" because it represents a data point instance in the timeseries. Yeah, I agree it doesn't align as well as I'd want, because in a TSDB, a sample of a metric time series is a k/v pair of timestamp to value (with all the label values defined). In this case, we only have the label values, not its value at that given time. Hence, I said, "closer to what you are trying to describe" 🙂. What I have in mind is like MetricData, MetricDataPoint(not saying those should be the name).

I am not convinced "event" is clearer; feels like this is where it gets subjective. I think we agree that type system protection is good, and we want some way to conceptually represent the metric and distinguish it from the metric instance at a given time, right?

Perhaps the concern with your diff is that the Name and Labels are public fields. Wouldn't we be worried about developers constructing Samples inline and not using the constructors, thus bypassing the type system?

--

I agree that I'm not convinced "event" is more clear and that it's a very subjective topic 😄 However, one additional consideration might be: is using "sample" here going to cut off the term we would want to use for gauges in the future (when we add gauge metrics)?

Unexporting the public fields sound actionable to me. Personally, I am not that worried about the panics caused by initialization with unknown labels 😄, because:

I am not convinced that they won't be caught during developer testing and code review (assuming they happen 😄)

Most common vectors methods like With() and WithLabelValues() that panic, have a non-panic Get* counterpart

I expect developers to learn and understand the packages they are using (through docs, code review, mentorship etc.), instead of relying on others to fail-proof things for them.

I am not opposed to unexporting the public fields though.

As for the usage of "sample", yeah, it's not a good choice. I prefer not to overload the TSDB definition of "sample", which is a k/v pair of timestamp to values, regardless of the metric type. I don't have a better name for it, so Event is fine (it's more generic than "sample") until we can come up with something else.

ihcsim · 2022-08-23T23:00:23Z

pkg/metrics/events.go

+	return labels
+}
+
+func NewSampleCount(sample string) Event {


Still needed?

Removed. Done.

ihcsim · 2022-08-23T23:05:20Z

pkg/metrics/metrics.go

+// Initialize all the Counter Vecs and save it in a map
+func InitAllCounterVecs(r prometheus.Registerer) map[MetricType]*prometheus.CounterVec {
+	for metricType := range MetricTypeOpts {
+		cv, err := initCounterVec(r, metricType)


Are all metricType counters? If so, can we reflect that in the MetricTypeOpts name? How about just call it Counters?

ihcsim · 2022-08-23T23:08:08Z

pkg/metrics/metrics.go

+	metricTypeOpts, ok := MetricTypeOpts[t]
+
+	if !ok {
+		panic(fmt.Sprintf("Event type %s is not defined", t))


I prefer we don't panic the controller here, just because some counters don't exist. Why not just return it as an error, and let the caller decide what to do with it (e.g., log it as an error)? See the Go code review wiki.

ihcsim · 2022-08-23T23:10:07Z

pkg/metrics/metrics.go

+	if errors.As(err, &alreadyRegisteredErr) {
+		counterVec = alreadyRegisteredErr.ExistingCollector.(*prometheus.CounterVec)
+	} else if err != nil {
+		panic(err)


We should return the error and let the higher level caller handle this. See the Go code review wiki.

ihcsim · 2022-08-23T23:19:18Z

pkg/controller/controller.go

@@ -69,7 +72,8 @@ type Controller struct {
 // New create controller for watching kanister custom resources created
 func New(c *rest.Config) *Controller {
 	return &Controller{
-		config: c,
+		config:      c,
+		counterVecs: metrics.InitAllCounterVecs(prometheus.DefaultRegisterer),


Does the controller need to hold on to this? It doesn't look like the controller is doing anything with it.

Tested this out without keeping the counters in the controller and metric reporting did not work.
More testing required on this. Or a designated Reporter() for all the metrics.

github-actions · 2022-10-25T00:18:52Z

This PR is marked as stale due to inactivity. Add a new comment to reactivate it.

github-actions · 2022-11-25T00:10:01Z

This PR is closed due to inactivity. Feel free to reopen it, if it's still relevant.

aish-where-ya requested review from ewhamilton, ihcsim, miquella and pavannd1 August 19, 2022 17:46

ewhamilton reviewed Aug 23, 2022

View reviewed changes

miquella reviewed Aug 23, 2022

View reviewed changes

ewhamilton reviewed Aug 23, 2022

View reviewed changes

ihcsim reviewed Aug 24, 2022

View reviewed changes

aish-where-ya added 8 commits August 25, 2022 01:18

Inital commit - Prometheus Metrics for Kanister

f25e3e7

Metric ActionSetBackupCreated now in a working state

7ecc7b2

Minor comment

0f0a899

Changed Metric to report any new ActionSet created

25c4f23

Fixed linter

4fcb1f4

Addressed Eric's reveiw comments

69b2dea

Addressed Mark's review comments

796c462

Addressed Ivan's review comments (partially)

424ff9d

aish-where-ya force-pushed the prometheus-metrics branch from 3fa6542 to 424ff9d Compare August 25, 2022 05:18

github-actions bot added the stale label Oct 25, 2022

github-actions bot added the rotten label Nov 25, 2022

github-actions bot closed this Nov 25, 2022

julio-lopez deleted the prometheus-metrics branch July 22, 2023 04:04

Export ActionSet Metrics for Kanister #1603

Export ActionSet Metrics for Kanister #1603

Conversation

aish-where-ya commented Aug 19, 2022 • edited Loading

Change Overview

Pull request type

Issues

Test Plan

github-actions bot commented Aug 19, 2022

ewhamilton Aug 23, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

miquella left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ewhamilton left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ihcsim Aug 23, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ihcsim Aug 25, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ihcsim Aug 23, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Oct 25, 2022

github-actions bot commented Nov 25, 2022

aish-where-ya commented Aug 19, 2022 •

edited

Loading

ewhamilton Aug 23, 2022 •

edited

Loading

ihcsim Aug 23, 2022 •

edited

Loading

ihcsim Aug 25, 2022 •

edited

Loading

ihcsim Aug 23, 2022 •

edited

Loading