Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for configuring Grafana Mimir via PrometheusRule CRDs #2604

Merged
merged 40 commits into from
Jan 6, 2023

Conversation

Logiraptor
Copy link
Contributor

@Logiraptor Logiraptor commented Dec 9, 2022

PR Description

This PR adds a new flow component: mimir.rules.kubernetes. This was my hackathon project for the December Grafana Labs hackathon. (I'm on the Mimir squad). Here's a list of items I managed to get done this week:

  • Live config reload works
  • Efficient use of the k8s api (informer pattern)
  • Rate limited (both overall and per-item for backoff) interaction with Mimir
  • Debug info available in agent UI
  • Able to use many agents without clobbering each other’s rules
  • Able to still use Grafana Cloud integration rules
  • Metrics available for work queue stats, k8s response time, mimir response time
  • Supports all methods of authentication with Mimir / GEM / Grafana cloud including anonymous, api key, basic auth, and oidc
  • Supports all TLS settings on the mimir client side

Which issue(s) this PR fixes

Fixes #1544
Fixes grafana/mimir#2133
Related to #523 and #1986

Notes to the Reviewer

I added a few tests including some integration tests for the core event loop, but please let me know if there are more tests I should add. Notably, some of the scaffolding code is currently untested.

Also missing as of now is documentation. Happy to contribute there, if you have any pointers?

PR Checklist

  • CHANGELOG updated
  • Documentation added
  • Tests updated
  • @Logiraptor to resolve the licensing issue for the Mimir Client code copied from github.com/grafana/mimir

@CLAassistant
Copy link

CLAassistant commented Dec 9, 2022

CLA assistant check
All committers have signed the CLA.

pkg/mimir/client/client.go Outdated Show resolved Hide resolved
Copy link
Member

@rfratto rfratto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is off to a great start, and I'm really excited to have this component.

We'll want a CHANGELOG entry and reference documentation before we merge. Writing documentation is usually the most time consuming part of new components since we try to explain as much as users need to know in the reference docs.

I added a few tests including some integration tests for the core event loop, but please let me know if there are more tests I should add. Notably, some of the scaffolding code is currently untested.

This will be a judgment call by you; if you think what you have tested is enough to have confidence in the correctness of the code (and any untested code is obviously correct), then you don't need to add any other tests :)

component/mimir/rules/debug.go Outdated Show resolved Hide resolved
component/mimir/rules/debug.go Outdated Show resolved Hide resolved
component/mimir/rules/rules.go Outdated Show resolved Hide resolved
component/mimir/rules/types.go Outdated Show resolved Hide resolved
component/mimir/rules/types.go Outdated Show resolved Hide resolved
component/mimir/rules/types.go Outdated Show resolved Hide resolved
component/mimir/rules/rules.go Outdated Show resolved Hide resolved
component/mimir/rules/rules.go Outdated Show resolved Hide resolved
component/mimir/rules/rules.go Outdated Show resolved Hide resolved
component/mimir/rules/diff.go Outdated Show resolved Hide resolved
@Logiraptor Logiraptor force-pushed the hackathon-2022-12-mimir-crds branch from f9f15a6 to ac615ce Compare December 13, 2022 23:43
@Logiraptor Logiraptor marked this pull request as ready for review December 14, 2022 00:09
@Logiraptor Logiraptor marked this pull request as draft December 14, 2022 00:09
Copy link
Member

@rfratto rfratto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 I'm happy with this. We still need a few things before it can be merged:

  • Changelog entry
  • Documentation

Other than that, looks great, thank you for working on this!

component/mimir/rules/diff.go Outdated Show resolved Hide resolved
component/mimir/rules/diff.go Outdated Show resolved Hide resolved
component/mimir/rules/events.go Outdated Show resolved Hide resolved
component/mimir/rules/events.go Outdated Show resolved Hide resolved
component/mimir/rules/events.go Outdated Show resolved Hide resolved
component/mimir/rules/events.go Outdated Show resolved Hide resolved
@Logiraptor Logiraptor marked this pull request as ready for review January 5, 2023 21:57
@Logiraptor
Copy link
Contributor Author

Ok, @rfratto I've added a first pass at reference documentation based on some of the other docs I found. I think at this point it should be good. Let me know if anything else needs fixing!

@Logiraptor Logiraptor requested a review from rfratto January 5, 2023 22:49
Copy link
Member

@rfratto rfratto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏 fantastic. I'll leave the final doc review to @karengermond, but once that's done, we'll get this merged!

Copy link
Member

@rfratto rfratto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, one extra thing :)

Copy link
Contributor

@karengermond karengermond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, aside from minor edits.

@Logiraptor
Copy link
Contributor Author

@rfratto All comments have been addressed 👍

@rfratto rfratto merged commit 3d62816 into main Jan 6, 2023
@rfratto rfratto deleted the hackathon-2022-12-mimir-crds branch January 6, 2023 21:15
@nicoche
Copy link
Contributor

nicoche commented Jan 9, 2023

Hi there!

Congrats on the work, I can't wait to run this in production 🙂. I've tried it on my lab, it worked great. Two things:

1 : I ran into an issue where the component would not boot correctly. cf. screenshot, it's neither healthy nor unhealthy, it did not finish booting. The issue was that the service account I used for Grafana Agent missed some capabilities. I used the one mentioned in the documentation (https://grafana.com/docs/grafana-cloud/kubernetes-monitoring/other-methods/k8s-agent-metrics/#deploy-grafana-agent-resources). Here is the diff for it to work:

 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole
 metadata:
   name: grafana-agent
 rules:
+ - apiGroups: # This group is needed to list PrometheusRules
+  - "monitoring.coreos.com"
+  resources:
+  - prometheusrules
+  verbs:
+   - '*'
 - apiGroups:
   - ""
   resources:
   - nodes
   - nodes/proxy
   - nodes/metrics
   - services
   - endpoints
   - pods
   - events
+   - namespaces # This was missing to list namespaces
 (...)

image

This call blocks when the service account has not enough privileges: https://github.com/grafana/agent/blob/main/component/mimir/rules/kubernetes/rules.go#L304.

I believe that we should:

  • At least edit the documentation to mention the permissions needed
  • Potentially change the code such that we enforce a timeout on the blocking call or verify the service account permissions when booting the component.

wdyt @rfratto @Logiraptor ?

2: I'm down to work on the AlertManager CRD integration. I believe that the approach could be largely similar to what has been done for PrometheusRules. wdyt? Shall I open a dedicated issue for that? On mimir and/or agent?

@rfratto
Copy link
Member

rfratto commented Jan 9, 2023

I believe that we should:

  • At least edit the documentation to mention the permissions needed
  • Potentially change the code such that we enforce a timeout on the blocking call or verify the service account permissions when booting the component.

Definitely both great ideas 👍

It might also be worth mentioning that the CRDs need to be installed in the cluster, though maybe that's obvious.

2: I'm down to work on the AlertManager CRD integration. I believe that the approach could be largely similar to what has been done for PrometheusRules. wdyt? Shall I open a dedicated issue for that? On mimir and/or agent?

Contributions here would be fantastic :) Creating an issue in the agent repo to track sounds good to me.

@nicoche
Copy link
Contributor

nicoche commented Jan 14, 2023

Gotcha, I popped an issue in the agent: grafana/alloy#504. I'll try to open a PR in the coming days

@rafilkmp3
Copy link

rafilkmp3 commented May 24, 2023

@Logiraptor Can you provide an example creating this integration? I moved from Prometheus to Prometheus-agent and I need a way to make Mimir use my cluster prometheus rules and use mimir alert-manager to send alerts

@Logiraptor
Copy link
Contributor Author

@rafilkmp3 Have you checked the docs? https://grafana.com/docs/agent/latest/flow/reference/components/mimir.rules.kubernetes/#example

That should help point you in the right direction

@rafilkmp3
Copy link

@Logiraptor Yes but I can't find a way to make this work using grafana-agent-operator

this is my configuration already with additional rbac

---
apiVersion: monitoring.grafana.com/v1alpha1
kind: GrafanaAgent
metadata:
  name: grafana-agent
  namespace: {{ .Release.Namespace }}
  labels:
    app: grafana-agent
spec:
  logLevel: info
  serviceAccountName: grafana-agent
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: eks.amazonaws.com/compute-type
                operator: NotIn
                values:
                  - fargate

  logs:
    instanceSelector:
      matchLabels:
        agent: grafana-agent

  integrations:
    selector:
      matchLabels:
        agent: grafana-agent

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: grafana-agent
  namespace: {{ .Release.Namespace }}

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: grafana-agent
  namespace: {{ .Release.Namespace }}
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/proxy
  - nodes/metrics
  - services
  - endpoints
  - pods
  - events
  - namespaces
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
- nonResourceURLs:
  - /metrics
  - /metrics/cadvisor
  verbs:
  - get
- apiGroups: # This group is needed to list PrometheusRules
  - "monitoring.coreos.com"
  resources:
  - prometheusrules
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: grafana-agent
  namespace: {{ .Release.Namespace }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: grafana-agent
subjects:
  - kind: ServiceAccount
    name: grafana-agent
    namespace: {{ .Release.Namespace }}

---
apiVersion: monitoring.grafana.com/v1alpha1
kind: Integration
metadata:
  labels:
    agent: grafana-agent
  name: agent-eventhandler
  namespace: {{ .Release.Namespace }}
spec:
  config:
    cache_path: /etc/eventhandler/eventhandler.cache
    logs_instance: {{ .Release.Namespace }}/grafana-agent-logs
  name: eventhandler
  type:
    unique: true
  volumeMounts:
    - mountPath: /etc/eventhandler
      name: agent-eventhandler
  volumes:
    - name: agent-eventhandler
      persistentVolumeClaim:
        claimName: agent-eventhandler

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: agent-eventhandler
  namespace: {{ .Release.Namespace }}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1G

Can you help me know where to add this section ?

mimir.rules.kubernetes "default" {
    address = "GRAFANA_CLOUD_METRICS_URL"
    basic_auth {
        username = "GRAFANA_CLOUD_USER"
        password = "GRAFANA_CLOUD_API_KEY"
        // Alternatively, load the password from a file:
        // password_file = "GRAFANA_CLOUD_API_KEY_PATH"
    }
}

@Logiraptor
Copy link
Contributor Author

The agent operator doesn't support grafana agent flow as far as I know. It's really two separate modes of using the agent. I'd have to defer to someone on the agent team to help from here, as this was a cross-team hackathon project for me and I'm not actually working on the agent full time.

Pinging @rfratto to help!

@rfratto
Copy link
Member

rfratto commented May 25, 2023

Unfortunately, Grafana Agent Operator is focused around static mode, which doesn't support PrometheusRule CRDs.

Grafana Agent Flow is currently the only Grafana Agent variant which supports reading PrometheusRule CRDs, and we don't have any plans right now to add support for it to static mode or the static mode operator.

@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 24, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.
Projects
None yet
6 participants