Proposal: DaemonSet-like mode for Grafana Agent Operator #1495

rfratto · 2022-03-14T20:33:24Z

Grafana Agent Operator currently requires deploying multiple sets of agents:

A StatefulSet for metrics
A DaemonSet for logs
(Eventually) A DaemonSet for integrations that should run on every node (Operator: Implement support for integrations #1414)
(Eventually) A Deployment for integrations that should not run on every node (Operator: Implement support for integrations #1414)
(Eventually) A Deployment for traces(?) ([Feature request] Allow configuration of traces in operator-spawned agent #1044)

The specific resources deployed by the operator are ideally an implementation detail to the users, but it's still not ideal that we need to do this. One side effect of the current implementation is that the requests/limits you assign to the GrafanaAgent resource are shared with all deployments of the agent. This is redundant (not all pods need the same requests/limits) and duplicative (the total resource requests are your requests * the number of pods the operator determines it needs to deploy).
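To make the "requests * number of pods" effect concrete, here is a minimal arithmetic sketch. The workload names, pod counts, and the 500m CPU request are made-up illustration values, not numbers from the operator.

```python
from typing import Dict


def total_cpu_request(per_pod_millicores: int, pod_counts: Dict[str, int]) -> int:
    """Total CPU requested when every agent pod inherits the same request."""
    return per_pod_millicores * sum(pod_counts.values())


# Hypothetical cluster: 3 metrics pods plus one logs pod on each of 10 nodes.
pods = {"metrics-statefulset": 3, "logs-daemonset": 10}

# A single 500m request on the GrafanaAgent resource is applied 13 times.
total = total_cpu_request(500, pods)
```

With these assumed values the cluster-wide request is 6500m, even if the logs pods would need far less than the metrics pods.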

I propose that Grafana Agent Operator supports a "DaemonSet-like" mode, where it manages one pod per node handling all telemetry, including integrations. We should use a "DaemonSet-like" controller to allow PVCs to be created per pod, as real DaemonSets don't support this.
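As a rough illustration of what a "DaemonSet-like" controller would have to compute, the sketch below diffs the current Node list against the existing agent pods and gives each pod its own PVC name. All names here (`AgentPod`, `desired_pod`, `reconcile`, the `grafana-agent-` and `agent-data-` prefixes) are hypothetical; this is plain Python with no Kubernetes client, not the operator's implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class AgentPod:
    name: str
    node: str
    pvc: str  # one PVC per pod, which real DaemonSets cannot provide


def desired_pod(node: str) -> AgentPod:
    # Hypothetical naming scheme: pod and PVC names are derived from the node.
    return AgentPod(name=f"grafana-agent-{node}", node=node, pvc=f"agent-data-{node}")


def reconcile(nodes: Iterable[str], existing: Iterable[AgentPod]) -> Tuple[List[AgentPod], List[AgentPod]]:
    """Return (pods to create, pods to delete) so each node has exactly one pod."""
    existing = list(existing)
    desired = {node: desired_pod(node) for node in nodes}
    covered = {pod.node for pod in existing}
    to_create = [desired[node] for node in sorted(desired) if node not in covered]
    to_delete = [pod for pod in existing if pod.node not in desired]
    return to_create, to_delete
```

A real controller would run this kind of diff whenever the list of Nodes changes, and would also have to handle pod updates and PVC cleanup, which this sketch leaves out.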

As-is, this proposal isn't ready for work, and has at least a few dependencies:

We must be able to effectively scale a DaemonSet used for metrics. [RFC] Integrations in Grafana Agent Operator #1224 would be crucial for enabling this.
Integrations which should not run on every node will need some kind of sharding mechanism, either using the clustering from [RFC] Integrations in Grafana Agent Operator #1224 or some other trick.
The Operator would need a DaemonSet-like controller to manage pods per Node, reconciling when the list of Nodes changes.
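One possible "other trick" for integrations that should not run on every node is to have each per-node agent decide locally whether it owns a given integration. The sketch below uses rendezvous (highest-random-weight) hashing for this; it is an assumption chosen for illustration, not the clustering mechanism from the RFC, and the `owner` and `should_run` helpers are hypothetical.

```python
import hashlib
from typing import Sequence


def owner(integration: str, nodes: Sequence[str]) -> str:
    """Pick the single node that should run the integration.

    Every node is scored by hashing (node, integration); the highest score wins.
    """
    def score(node: str) -> str:
        return hashlib.sha256(f"{node}/{integration}".encode()).hexdigest()

    return max(nodes, key=score)


def should_run(integration: str, this_node: str, nodes: Sequence[str]) -> bool:
    # Each agent evaluates this for itself, given the same view of the Node list.
    return owner(integration, nodes) == this_node
```

Because the result depends only on the set of nodes, all agents agree on the owner without coordinating, and removing a node that is not the owner leaves the assignment unchanged. It does assume that every agent sees the same Node list.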

Despite it not being ready, I'm opening this as a proposal now because:

I've mentioned it being a long-term goal of the Operator several times in various places but never formally wrote it down anywhere
The proposal should be open for feedback and alternative ideas for how we can trim down the set of pods deployed by the Operator, including whether it needs to change at all.

aengusrooneygrafana · 2022-03-16T11:26:36Z

adding 👀 for @grafana/solutions-engineering

mrmartan · 2022-03-16T16:20:28Z

I am currently implementing deployment of a fleet of Grafana Agents on company Kubernetes clusters using standalone/manual deployments (e.g. the Grafana Cloud provided K8s integration) and resources deployed by the agent operator. Deploying and configuring all the agents for all three observability pillars with reliability under load and at scale is anything but straightforward. There are many things one has to know, even with the operator. I know of some, e.g. sharding of metrics agents and load-balancing of traces agents, but there are many more cases where I don't know what I don't know yet.

I am wholly behind the idea presented here. It does not matter whether the implementation is DaemonSet-like or anything else. I don't think you have to restrict yourself to the notion of 'agent per node'. You can't know how big each node is and whether you can vertically scale one agent instance to handle all load. That said, the operator could integrate with cadvisor and kube-state-metrics or similar and use data from them to scale agents accordingly.
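A minimal sketch of that kind of load-based scaling, assuming the operator can obtain some per-node load figure (here, active series) and that one agent has a known capacity. Both the metric and the capacity number are hypothetical placeholders.

```python
import math


def agents_for_node(active_series: int, series_per_agent: int) -> int:
    """Number of agent instances needed for a node, with a minimum of one."""
    if series_per_agent <= 0:
        raise ValueError("series_per_agent must be positive")
    return max(1, math.ceil(active_series / series_per_agent))


# Hypothetical node producing 250k series, with agents sized for 100k each.
needed = agents_for_node(250_000, 100_000)
```

With these assumed numbers a large node would get three agents while a small node still gets one, which relaxes the strict one-agent-per-node model.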

> The specific resources deployed by the operator are ideally an implementation detail to the users

That should be true but I don't feel it is. I have to understand what is happening under the hood to be able to scale.

Ideally I wouldn't want to deal with GrafanaAgent and LogsInstance/MetricsInstance at all. I want to define monitors, PodMonitor, ServiceMonitor, LogsMonitor, TraceMonitor and that's it. Although I might be asking too much here 😄

rfratto · 2022-04-13T15:32:22Z

This proposal would be superseded by #1565.

james-callahan · 2023-02-10T05:54:01Z

An application I'd like this for is being able to scrape the local kubelet metrics on each node. Currently you need your MetricsInstance to be able to reach the host network of all nodes, which can be problematic in certain environments.

Potentially solving this issue would let you deprecate/remove the kubelet service thing.

Note that you should be able to specify for a given DaemonSet if you want hostNetwork.
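As a sketch of what a per-node agent with an optional hostNetwork setting could look like, the snippet below builds a pod spec fragment as a plain dict and derives the local kubelet scrape URL from the node IP. `hostNetwork`, `dnsPolicy`, and the `status.hostIP` downward API field are standard Kubernetes pod spec fields, and 10250 is the kubelet's default secure port; the function names, the `HOST_IP` variable, and the container name are hypothetical and not operator API.

```python
def agent_pod_spec(host_network: bool) -> dict:
    """Pod spec fragment for a per-node agent (image and volumes omitted)."""
    return {
        "hostNetwork": host_network,
        # Pods on the host network need this policy to keep using cluster DNS.
        "dnsPolicy": "ClusterFirstWithHostNet" if host_network else "ClusterFirst",
        "containers": [
            {
                "name": "grafana-agent",
                "env": [
                    {
                        # Downward API: expose the IP of the node the pod runs on.
                        "name": "HOST_IP",
                        "valueFrom": {"fieldRef": {"fieldPath": "status.hostIP"}},
                    }
                ],
            }
        ],
    }


def kubelet_target(host_ip: str, port: int = 10250) -> str:
    # Each agent scrapes only the kubelet on its own node.
    return f"https://{host_ip}:{port}/metrics"
```

Since every agent targets only its own node's kubelet, no single MetricsInstance needs network reachability to all nodes.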

rfratto added the operator (Grafana Agent Operator related) and proposal (Proposal or RFC) labels Mar 14, 2022

rfratto mentioned this issue Mar 31, 2022

Questions: Grafana-agent-operator created component resource limit #1548

Closed

rfratto mentioned this issue Apr 21, 2022

grafana-agent-operator: support taints for DaemonSet-based scraping workloads #1636

Closed

rfratto added this to Grafana Agent (Public) Oct 3, 2022

rfratto added the area/operator label Oct 3, 2022

rfratto removed the type/operator label Nov 2, 2023

rfratto added the variant/operator (Related to Grafana Agent Static Operator) label Apr 9, 2024
