# All metrics/logs labelled as subordinate 'grafana-agent/N', cannot filter or see the principal charm name/unit number #60
Comments
Hi @lathiat, agreed that this isn't ideal. Let me try to address this one by one:

Let's use this ticket as a bug report where we address part 2. Does that make sense?

Best,
OK, after much re-reading through the different issues a few times, it seems we have 3 separate cases to reason about here.

### Case 1: 1 Machine with 1 Principal, 2 Subordinates

When the same principal unit (e.g. ceph-mon/0) is related to two different COS subordinates, e.g. prometheus-scrape-config and grafana-agent. In that case, both subordinates may create labels or names based on the common principal unit name (ceph-mon/0), so they would sometimes overwrite each other's rules (canonical/prometheus-k8s-operator#551).

At least I thought that is what this was, except it seems prometheus-scrape-config-ceph was really a principal charm in the COS model, and not a subordinate on ceph-mon. I'm not sure how the principal unit was being passed through that relation. However it mostly still applies: the point seems to have been that the same named item was configured by two different "subordinates" (not really a subordinate in this actual case, but it might be in some other cases).
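To make the collision concrete, here is a purely hypothetical sketch (the group name, alert name and expression are made up for illustration, not taken from either charm) of how two subordinates deriving the rule group name from the principal unit alone could clobber each other:

```yaml
# Illustrative only: a rule group whose name is derived solely from the
# principal unit. If prometheus-scrape-config and grafana-agent both upload
# a group with this name for ceph-mon/0, whichever arrives last overwrites
# the other.
groups:
  - name: juju_mymodel_ceph-mon_0_alerts   # hypothetical naming scheme
    rules:
      - alert: CephMonDown
        expr: 'up{juju_application="ceph-mon"} == 0'
        for: 5m
```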
### Case 2: 1 Machine with 2 Principals, 1 Subordinate instantiated twice (e.g. ceph-osd/2 with grafana-agent/3 and ubuntu/0 with grafana-agent/8)

When the same machine has 2 different principal units installed, both of which are related to the same grafana-agent subordinate. In this case we have two different principal unit names and two different subordinate unit names (e.g. principals ceph-osd/2 and ubuntu/2, subordinates grafana-agent/3 and grafana-agent/8), all on the same machine and with the same actual installation of grafana-agent.
### Case 3: 2 Machines each with 1 different Principal unit, related to the same 1 Subordinate (e.g. kafka/0 and zookeeper/0 both related to grafana-agent)

When the same subordinate (e.g. grafana-agent) is related to multiple principal units on different machines (#17). It's unclear to me why it was getting confused in this case, since the two principal units were on different machines, except possibly the note that it was "difficult to get the principal unit", which I address below.
### Analysis

Identifying rules and metrics based on the subordinate name does help with Case 1, as it seems both subordinate charms would otherwise generate metrics or rules with the same name. However, it now means we have lost the ability to filter based on the principal name. Having to work with grafana-agent/N and not being able to reason about principals like ceph-mon/0, ceph-osd/0, etc. is a serious blocker in my view, and I don't see it being a usable observability system that way. I'd like to keep this bug about that specific issue; the hostname part is related but more minor.

#47 claimed that it was difficult to determine the principal unit from the charm code, however all of the existing LMA charms have long been doing this. JUJU_PRINCIPAL is generally passed into hooks, though I would note that the filebeat charm at least seemed to have to cache this as it might not always be available; I don't immediately see details of when it isn't.

It seems to me the real solution would be to label and name the metrics based on both the principal and the subordinate, so that we can still filter for metrics on the principal, but the unique identifiers for rules etc. would still be unique by having the subordinate also listed. In the case of 1 machine with 2 principals, we may duplicate some collections if the data is collected twice with the two principal names, but the cardinality expansion should, I think, be limited to only the number of principal units, and we usually only have 1, maybe 2. It's very rare to have more than 2 principal applications on the same machine.

It wouldn't surprise me if I have missed something here, and I am pretty green to prometheus+loki, so please let me know what I have missed; it took a bit to get my head around. But I think the key points are that we need to be able to filter on the principal, and that labelling with both the principal and the subordinate would keep the identifiers unique without losing that.
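As a rough sketch of that proposal (the principal_* label names are my own invention for illustration, not something either charm emits today), a single node-exporter series could carry both topologies:

```yaml
# Hypothetical label set for one node-exporter series on a machine where
# ceph-osd/2 is the principal and grafana-agent/3 is the subordinate.
# The juju_* labels keep identifiers unique per subordinate unit, while the
# principal_* labels restore filtering by the principal.
node_memory_MemAvailable_bytes:
  juju_model: lxd
  juju_application: grafana-agent
  juju_unit: grafana-agent/3
  principal_application: ceph-osd
  principal_unit: ceph-osd/2
```

Dashboards and ad-hoc queries could then select on the principal labels, while rule names and job identifiers stay unique per grafana-agent unit.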
If you read the above comment in an e-mail, note that I made a couple of minor edits shortly after posting, so it would be best to read the latest version.
So I would like to discuss logs and metrics separately.

### Logs

Logs will support having both labels once #46 is completed. Grafana-agent will send a "standard" set of logs with its own labels, and the charm can request that specific log files be labelled with its topology. Additionally, logs originating from snaps will get the labels of the charm that declared them.

### Metrics

The metrics story is a bit different. We have decided to label any metrics generated by grafana-agent itself (node-exporter) with grafana-agent's topology, while metrics which we scrape from the application get the application's topology. So any dashboards or rules provided should work just fine with the application labels.
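Roughly, the resulting label sets would look like the sketch below (model and unit names are illustrative only, not taken from a real deployment):

```yaml
# Metrics produced by grafana-agent itself (node-exporter) carry
# grafana-agent's own topology:
node_exporter_metrics:
  juju_model: lxd
  juju_application: grafana-agent
  juju_unit: grafana-agent/2
# Metrics scraped from the related application via cos-agent carry
# that application's topology:
scraped_application_metrics:
  juju_model: lxd
  juju_application: zookeeper
  juju_unit: zookeeper/0
```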
Hi @lathiat,
### Reproduction

To elaborate on @dstathis's response above, it would be handy if you could post a minimal ceph-osd bundle, e.g.:

```yaml
# lxd model
series: jammy
saas:
  loki:
    url: microk8s:admin/pebnote.loki
  prom:
    url: microk8s:admin/pebnote.prom
applications:
  ga:
    charm: grafana-agent
    channel: edge
    revision: 52
  ub:
    charm: ubuntu
    channel: edge
    revision: 24
    num_units: 1
    to:
    - "0"
  ubu:
    charm: ubuntu
    channel: edge
    revision: 24
    num_units: 1
    to:
    - "0"
  zk:
    charm: zookeeper
    channel: 3/edge
    revision: 125
    num_units: 1
    to:
    - "1"
    trust: true
machines:
  "0":
    constraints: arch=amd64
  "1":
    constraints: arch=amd64
relations:
- - ga:juju-info
  - ub:juju-info
- - ga:juju-info
  - ubu:juju-info
- - ga:logging-consumer
  - loki:logging
- - ga:send-remote-write
  - prom:receive-remote-write
- - ga:cos-agent
  - zk:cos-agent
```

```yaml
# microk8s model
bundle: kubernetes
saas:
  remote-a62e4e5eeec84aa78034f543c0218901: {}
applications:
  loki:
    charm: loki-k8s
    channel: edge
    revision: 121
    resources:
      loki-image: 91
    scale: 1
    trust: true
  prom:
    charm: prometheus-k8s
    channel: edge
    revision: 170
    resources:
      prometheus-image: 139
    scale: 1
    trust: true
relations:
- - loki:logging
  - remote-a62e4e5eeec84aa78034f543c0218901:logging-consumer
- - prom:receive-remote-write
  - remote-a62e4e5eeec84aa78034f543c0218901:send-remote-write
```

```yaml
# overlay.yaml
applications:
  loki:
    offers:
      loki:
        endpoints:
        - logging
        acl:
          admin: admin
  prom:
    offers:
      prom:
        endpoints:
        - receive-remote-write
        acl:
          admin: admin
```

### Relation view

```mermaid
graph LR
subgraph lxd
ub --- ga
ubu --- ga
zk --- ga
end
subgraph microk8s
prom
loki
end
ga --- prom
ga --- loki
```

### Machine view

```mermaid
graph TD
subgraph machine-0
ub/0
ubu/0
subgraph subord1[subordinates]
ga/0
ga/1
end
ub/0 --- ga/0
ubu/0 --- ga/1
end
subgraph machine-1
zk/0
subgraph subord2[subordinates]
ga/2
end
zk/0 --- ga/2
end
```
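Assuming the labelling behaviour described in this issue, my expectation for the topology that each principal's host metrics end up with in this repro would be roughly as follows (illustrative sketch, not verified output):

```yaml
# machine-0 hosts two principals, so two grafana-agent (ga) units run there.
machine-0:
  ub/0: {juju_application: ga, juju_unit: ga/0}
  ubu/0: {juju_application: ga, juju_unit: ga/1}
# machine-1 has a single principal and a single subordinate.
machine-1:
  zk/0: {juju_application: ga, juju_unit: ga/2}
```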
Just wanted to mention that scenario 2 above is unsupported only for now. We plan on supporting it in the future.
Hi @lathiat,
@sed-i This issue needs to be reopened: rev 88 on latest/edge has exactly the same issue as described here.
## Enhancement Proposal
When deploying cos-lite edge and relating it against a Ceph deployment, both the Loki logs and host metrics (e.g. CPU/Disk/etc) are labelled according to the grafana-agent subordinate.
Instead of appearing as ceph-mon/0, ceph-osd/{0,1,2} and ceph-rgw/{0,1,2}, the units all appear in Loki tagged as juju_application=grafana-agent, juju_unit=grafana-agent/{0,1,2,3,4,5,6,7}, and I cannot filter for the ceph-osd application. Similarly, under a Grafana dashboard such as "System Resources", the hostname is {MODEL_NAME}-{MODEL_UUID}_grafana-agent_grafana-agent/7.

This is not really helpful: as a user of the system I need to be able to easily drill down or select hosts based on the application, like ceph-osd, and having to translate from ceph-osd/N to a sea of grafana-agent/N is not practical.
Under the LMA stack, they would be tagged with the principal charm name instead, which is much more useful.
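As a concrete (hypothetical) illustration of that difference for a single ceph-osd unit, with the unit numbers made up for the example:

```yaml
# What the unit's streams are labelled as today (this issue):
current:
  juju_application: grafana-agent
  juju_unit: grafana-agent/7
# What LMA-style labelling by principal would give, and what I need to filter on:
lma_style:
  juju_application: ceph-osd
  juju_unit: ceph-osd/2
```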
It seems this change was very recently made in #47 to fix another issue (#17). However, it is now very difficult to actually use the dashboards.

In some cases it may be possible to resolve this by adding additional labels such as the principal application/unit; however, that won't help so much for the "Hostname" side of things. So some more thought is needed into balancing this usability with the requirements of the original issue.