OCPBUGS-39417: Add service account and token for service monitoring #613

Merged

Vincent056

This adds the service account and token needed for ServiceMonitoring. It creates a new service account, compliance-operator-metrics, and a metrics token for that ServiceAccount; we then use that token for ServiceMonitoring.

@openshift-ci-robot
Collaborator

@Vincent056: This pull request references Jira Issue OCPBUGS-39417, which is invalid:

  • expected the bug to target the "4.18.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

github-actions bot commented Sep 3, 2024

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:613-8c0c61d012a8f8dd61bac7656d4f1eaa317d6910

github-actions bot commented Sep 3, 2024

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:613-878149a98da2174ebbfaa1e9f8589036810f22b3

github-actions bot commented Sep 3, 2024

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:613-2969604ddf28fd832b367f42f88cf8911c987285

@rhmdnd rhmdnd added this to the 1.6.0 milestone Sep 3, 2024

rhmdnd commented Sep 3, 2024

Note for reviewers: we pulled in new Prometheus dependencies in #491, which made these changes necessary.

rhmdnd commented Sep 3, 2024

/jira refresh

@openshift-ci-robot
Collaborator

@rhmdnd: This pull request references Jira Issue OCPBUGS-39417, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @xiaojiey


rhmdnd commented Sep 3, 2024

Looks like we have some cleanup we need to handle in the tests:

 2024/09/03 19:47:28 failed to delete Secret: secrets "metrics-token" not found
FAIL	github.com/ComplianceAsCode/compliance-operator/tests/e2e/parallel	606.572s
FAIL

Collaborator

xiaojiey commented Sep 4, 2024

@Vincent056 With this PR, I can no longer reproduce the failure logs from CO v1.5.1 (looking at logs after 2024-09-04T09:29:20). However, I did find some other issues.

  1. I first installed the released CO v1.5.1, and the warning "AlertmanagerReceiversNotConfigured" fired. After I uninstalled CO and reinstalled it with the ghcr.io/complianceascode/compliance-operator-catalog:613-2969604ddf28fd832b367f42f88cf8911c987285 index image, the warning was not cleared. Could you please take a look? Thanks. I am wondering whether customers would hit the same issue when upgrading from CO v1.5.1 to a higher version.
  2. There are more errors in the k8s pods. They might be related to @rhmdnd's comment above.
% oc logs --selector prometheus=k8s --all-containers -n openshift-monitoring
...
level=info ts=2024-09-04T06:11:14.210945035Z caller=reloader.go:270 msg="reloading via HTTP"
ts=2024-09-04T09:52:42.364Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:556: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-compliance\""
ts=2024-09-04T09:52:42.365Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:556: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-compliance\""
ts=2024-09-04T09:53:01.051Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:555: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"openshift-compliance\""
ts=2024-09-04T09:53:01.052Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:555: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"openshift-compliance\""
ts=2024-09-04T09:53:37.853Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:554: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-compliance\""
ts=2024-09-04T09:53:37.853Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:554: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-compliance\""
ts=2024-09-04T09:53:41.702Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:556: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-compliance\""
ts=2024-09-04T09:53:41.702Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:556: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-compliance\""
ts=2024-09-04T09:53:44.860Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:555: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"openshift-compliance\""
ts=2024-09-04T09:53:44.860Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:555: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"openshift-compliance\""
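
The log above repeats the same few denials many times. As a rough aid for reading it, here is a small sketch that extracts the distinct (user, verb, resource, namespace) tuples from such lines. The regular expression is written against the message format shown above only, the type and function names are hypothetical, and the sample lines in `main` are shortened copies of the log.

```go
package main

import (
	"fmt"
	"regexp"
)

// forbiddenRE matches the "User ... cannot <verb> resource ... in the namespace ..."
// part of the messages above, with or without backslash-escaped quotes.
var forbiddenRE = regexp.MustCompile(
	`User \\?"([^"\\]+)\\?" cannot (\w+) resource \\?"([^"\\]+)\\?" in API group \\?"[^"\\]*\\?" in the namespace \\?"([^"\\]+)\\?"`)

// denial is one distinct RBAC denial found in the logs.
type denial struct {
	User, Verb, Resource, Namespace string
}

// uniqueDenials returns each distinct denial once, in first-seen order.
func uniqueDenials(lines []string) []denial {
	seen := make(map[denial]bool)
	var out []denial
	for _, line := range lines {
		m := forbiddenRE.FindStringSubmatch(line)
		if m == nil {
			continue // not a "forbidden" message
		}
		d := denial{User: m[1], Verb: m[2], Resource: m[3], Namespace: m[4]}
		if !seen[d] {
			seen[d] = true
			out = append(out, d)
		}
	}
	return out
}

func main() {
	// Shortened sample lines in the same format as the log above.
	lines := []string{
		`level=warn msg="failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-compliance\""`,
		`level=error msg="Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-compliance\""`,
		`level=warn msg="failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"openshift-compliance\""`,
	}
	for _, d := range uniqueDenials(lines) {
		fmt.Printf("%s cannot %s %s in %s\n", d.User, d.Verb, d.Resource, d.Namespace)
	}
}
```

Applied to the full log, this would reduce the output to the three resources involved (pods, services, endpoints), all denied to the prometheus-k8s service account in openshift-compliance.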

github-actions bot commented Sep 5, 2024

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:613-6521a99aa5f53fe9dc7c258539827cdb77e1795b

Collaborator

xiaojiey commented Sep 5, 2024

@Vincent056 With this PR, I can no longer reproduce the failure logs. However, I can still see the "AlertmanagerReceiversNotConfigured" warning alert in the GUI after a fresh install of the Compliance Operator with index image ghcr.io/complianceascode/compliance-operator-catalog:613-6521a99aa5f53fe9dc7c258539827cdb77e1795b. I haven't figured out whether it is related to CO or not. I will post an update later.

% oc logs --selector app.kubernetes.io/name=prometheus-operator --all-containers -n openshift-monitoring | grep -i compliance
% oc logs --selector prometheus=k8s --all-containers -n openshift-monitoring | grep -i compliance
% oc get servicemonitor metrics -o=jsonpath='{.spec.endpoints[*]}' | jq -r
{
  "port": "metrics"
}
{
  "authorization": {
    "credentials": {
      "key": "token",
      "name": "metrics-token"
    },
    "type": "Bearer"
  },
  "path": "/metrics-co",
  "port": "metrics-co",
  "scheme": "https",
  "tlsConfig": {
    "ca": {},
    "caFile": "/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt",
    "cert": {},
    "serverName": "metrics.openshift-compliance.svc"
  }
}
 % oc get secret metrics-token -n openshift-compliance           
NAME            TYPE                                  DATA   AGE
metrics-token   kubernetes.io/service-account-token   4      61m

github-actions bot commented Sep 5, 2024

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:613-8071c280dd2394598f9b7862d86a3f73abc121c0

github-actions bot commented Sep 5, 2024

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:613-b0626374d3474cae7065809df18b77a02b4deb83

@Vincent056
Author

/retest

@Vincent056 Vincent056 force-pushed the metric_token branch 2 times, most recently from 9d66805 to 73ffcac Compare September 9, 2024 07:49

github-actions bot commented Sep 9, 2024

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:613-9d66805687bd0b3aa64c137a70a840836f274b41

github-actions bot commented Sep 9, 2024

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:613-73ffcacc31a31b647d95b21883d980efccb8889c

@Vincent056
Author

/retest

github-actions bot commented Sep 9, 2024

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:613-42116198f5649963acc70cbbb34cc3fc83ad945b

Add the service account and token needed for ServiceMonitoring. This creates a new service account, compliance-operator-metrics, and a metrics token for it; that token is then used for ServiceMonitoring.

github-actions bot commented Sep 9, 2024

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:613-0ecb94e61c518fd40c7fd82cf93965f333b00ba5

github-actions bot commented Sep 9, 2024

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:613-53656f1b0b09acb9fc56509cfdca2dcbf70f9ead


for _, metric := range metrics {
	if metric.Health != "up" {
		return fmt.Errorf("Metric %s is not up. LastError: %s", metric.Labels, metric.LastError)
Collaborator

I am not sure whether this is a good test case for bug OCPBUGS-39417. With the last released version, CO v1.5.1, I can see that the health status of every target is up.
% oc get csv
NAME                         DISPLAY               VERSION   REPLACES                     PHASE
compliance-operator.v1.5.1   Compliance Operator   1.5.1     compliance-operator.v1.5.0   Succeeded
% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091/api/v1/targets | jq -r >> without_PR_613.log
% cat without_PR_613.log| grep -i '"health":'| grep -Ev up
%

I think that might be because the bug manifested on version 1.5.1. I was using 1.5.0 on a cluster and noticed those metrics endpoints are up.

@xiaojiey
Collaborator

@Vincent056 I am a little confused by this bug: what is the token in the ServiceMonitor used for? Apart from the error logs in openshift-monitoring, I didn't see any other effect; the metrics and alerts are shown normally on the web console.

@Vincent056
Author

@Vincent056 I am a little confused by this bug: what is the token in the ServiceMonitor used for? Apart from the error logs in openshift-monitoring, I didn't see any other effect; the metrics and alerts are shown normally on the web console.

I think a token is required to set up service monitoring. We used to use the token file mounted in the pod, but that is no longer supported with the new Prometheus dependency.

rhmdnd commented Sep 11, 2024

I was able to experiment with this PR in a cluster and did the following:

  1. Created a new service account (oc create sa -n openshift-compliance prometheus-test)
  2. Granted the service account permissions (oc adm policy add-cluster-role-to-user cluster-monitoring-view -z prometheus-test -n openshift-compliance)
  3. Fetched metrics using the following command (from Vincent's test)
oc run -i --rm --restart=Never --image=registry.fedoraproject.org/fedora-minimal:latest -n openshift-compliance metrics-test --overrides='{"spec": {"serviceAccountName": "prometheus-test"}}' -- bash -c 'TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) && curl -k -s https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091/api/v1/targets --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -H "Authorization: Bearer $TOKEN" -H "Accept: application/json"'
  4. Scrubbed the output to be valid JSON (truncate the last 25 characters or so)
  5. Checked the endpoint health
$ cat pr-metrics.json| jq '.data.activeTargets[]|select(.labels.namespace == "openshift-compliance") .health'
"up"
"up"

I repeated those same steps with CO 1.5.1 and I wasn't able to find any metrics for the openshift-compliance namespace.
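
The scrub-and-check steps above can be sketched in Go as well. This is an illustration under assumptions, not the test code from this PR: it assumes the extra trailing text comes after the JSON document (presumably the pod-deletion notice printed by `oc run --rm`), and it relies on `json.Decoder` reading only the first JSON value, so no manual truncation is needed. The struct fields follow the keys used by the jq filter above; the function name and the sample data in `main` are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// target holds the fields of an active target that the checks above rely on.
type target struct {
	Labels    map[string]string `json:"labels"`
	Health    string            `json:"health"`
	LastError string            `json:"lastError"`
}

// targetsResponse mirrors the part of the /api/v1/targets response used here.
type targetsResponse struct {
	Data struct {
		ActiveTargets []target `json:"activeTargets"`
	} `json:"data"`
}

// healthInNamespace decodes the first JSON value in raw, ignoring any trailing
// text, and returns the health of every active target in the given namespace.
func healthInNamespace(raw, namespace string) ([]string, error) {
	var resp targetsResponse
	if err := json.NewDecoder(strings.NewReader(raw)).Decode(&resp); err != nil {
		return nil, fmt.Errorf("decoding targets response: %w", err)
	}
	var health []string
	for _, t := range resp.Data.ActiveTargets {
		if t.Labels["namespace"] == namespace {
			health = append(health, t.Health)
		}
	}
	return health, nil
}

func main() {
	// Made-up sample: a tiny targets response followed by non-JSON trailing text.
	raw := `{"status":"success","data":{"activeTargets":[` +
		`{"labels":{"namespace":"openshift-compliance"},"health":"up"},` +
		`{"labels":{"namespace":"other"},"health":"down"}]}}` +
		`pod "metrics-test" deleted`
	health, err := healthInNamespace(raw, "openshift-compliance")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(health)
}
```

An empty result here corresponds to the CO 1.5.1 case described above, where no targets for the openshift-compliance namespace were found.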

@xiaojiey Do those steps also work for you?

rhmdnd commented Sep 11, 2024

I was able to verify the metrics on version 1.5.0 using the same steps noted above, and confirmed they're up too.

const prometheusCommand = `TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) && { curl -k -s https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091/api/v1/targets --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -H "Authorization: Bearer $TOKEN"; }`
namespace := f.OperatorNamespace
out, err := runOCandGetOutput([]string{
"run", "--rm", "-i", "--restart=Never", "--image=registry.fedoraproject.org/fedora:latest",

We could probably get away with only fedora-minimal here, but we can patch that after.

@xiaojiey
Collaborator

@Vincent056 I was able to reproduce the issue from the bug with a 4.15 payload. With this PR, the issue is resolved: the targets are up, and the metrics and alerts work well. What's more, the fix works for all OCP releases.

% token=`oc create token prometheus-k8s -n openshift-monitoring`
% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091/api/v1/targets | jq -r  > target.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8870k    0 8870k    0     0  6771k      0 --:--:--  0:00:01 --:--:-- 6771k
% cat target.json| jq '.data.activeTargets[]|select(.labels.namespace == "openshift-compliance") .health'
"up"
"up"

@xiaojiey
Collaborator

/label qe-approved

@openshift-ci-robot
Collaborator

@Vincent056: This pull request references Jira Issue OCPBUGS-39417, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @xiaojiey

@openshift-ci openshift-ci bot requested a review from xiaojiey September 12, 2024 06:30

@rhmdnd rhmdnd left a comment

/lgtm

openshift-ci bot commented Sep 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhmdnd, Vincent056

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit fed54b4 into ComplianceAsCode:master Sep 12, 2024
16 checks passed
@openshift-ci-robot
Collaborator

@Vincent056: Jira Issue OCPBUGS-39417: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-39417 has been moved to the MODIFIED state.
