Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(grafana): relative CPU usage #17401

Merged
merged 8 commits into from
Jun 27, 2024
Merged

chore(grafana): relative CPU usage #17401

merged 8 commits into from
Jun 27, 2024

Conversation

CAJan93
Copy link
Contributor

@CAJan93 CAJan93 commented Jun 21, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

We get alerts that are relative to the CPU limit of the container. Let's add a panel that reflects this CPU usage

Screenshot 2024-06-21 at 15 28 27

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@tabVersion tabVersion changed the title relative CPU usage chore(grafana): relative CPU usage Jun 21, 2024
Comment on lines 29 to 30
./risedev configure # make sure that grafana is turned on
./risedev dev full
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess update means there already has been one grafana running 🤔 no need to call risedev full to start a new cluster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will undo this change. I added it here, because I did not have a cluster running 😁

@Li0k
Copy link
Contributor

Li0k commented Jun 21, 2024

What's the difference with the old CPU usage metrics?

f"sum(rate({metric('process_cpu_seconds_total')}[$__rate_interval])) by ({COMPONENT_LABEL}, {NODE_LABEL}) / avg({metric('process_cpu_core_num')}) by ({COMPONENT_LABEL}, {NODE_LABEL}) > 0",
"cpu usage (avg per core) - {{%s}} @ {{%s}}"
% (COMPONENT_LABEL, NODE_LABEL),
),
],
),
panels.timeseries_cpu(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is depend on k8s metrics, that means we need a flag to ensure it only can be used in Cloud ENV

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me look into that... So far I was fine with the way it is. In the risedev env it will simply show no data

@Nebulazhang
Copy link
Contributor

Nebulazhang commented Jun 23, 2024

What's the difference with the old CPU usage metrics?

This panel is the CPU usage utilization. For example, we allocated 4C for the CN, but CN used 3C, that means the utilization is 75%.

@CAJan93
Copy link
Contributor Author

CAJan93 commented Jun 26, 2024

What's the difference with the old CPU usage metrics?

@Nebulazhang is right here. We also define the high CPU usage alerts (see here) as

(
  sum(rate(container_cpu_usage_seconds_total{namespace=~"rwc-.*", pod=~"risingwave-compute-default.*", container="compute"}[5m])) by (namespace, pod)
) / 
(
  sum(kube_pod_container_resource_limits{namespace=~"rwc-.*", pod=~"risingwave-compute-default.*", container="compute", resource="cpu"}) by(namespace, pod) 
) * 100 > 80

I think it would be helpful to see the same numbers in the dev dashboard when debugging.

Copy link
Contributor

@Nebulazhang Nebulazhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@CAJan93
Copy link
Contributor Author

CAJan93 commented Jun 26, 2024

@tabVersion The panel will not work in non-k8s environments (e.g. risedev). In those cases it will show that no data is available. I added a description that points that out. Are you fine with that? If not we would need some mechanism to hide panels without data, but I feel like this is overkill 😆

@CAJan93 CAJan93 added this pull request to the merge queue Jun 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 26, 2024
@CAJan93 CAJan93 added this pull request to the merge queue Jun 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 26, 2024
@CAJan93 CAJan93 added this pull request to the merge queue Jun 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 26, 2024
@CAJan93 CAJan93 added this pull request to the merge queue Jun 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 27, 2024
@CAJan93 CAJan93 enabled auto-merge June 27, 2024 06:55
@CAJan93 CAJan93 added this pull request to the merge queue Jun 27, 2024
Merged via the queue into main with commit f75a8aa Jun 27, 2024
31 checks passed
@CAJan93 CAJan93 deleted the cajan93/cpu-relative branch June 27, 2024 07:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants