Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stacked 5/5] metrics: add topology-aware policy metrics collection. #406

Merged
merged 3 commits into from
Nov 20, 2024

Conversation

klihub
Copy link
Collaborator

@klihub klihub commented Nov 11, 2024

Notes: This PR is stacked on top of #405.

Implement metrics collection for the topology-aware policy. Currently we collect for each pool/zone
- name, cpuset and memset
- shared pool capacity, allocation, available amount
- memory capacity, allocation, available amount
- number of containers
- number of containers in the shared pool

@klihub klihub changed the title [4/4] metrics: add topology-aware policy metrics collection. [5/5] metrics: add topology-aware policy metrics collection. Nov 11, 2024
Copy link

@pfl pfl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

pkg/resmgr/lib/memory/zones.go Show resolved Hide resolved
@klihub klihub force-pushed the metrics/topology-aware branch 3 times, most recently from 3bb93cf to 4ea6ec6 Compare November 13, 2024 09:44
@klihub klihub marked this pull request as ready for review November 13, 2024 13:30
@klihub klihub changed the title [5/5] metrics: add topology-aware policy metrics collection. [stack: 5/5] metrics: add topology-aware policy metrics collection. Nov 13, 2024
@klihub klihub changed the title [stack: 5/5] metrics: add topology-aware policy metrics collection. [stacked 5/5] metrics: add topology-aware policy metrics collection. Nov 13, 2024
@klihub klihub force-pushed the metrics/topology-aware branch 3 times, most recently from adb67fc to d83358f Compare November 14, 2024 09:44
@klihub klihub force-pushed the metrics/topology-aware branch 9 times, most recently from de0af9d to 5e04ead Compare November 18, 2024 12:55
Copy link
Collaborator

@fmuyassarov fmuyassarov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good. There might be some minor nits but I think it is easier if we move with the it now and improve further as follow ups if required. Thank you @klihub for the great work.

Copy link
Collaborator

@askervin askervin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Huge work.

@askervin
Copy link
Collaborator

Such a nice framework. I think we can extend this easily later on to users who wish to get per-container metrics, like how many exclusive CPUs my container has, etc. And metrics overhead can by controlled by enabling/disabling this in configuration.

Hats off to @klihub.

@klihub
Copy link
Collaborator Author

klihub commented Nov 19, 2024

Such a nice framework. I think we can extend this easily later on to users who wish to get per-container metrics, like how many exclusive CPUs my container has, etc. And metrics overhead can by controlled by enabling/disabling this in configuration.

Hats off to @klihub.

Now that you mentioned, I think there's one more thing that we might need to add for computationally expensive metrics. The programmatic ability to check if a collector is enabled (probably simply by name with something like metrics.Registry.IsEnabled(name string) bool), so that expensive collectors which neither use throwaway/const metrics nor recalculate themselves during collection (but other normal operations like resource allocation and release) can know when to disable the expensive updates.

@klihub klihub force-pushed the metrics/topology-aware branch 2 times, most recently from bdf663c to 88b5ee4 Compare November 20, 2024 08:11
Implement collection of policy 'system' prometheus metrics.

We collect per each memory node
  - memory capcity
  - memory usage
  - number of containers sharing the node

We collect per each CPU core
  - allocation from that core
  - number of containers sharing the core

Signed-off-by: Krisztian Litkey <[email protected]>
Add ZoneAvailable to return the amount of available/allocatable
memory in a zone, capped by the amount of free memory in any of
the ancestors of a zone.

Signed-off-by: Krisztian Litkey <[email protected]>
Implement collection of per zone prometheus metrics.
Currently we collect for each pool/zone the following
  - name, cpuset and memset
  - shared pool capacity, allocation, available amount
  - memory capacity, allocation, available amount
  - number of containers
  - number of containers in the shared pool

Signed-off-by: Krisztian Litkey <[email protected]>
@askervin askervin merged commit 389d012 into containers:main Nov 20, 2024
3 checks passed
@klihub klihub deleted the metrics/topology-aware branch November 20, 2024 08:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants