
Initial stab at facilitating /proc/stat sampling #755

Closed
oschaaf wants to merge 2 commits

Conversation

@oschaaf (Contributor) commented Feb 22, 2020

This contains a first stab at a service which can be deployed to containers to sample /proc/stat and expose the results in Prometheus format over HTTP. Creating an early PR for discussion.

This also has a shell script that builds a standalone binary for the Python service and deploys/starts it in the containers that run the Istio-injected sidecar proxies. On top of that, there are a few bits and pieces that might come in useful later if we want to stream the raw time-series data to disk for later inspection.
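
To give a feel for the shape of such a service, here is a minimal illustrative sketch (not the code in this PR): it assumes the prometheus_client package, uses made-up metric names, keeps /proc/stat open and re-reads it once per second, and serves /metrics over HTTP.

```python
# Illustrative sketch only; assumes prometheus_client, metric names are hypothetical.
import time

from prometheus_client import Gauge, start_http_server

# Field order of the aggregate "cpu" line in /proc/stat.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
CPU_JIFFIES = Gauge("proc_stat_cpu_jiffies",
                    "Aggregate CPU time read from /proc/stat", ["mode"])

def sample(stat_file):
    stat_file.seek(0)                      # re-read the same open file
    for line in stat_file.read().splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":    # aggregate line only
            for mode, value in zip(FIELDS, parts[1:]):
                CPU_JIFFIES.labels(mode=mode).set(int(value))

if __name__ == "__main__":
    start_http_server(8000)                # exposes /metrics in Prometheus text format
    with open("/proc/stat") as stat:       # opened once for the lifetime of the process
        while True:
            sample(stat)
            time.sleep(1)
```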

Please see README.md for more PR details.

See README.md for contents

Signed-off-by: Otto van der Schaaf <[email protected]>
Signed-off-by: Otto van der Schaaf <[email protected]>
@istio-testing added the do-not-merge/work-in-progress label (blocks merging of a PR because it isn't ready yet) on Feb 22, 2020
@googlebot added the cla: yes label (set by the Google CLA bot to indicate the author of a PR has signed the Google CLA) on Feb 22, 2020
@istio-testing added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Feb 22, 2020
@istio-testing (Contributor) commented:

Hi @oschaaf. Thanks for your PR.

I'm waiting for an istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@howardjohn (Member) left a comment:

Sorry, I am not too familiar with this; just wondering how it compares/relates to https://github.com/prometheus/node_exporter?

@mandarjog (Contributor) commented:

Node exporter looks very promising. It has great coverage. @oschaaf, if it supports a higher sampling rate and the ability to run on VMs and bare metal, then we have full coverage.

@oschaaf (Contributor, Author) commented Feb 23, 2020

Looking at node exporter, this is similar, yet different. Summarizing findings after a short exploration:

  • Stats collection is independent. Node exporter ties stats collection to its networking code: no polling of /metrics means no sampling.
  • To help correlate latency and metrics in time, this optionally streams the raw samples to storage, independently of the metrics exposed over HTTP.
  • This opens /proc/stat once and subsequently calls read() once per second (a minimal sketch of this pattern follows the list). Node exporter repeatedly opens/reads/closes /sys/devices/system/cpu/cpuXXX/[frequency|throttling|etc]. The overhead of getting to the crown jewels (irq metrics are the goal) seems to be determined by two factors: the rate at which the metrics are polled and the CPU core count.
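
As a minimal sketch of that open-once pattern (illustrative, not the PR's code; the buffer size and the printed rate are arbitrary choices), one file descriptor is opened up front and each sample is a single pread plus parsing of the "intr" line, whose first field is the total interrupt count since boot:

```python
# Sketch of the open-once /proc/stat read loop.
import os
import time

fd = os.open("/proc/stat", os.O_RDONLY)
prev = None
while True:
    raw = os.pread(fd, 1 << 20, 0).decode()   # one read per sample, always from offset 0
    # The "intr" line starts with the total interrupt count since boot.
    intr_total = int(raw.split("\nintr ", 1)[1].split(None, 1)[0])
    if prev is not None:
        print(f"interrupts/sec: {intr_total - prev}")
    prev = intr_total
    time.sleep(1)
```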

I might be overly paranoid, but on the one hand it seems to me that doing less means less (chance of) introducing noise into the measurements we actually care about. Directly streaming the collected samples to storage also seems like a good way to make sure all captured information is retained for future reference (or at least to know when something is lost).
OTOH, I can also see that less code is more, and not having a DIY solution to maintain is worth a lot. Plus, having easy (opt-in) access to the full set of metrics exposed by node exporter could be handy.
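
For the streaming-to-storage idea, a sketch of what retaining every raw sample could look like (again illustrative; the file path and the newline-delimited JSON format are assumptions, not what this PR implements):

```python
# Illustrative sketch: append every raw /proc/stat sample to disk with a timestamp.
import json
import time

def stream_samples(path="/tmp/proc_stat_samples.ndjson", interval=1.0):
    with open("/proc/stat") as stat, open(path, "a") as out:
        while True:
            stat.seek(0)
            record = {"ts": time.time(), "raw": stat.read()}
            out.write(json.dumps(record) + "\n")
            out.flush()                    # keep samples on disk even if the process dies
            time.sleep(interval)
```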

@mandarjog (Contributor) commented:

@oschaaf, apologies for the delayed response.

I think using an existing tool is preferable until it proves to be an issue, e.g. excessive sampling cost. If excessive re-opening of files becomes a problem, it would be better to send PRs to node_exporter.

I think we should document this work and re-open it in the future if node_exporter is unworkable.

@oschaaf (Contributor, Author) commented Apr 15, 2020

Closing this one; instead, #890 will bring up node_exporter when the profileMode introduced there is enabled in `values.yaml`.

@oschaaf closed this on Apr 15, 2020