
Initial stab at facilitating /proc/stat sampling #755

Closed
oschaaf wants to merge 2 commits

Conversation

@oschaaf (Contributor) commented Feb 22, 2020

This contains a first stab at a service which can be deployed to containers to sample /proc/stat and expose the results in Prometheus format over HTTP. Creating an early PR for discussion.

This also has a shell script that builds a standalone binary for the Python service and deploys/starts it in the containers that run the Istio-injected sidecar proxies. On top of that, there are a few bits and pieces that might come in useful later if we want to stream the raw time-series data to disk for later inspection.
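
To give a feel for the shape of such a service, here is a minimal illustrative sketch (not the code in this PR): it assumes the prometheus_client package, uses made-up metric names, keeps /proc/stat open and re-reads it once per second, and serves /metrics over HTTP.

```python
# Illustrative sketch only; assumes prometheus_client, metric names are hypothetical.
import time

from prometheus_client import Gauge, start_http_server

# Field order of the aggregate "cpu" line in /proc/stat.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
CPU_JIFFIES = Gauge("proc_stat_cpu_jiffies",
                    "Aggregate CPU time read from /proc/stat", ["mode"])

def sample(stat_file):
    stat_file.seek(0)                      # re-read the same open file
    for line in stat_file.read().splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":    # aggregate line only
            for mode, value in zip(FIELDS, parts[1:]):
                CPU_JIFFIES.labels(mode=mode).set(int(value))

if __name__ == "__main__":
    start_http_server(8000)                # exposes /metrics in Prometheus text format
    with open("/proc/stat") as stat:       # opened once for the lifetime of the process
        while True:
            sample(stat)
            time.sleep(1)
```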

Please see README.md for more PR details.

See README.md for contents

Signed-off-by: Otto van der Schaaf <[email protected]>
Signed-off-by: Otto van der Schaaf <[email protected]>
@istio-testing added the do-not-merge/work-in-progress label (blocks merging of a PR because it isn't ready yet) on Feb 22, 2020
@googlebot added the cla: yes label (set by the Google CLA bot to indicate the author of a PR has signed the Google CLA) on Feb 22, 2020
@istio-testing added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Feb 22, 2020
@istio-testing (Contributor) commented:

Hi @oschaaf. Thanks for your PR.

I'm waiting for an istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@howardjohn (Member) left a comment:

Sorry, I am not too familiar with this; just wondering how it compares/relates to https://github.com/prometheus/node_exporter?

@mandarjog (Contributor) commented:

Node exporter looks very promising. It has great coverage. @oschaaf, if it supports a higher sampling rate and the ability to run on VMs and bare metal, then we have full coverage.

@oschaaf (Contributor, Author) commented Feb 23, 2020

Looking at node exporter, this is similar, yet different. Summarizing findings after a short exploration:

  • Stats collection is independent. Node exporter ties stats collection to its networking code: no polling of /metrics means no sampling.
  • To help correlate latency and metrics in time, this optionally streams the raw samples to storage, independently of the metrics exposed over HTTP.
  • This opens /proc/stat once and subsequently calls read() once per second (a minimal sketch of this pattern follows the list). Node exporter repeatedly opens/reads/closes /sys/devices/system/cpu/cpuXXX/[frequency|throttling|etc]. The overhead of getting to the crown jewels (irq metrics are the goal) seems to be determined by two factors: the rate at which the metrics are polled and the CPU core count.
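
As a minimal sketch of that open-once pattern (illustrative, not the PR's code; the buffer size and the printed rate are arbitrary choices), one file descriptor is opened up front and each sample is a single pread plus parsing of the "intr" line, whose first field is the total interrupt count since boot:

```python
# Sketch of the open-once /proc/stat read loop.
import os
import time

fd = os.open("/proc/stat", os.O_RDONLY)
prev = None
while True:
    raw = os.pread(fd, 1 << 20, 0).decode()   # one read per sample, always from offset 0
    # The "intr" line starts with the total interrupt count since boot.
    intr_total = int(raw.split("\nintr ", 1)[1].split(None, 1)[0])
    if prev is not None:
        print(f"interrupts/sec: {intr_total - prev}")
    prev = intr_total
    time.sleep(1)
```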

I might be overly paranoid, but on the one hand it seems to me that doing less means less (chance of) introducing noise into the measurements we actually care about. Directly streaming the collected samples to storage also seems like a good way to make sure all captured information is retained for future reference (or at least to know when something is lost).
OTOH, I can also see that less code is more, and not having a DIY solution to maintain is worth a lot. Plus, having easy (opt-in) access to the full set of metrics exposed by node exporter could be handy.
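
For the streaming-to-storage idea, a sketch of what retaining every raw sample could look like (again illustrative; the file path and the newline-delimited JSON format are assumptions, not what this PR implements):

```python
# Illustrative sketch: append every raw /proc/stat sample to disk with a timestamp.
import json
import time

def stream_samples(path="/tmp/proc_stat_samples.ndjson", interval=1.0):
    with open("/proc/stat") as stat, open(path, "a") as out:
        while True:
            stat.seek(0)
            record = {"ts": time.time(), "raw": stat.read()}
            out.write(json.dumps(record) + "\n")
            out.flush()                    # keep samples on disk even if the process dies
            time.sleep(interval)
```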

@mandarjog (Contributor) commented:

@oschaaf, apologies for the delayed response.

I think using an existing tool is preferable until it proves to be an issue, e.g. excessive sampling cost. If excessive re-opening of files becomes a problem, it would be better to send PRs to node_exporter.

I think we should document this work and re-open it in the future if node_exporter is unworkable.

@oschaaf (Contributor, Author) commented Apr 15, 2020

Closing this one; instead, #890 will bring up node_exporter when the profileMode introduced there is enabled in `values.yaml`.

@oschaaf closed this on Apr 15, 2020