From 7b91a9695fc98ec430540f65c55abe873e8653b8 Mon Sep 17 00:00:00 2001
From: mrobson
Date: Thu, 21 Nov 2024 19:20:47 -0500
Subject: [PATCH] mod4

---
 content/modules/ROOT/nav.adoc             |  22 ++-
 content/modules/ROOT/pages/module-04.adoc | 166 +++++++++++++++++++++-
 2 files changed, 179 insertions(+), 9 deletions(-)

diff --git a/content/modules/ROOT/nav.adoc b/content/modules/ROOT/nav.adoc
index e396a99..43848be 100644
--- a/content/modules/ROOT/nav.adoc
+++ b/content/modules/ROOT/nav.adoc
@@ -9,18 +9,26 @@
 *** xref:module-01.adoc#ocpinsightsintro[ocp_insights - Parse and view your Insights data from the Insights Operator]
 * xref:module-02.adoc[2. Intro to omc]
+** xref:module-02.adoc#gettingstarted[Getting Started]
+** xref:module-02.adoc#certs[Checking Cluster Certs]
+** xref:module-02.adoc#etcd[Reviewing etcd]
+** xref:module-02.adoc#haproxy[Reviewing HAProxy Backends]
+** xref:module-02.adoc#node-logs[Reviewing Control Plane Node-Logs]
+** xref:module-02.adoc#ovn[Reviewing OVN Subnets]
+** xref:module-02.adoc#prometheus[Reviewing Prometheus Alert Groups, Rules, and Targets]
 * xref:module-03.adoc[3. vSphere IPI - I can not scale up any new nodes]
-** xref:module-03.adoc#gettingstarted[Getting Started]
-** xref:module-03.adoc#certs[Checking Cluster Certs]
-** xref:module-03.adoc#etcd[Reviewing etcd]
-** xref:module-03.adoc#haproxy[Reviewing HAProxy Backends]
-** xref:module-03.adoc#node-logs[Reviewing Control Plane Node-Logs]
-** xref:module-03.adoc#ovn[Reviewing OVN Subnets]
-** xref:module-03.adoc#prometheus[Reviewing Prometheus Alert Groups, Rules, and Targets]
+** xref:module-03.adoc#checknodes[Check the nodes and the machines]
+** xref:module-03.adoc#checkmachineapi[Check the Machine API]
+** xref:module-03.adoc#checkserver[Check the Server]
+** xref:module-03.adoc#findtheissue[Finding the Issue]
 * xref:module-04.adoc[4. What is overloading my API?]
 ** xref:module-04.adoc#theapi[What is hitting my API?]
+** xref:module-04.adoc#explore[Explore the `kubectl-dev_tool audit` command]
+** xref:module-04.adoc#firstrun[Run a command]
+** xref:module-04.adoc#theissue[Checking for issues]
+** xref:module-04.adoc#thedata[Evaluate and dig deeper]
 * xref:module-05.adoc[5. ]

diff --git a/content/modules/ROOT/pages/module-04.adoc b/content/modules/ROOT/pages/module-04.adoc
index a67ac48..e129c19 100644
--- a/content/modules/ROOT/pages/module-04.adoc
+++ b/content/modules/ROOT/pages/module-04.adoc
@@ -1,7 +1,7 @@
 = What is overloading my API?
 :prewrap!:
 
-A customer reported an issue where access to the OpenShift API was intermittent or slow. +
+A customer reported an issue where access to the OpenShift API was intermittent and/or slow. +
 
 .The customer provided the following information:
 ************************************************
@@ -15,13 +15,175 @@
 These probe failures caused several pods in a deployment or statefulset to restart.
 
 How do we check who, what, and how often something is hitting the API of an OpenShift cluster?
 
+In this lab we are going to explore the `OpenShift Cluster Debug Tools`, specifically `kubectl-dev_tool` and its `audit` subcommand.
+
+https://github.com/openshift/cluster-debug-tools/
+
 In addition to the standard must-gather, there are a number of additional variants for collecting data for special use cases. For this exercise, we asked the customer to collect the audit logs from their cluster so we can analyze every request going to the API.
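+
+Each line in those audit logs is a JSON event describing a single API request, following the standard Kubernetes `audit.k8s.io/v1` schema. A busy cluster produces millions of these, which is why we want tooling to aggregate them. As a rough illustration (a hand-written, heavily abbreviated sample, not output from this cluster, and exact file names will vary):
+
+[source,bash]
+----
+# Peek at a single raw audit event (illustrative values only)
+head -n 1 audit_logs/kube-apiserver/audit.log
+
+{"kind":"Event","apiVersion":"audit.k8s.io/v1","verb":"get","user":{"username":"user@identity"},"requestURI":"/api/v1/nodes","responseStatus":{"code":200},...}
+----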
 [TIP]
 =====
-Use the following command to collect an `audit log` `must-gather`:
+You can use the following command to collect an `audit log` `must-gather`:
 
 oc adm must-gather -- /usr/bin/gather_audit_logs
 =====
+
+[#explore]
+== Explore the `kubectl-dev_tool audit` command
+
+Let's take a look at the `kubectl-dev_tool audit` subcommand by running it with the `-h` flag to view the help screen.
+
+On this screen, you will find 5 examples to give you some understanding of how to use the tool.
+
+[TIP]
+====
+When using the `-f` flag, you can pass a single file, as in the example, or an entire directory:
+
+`-f 'audit_logs/kube-apiserver'`
+====
+
+[#firstrun]
+== Run a command
+
+Let's run a command against the `kube-apiserver` audit logs and see what kind of information we get back.
+
+Depending on the amount of data and the complexity of the search, it can take some time to process.
+
+Using the example from the help page, we can run a command like this:
+
+====
+kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=get --resource='*.*' --resource='-subjectaccessreviews.*' --resource='-tokenreviews.*' | more
+====
+
+The first thing you will see is a summary line showing the `count` (number of records), followed by the time range and duration that the audit logs cover.
+
+====
+count: 1149712, first: 2022-02-07T17:15:16-05:00, last: 2022-02-07T19:59:59-05:00, duration: 2h44m42.591072s
+====
+
+After that, you will see all of the returned data.
+
+In this case, we are looking at individual `GET` requests: how long each took, what it accessed, and who made the request.
+
+[source,bash]
+----
+23:05:05 [      GET][    1.297ms] [200] /api/v1/namespaces/openshift-console-user-settings/configmaps/user-settings-ec294610-20a8-4878-plmb7-08aa00a5c0f2 [user@identity]
+----
+
+[NOTE]
+====
+The larger and busier the cluster, the shorter the time period the on-cluster audit logs will cover.
+
+If you have a requirement to retain or review audit logs over the long term, you need to forward them to an external logging system.
+====
+
+[#theissue]
+== Checking for issues
+
+Now that we know a little bit about the `kubectl-dev_tool`, let's look at our `must-gather` for potential issues.
+
+[TIP]
+====
+There is a helpful flag, `-o top`, which will count, group, and display the top 10 entries for your search.
+====
+
+Start by taking a high-level view. You can be both broad and granular with audit logs, but unless you know exactly what you are looking for, it is best to cast a wide net.
+
+Look at the top usage for the common `--by=` groupings like `resource` and `user`.
+
+.Click to show some commands if you need a hint
+[%collapsible]
+====
+[source,bash]
+----
+kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=resource -otop
+
+kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=user -otop
+----
+====
+
+[#thedata]
+== Evaluate and dig deeper
+
+We spotted something suspicious, so let's drill down a little deeper.
+
+[TIP]
+====
+When evaluating the data, always factor in things like the total number of requests, the time period covered, and the number of nodes in the cluster.
+====
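+
+For example, here is a quick back-of-the-envelope sketch using the summary line from our first run (the numbers come straight from the `count:` line shown earlier):
+
+[source,bash]
+----
+# count: 1149712 requests over a duration of 2h44m42s (~9882 seconds)
+echo $(( 1149712 / 9882 ))
+# => 116, i.e. roughly 116 requests per second hitting the kube-apiserver
+----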
+
+.Click to show some details if you need a hint
+[%collapsible]
+====
+Our top 3 resources from the previous command were `nodes`, `configmaps`, and `pods`:
+
+----
+464191x v1/nodes
+372952x v1/configmaps
+357233x v1/pods
+----
+
+Our top 3 users from the previous command were `sysdig-agent`, `apiserver`, and `openshift-apiserver-sa`:
+
+----
+446278x system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent
+76068x system:apiserver
+63661x system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
+----
+====
+
+One of those users sticks out a lot, but let's first take a look at our top 3 resources. For this we can use the `--resource=` flag, in addition to `--by=` and `-o top`, to drill down into a specific resource.
+
+.Click to show some details if you need a hint
+[%collapsible]
+====
+[source,bash]
+----
+kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=nodes -otop --by=user
+kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=configmaps -otop --by=user
+kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=pods -otop --by=user
+----
+====
+
+The data for `configmaps` and `pods` is spread fairly evenly across a variety of users. There are no obvious bad actors.
+
+But for `nodes`, there is one big outlier, and it aligns with the outlier we saw in the earlier `--by=user` output.
+
+So let's take a look at that specific user and see what they are doing. You can do this by passing in the `--user=` flag along with `--by=verb` and `-o top`.
+
+Let's try to answer the following:
+
+. What is the user doing?
+. What is the problem?
+
+.Click to show some details if you need a hint
+[%collapsible]
+====
+[source,bash]
+----
+kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --user=system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent --by=verb -otop
+----
+
+What we see is very interesting:
+
+. The majority are `GET` requests to the `/proxy/metrics` endpoint of every node.
+. They are all returning a `403` error.
+
+----
+Top 10 "GET" (of 440076 total hits):
+  8313x [   274.335µs] [403-8312] /api/v1/nodes/cluster-app-38.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
+  8309x [   272.092µs] [403-8308] /api/v1/nodes/cluster-app-25.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
+  8308x [   270.327µs] [403-8307] /api/v1/nodes/cluster-app-02.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
+----
+
+The conclusion is that something is wrong with the sysdig monitoring component: it is failing authentication and spamming the API server while trying to collect metrics.
+====
+
+I hope you found this introduction to the `kubectl-dev_tool` useful and can leverage it the next time you have an issue!
+
+[TIP]
+====
+You don't need an overloaded API or a performance issue to take a look at the audit logs.
+
+The audit logs and `kubectl-dev_tool` are equally useful when you want to understand who or what did something in your cluster.
+
+Something deleted your pod? That's in the audit logs! Use the `kubectl-dev_tool` to find out who did it and when!
+====
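+
+As a parting sketch, here is one way you might chase down that deleted pod. This is an illustration, not output from a real cluster: it reuses the same must-gather path as above and combines only flags covered in this module (`delete` is a standard audit verb, like the `get` we filtered on earlier).
+
+[source,bash]
+----
+# Who has been deleting pods, and how often?
+kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=delete --resource=pods --by=user -otop
+----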