Commit 7b91a96: mod4
mrobson committed Nov 22, 2024 (parent: 4fa1701)
Showing 2 changed files with 179 additions and 9 deletions.
22 changes: 15 additions & 7 deletions content/modules/ROOT/nav.adoc
@@ -9,18 +9,26 @@
*** xref:module-01.adoc#ocpinsightsintro[ocp_insights - Parse and view your Insights data from the Insights Operator]
* xref:module-02.adoc[2. Intro to omc]
** xref:module-02.adoc#gettingstarted[Getting Started]
** xref:module-02.adoc#certs[Checking Cluster Certs]
** xref:module-02.adoc#etcd[Reviewing etcd]
** xref:module-02.adoc#haproxy[Reviewing HAProxy Backends]
** xref:module-02.adoc#node-logs[Reviewing Control Plane Node-Logs]
** xref:module-02.adoc#ovn[Reviewing OVN Subnets]
** xref:module-02.adoc#prometheus[Reviewing Prometheus Alert Groups, Rules, and Targets]
* xref:module-03.adoc[3. vSphere IPI - I cannot scale up any new nodes]
** xref:module-03.adoc#gettingstarted[Getting Started]
** xref:module-03.adoc#certs[Checking Cluster Certs]
** xref:module-03.adoc#etcd[Reviewing etcd]
** xref:module-03.adoc#haproxy[Reviewing HAProxy Backends]
** xref:module-03.adoc#node-logs[Reviewing Control Plane Node-Logs]
** xref:module-03.adoc#ovn[Reviewing OVN Subnets]
** xref:module-03.adoc#prometheus[Reviewing Prometheus Alert Groups, Rules, and Targets]
** xref:module-03.adoc#checknodes[Check the nodes and the machines]
** xref:module-03.adoc#checkmachineapi[Check the Machine API]
** xref:module-03.adoc#checkserver[Check the Server]
** xref:module-03.adoc#findtheissue[Finding the Issue]
* xref:module-04.adoc[4. What is overloading my API?]
** xref:module-04.adoc#theapi[What is hitting my API?]
** xref:module-04.adoc#explore[Explore the `kubectl-dev_tool audit` command]
** xref:module-04.adoc#firstrun[Run a command]
** xref:module-04.adoc#theissue[Checking for issues]
** xref:module-04.adoc#thedata[Evaluate and dig deeper]
* xref:module-05.adoc[5. ]
166 changes: 164 additions & 2 deletions content/modules/ROOT/pages/module-04.adoc
@@ -1,7 +1,7 @@
= What is overloading my API?
:prewrap!:

A customer reported an issue where access to the OpenShift API was intermittent or slow. +
A customer reported an issue where access to the OpenShift API was intermittent and/or slow. +

.The customer provided the following information:
************************************************
@@ -15,13 +15,175 @@ These probe failures caused several pods in a deployment or statefulset to restart

How do we check who, what, and how often something is hitting the API of an OpenShift cluster?

In this lab, we are going to explore the `OpenShift Cluster Debug Tools`, specifically `kubectl-dev_tool` and its `audit` subcommand.

https://github.com/openshift/cluster-debug-tools/
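
If you want to run the tool yourself, it installs as a `kubectl`/`oc` plugin. A minimal install sketch follows; the build target below is an assumption, so check the repository README for the current instructions.

[source,bash]
----
# Assumed build steps; verify against the repository README.
git clone https://github.com/openshift/cluster-debug-tools.git
cd cluster-debug-tools
go build -o kubectl-dev_tool ./cmd/kubectl-dev_tool  # assumed package path
sudo mv kubectl-dev_tool /usr/local/bin/             # picked up as a kubectl plugin
----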

In addition to the standard must-gather, there are a number of variants for collecting data for special use cases.

For this exercise, we asked the customer to collect the audit logs from their cluster so we can analyze every request going to the API.

[TIP]
=====
Use the following command to collect an `audit log` `must-gather`:
You can use the following command to collect an `audit log` `must-gather`:
oc adm must-gather -- /usr/bin/gather_audit_logs
=====
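
The gathered audit logs are typically gzipped per control-plane node. Here is a quick sketch of unpacking them; the exact directory names vary with the must-gather image, so treat these paths as illustrative:

[source,bash]
----
# Illustrative paths; the image directory name will differ on your system.
cd must-gather.local.<id>/<image-dir>/audit_logs/kube-apiserver/
ls          # one or more <node>-audit*.log.gz files
gunzip *.gz # decompress first if the tool does not read .gz files directly
----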

[#explore]
== Explore the `kubectl-dev_tool audit` command

Let's take a look at the `kubectl-dev_tool audit` subcommand by running it with the `-h` flag to view the help screen.

On this screen, you will find five examples that give you some understanding of how to use the tool.

[TIP]
====
When using the `-f` flag, you can pass a single file, as in the example, or the entire directory.
`-f 'audit_logs/kube-apiserver'`
====
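
For example, both of these invocations are valid; the single-file name here is only illustrative, while the directory form searches every rotated log at once:

[source,bash]
----
# One specific log file (illustrative file name):
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver/audit.log' --verb=get -otop
# The entire directory, covering every rotated log:
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=get -otop
----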

[#firstrun]
== Run a command

Let's run a command against the `kube-apiserver` audit logs and see what kind of information we get back.

Depending on the amount of data and the complexity of the search, it can take some time to process.

Using the example from the help page, we can run a command like this:

====
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=get --resource='*.*' --resource='-subjectaccessreviews.*' --resource='-tokenreviews.*' | more
====

The leading `-` in a `--resource` pattern excludes matching resources, so this query filters out the very frequent `subjectaccessreviews` and `tokenreviews` requests.

The first thing you will see is a summary line with the `count` (number of records), followed by the time range and duration of the period the audit logs cover.

====
count: 1149712, first: 2022-02-07T17:15:16-05:00, last: 2022-02-07T19:59:59-05:00, duration: 2h44m42.591072s
====

After that, you will see all of the returned data.

In this case, we are looking at individual `GET` requests: how long each took, what it accessed, and who made the request.

[source,bash]
----
23:05:05 [ GET][ 1.297ms] [200] /api/v1/namespaces/openshift-console-user-settings/configmaps/user-settings-ec294610-20a8-4878-plmb7-08aa00a5c0f2 [user@identity]
----

[NOTE]
====
The larger and busier the cluster, the shorter the audit log duration will be.
If you have a requirement to retain the audit logs or review them later, you need to send them to an external logging system.
====
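
A quick bit of arithmetic on the sample summary line above shows why; this is a rough estimate using the numbers from our first run:

[source,bash]
----
# 1,149,712 records retained over 2h44m42s (~9,882 seconds):
echo "scale=1; 1149712 / 9882" | bc   # ~116.3 requests/second sustained
# The busier the cluster, the faster the retained window fills and rotates away.
----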

[#theissue]
== Checking for issues

Now that we know a little bit about the `kubectl-dev_tool`, let's look at our `must-gather` for potential issues.

[TIP]
====
There is a helpful flag, `-o top`, which will count, group, and display the top 10 entries for your search.
====

Start by taking a high-level view. You can be both broad and granular with audit logs, but unless you know exactly what you're looking for, it's best to cast a wide net.

Look at the top usage for the common `--by=` groups like `resource` and `user`.

.Click to show some commands if you need a hint
[%collapsible]
====
[source,bash]
----
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=resource -otop
# and
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=user -otop
----
====

[#thedata]
== Evaluate and dig deeper

We spotted something suspicious, so let's drill down a little deeper.

[TIP]
====
When evaluating the data, always factor in things like the total number of requests, the time period, and the number of nodes.
====

.Click to show some details if you need a hint
[%collapsible]
====
Our top 3 resources from the previous command were `nodes`, `configmaps` and `pods`:
----
464191x v1/nodes
372952x v1/configmaps
357233x v1/pods
----

Our top 3 users from the previous command were `sysdig-agent`, `apiserver`, and `openshift-apiserver-sa`:
----
446278x system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent
76068x system:apiserver
63661x system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
----
====
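
Before moving on, it helps to put those raw counts in perspective against the ~9,882-second window from our first run; a back-of-the-envelope calculation:

[source,bash]
----
# The top user's request rate and share of all audited traffic:
echo "scale=1; 446278 / 9882" | bc            # ~45 requests/second from one service account
echo "scale=1; 100 * 446278 / 1149712" | bc   # ~38.8% of every request in the window
----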

One of those sticks out a lot, but let's first take a look at our top 3 resources. For this, we can use the `--resource=` flag, in addition to `--by=` and `-o top`, to drill into a specific resource.

.Click to show some details if you need a hint
[%collapsible]
====
----
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=nodes -otop --by=user
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=configmaps -otop --by=user
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=pods -otop --by=user
----
====

The data for `configmaps` and `pods` looks pretty spread out across a variety of users. There are no obvious bad actors.

But for `nodes`, there is one big outlier, and it aligns with the outlier we saw in the previous `--by=user` output.

So let's take a look at that specific user and see what they are doing. You can do this by passing in the `--user=` flag along with `--by=verb` and `-o top`.

Let's try to answer the following:

* What is the user doing?
* What is the problem?

.Click to show some details if you need a hint
[%collapsible]
====
----
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --user=system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent --by=verb -otop
----

What we see is very interesting:

. The majority are `GET` requests to the `/proxy/metrics` endpoint of every node.
. They're all returning a `403` error.

----
Top 10 "GET" (of 440076 total hits):
8313x [ 274.335µs] [403-8312] /api/v1/nodes/cluster-app-38.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
8309x [ 272.092µs] [403-8308] /api/v1/nodes/cluster-app-25.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
8308x [ 270.327µs] [403-8307] /api/v1/nodes/cluster-app-02.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
----

The conclusion is that an issue with the sysdig monitoring component is causing it to fail authentication and spam the API server as it tries to collect metrics.
====
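
If you have access to the live cluster (these checks cannot run against the must-gather itself), a sketch of confirming the RBAC gap might look like this; `nodes/proxy` is the subresource behind `/api/v1/nodes/<node>/proxy/metrics`:

[source,bash]
----
# Can the service account actually reach the node proxy endpoint?
oc auth can-i get nodes/proxy \
  --as=system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent
# Inspect which cluster roles, if any, the service account is bound to:
oc get clusterrolebindings -o wide | grep sysdig-agent
----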

I hope you found this introduction to the `kubectl-dev_tool` useful and can leverage it the next time you have an issue!

[TIP]
====
You don't need to have an overloaded API or a performance issue to take a look at the audit logs.
The audit logs and `kubectl-dev_tool` are equally useful if you want to understand who or what did something in your cluster.
Something deleted your pod? That's in the audit logs! Use the `kubectl-dev_tool` to find out who did it and when!
====
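
As a parting sketch, reusing only the flags we exercised in this module, a search for pod deletions grouped by user might look like this:

[source,bash]
----
# Who deleted pods, and how often:
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=delete --resource=pods --by=user -otop
----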
