[processor/k8sattributes] Handle resource deletion on DeletedFinalStateUnknown #27632
Comments
Pinging code owners for processor/k8sattributes: @dmitryax @rmfitzpatrick @fatsheep9146 @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself.
@c-kruse thanks for opening this issue. Do you plan to submit a fix?
@TylerHelmuth I would be glad to. May be a day or two out.
…27847)

**Description:** The k8s Go client's cache expects OnDelete handlers to handle objects of type DeletedFinalStateUnknown, which are delivered when the cache's watch mechanism misses a delete and only notices it later. This changes the processor to handle such deletes as if they were normal, rather than logging an error and dropping the change.

**Link to tracking issue:** #27632

**Testing:** Only what you see in the unit tests. I am open to suggestions, but I don't see this being a code path we can reasonably cover in the e2e test suite. Verified manually on a local kind cluster:

* Stood up two deployments loosely based on the e2e testing resources, one with a collector built from this branch and the other with docker.io/otel/opentelemetry-collector-contrib:latest.
* Both included an additional container in the collector pod that I used to fiddle with iptables rules.
* Added rules to reject traffic to/from the kube API server.
* Deleted some namespaces containing deployments generating telemetry.
* Restored connectivity by removing the iptables rules.
* Observed that the collector built from this branch was silent (aside from the junk the k8s client logs due to the broken connection).
* Observed that the latest ([0.87.0](https://hub.docker.com/layers/otel/opentelemetry-collector-contrib/0.87.0/images/sha256-77cdd395b828b09cb920c671966f09a87a40611aa6107443146086f2046f4a9a?context=explore)) collector logged a handful of errors for the deleted resources (api_v1.Pod and apps_v1.ReplicaSet; I probably just didn't wait long enough for Namespace):

```
2023-10-19T02:18:37.781Z  error  kube/client.go:236  object received was not of type api_v1.Pod  {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "received": {"Key":"src1/telemetrygen-patched-766d55cbcb-8zktr","Obj":{"metadata":{"name":"telemetrygen-patched-766d55cbcb-8zktr","namespace":"src1","uid":"be5d2268-c8b0-434d-b3b8-8b18083c7a8b","creationTimestamp":"2023-10-19T02:01:08Z","labels":{"app":"telemetrygen-patched","pod-template-hash":"766d55cbcb"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"telemetrygen-patched-766d55cbcb","uid":"a887d67a-d5d6-4269-b520-45dbb4f1cd82","controller":true,"blockOwnerDeletion":true}]},"spec":{"containers":[{"name":"telemetrygen","image":"localhost/telemetrygen:latest","resources":{}}],"nodeName":"manual-e2e-testing-control-plane"},"status":{"podIP":"10.244.0.56","startTime":"2023-10-19T02:01:08Z","containerStatuses":[{"name":"telemetrygen","state":{},"lastState":{},"ready":false,"restartCount":0,"image":"","imageID":"","containerID":"containerd://2821ef32cd8bf93a13414504c0f8f0c016c84be49d6ffdbd475d7e4681e90c51"}]}}}}
	github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.(*WatchClient).handlePodDelete
		github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/kube/client.go:236
	k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
		k8s.io/[email protected]/tools/cache/controller.go:253
	...

2023-10-19T02:19:03.970Z  error  kube/client.go:868  object received was not of type apps_v1.ReplicaSet  {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "received": {"Key":"src1/telemetrygen-stable-5c444bb8b8","Obj":{"metadata":{"name":"telemetrygen-stable-5c444bb8b8","namespace":"src1","uid":"d37707ff-b308-4339-8543-a1caf5705ea8","creationTimestamp":null,"ownerReferences":[{"apiVersion":"apps/v1","kind":"Deployment","name":"telemetrygen-stable","uid":"c421276e-e1bf-40c5-85e1-e92e30363da5","controller":true,"blockOwnerDeletion":true}]},"spec":{"selector":null,"template":{"metadata":{"creationTimestamp":null},"spec":{"containers":null}}},"status":{"replicas":0}}}}
	github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.(*WatchClient).handleReplicaSetDelete
		github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/kube/client.go:868
	k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
		k8s.io/[email protected]/tools/cache/controller.go:253
	k8s.io/client-go/tools/cache.(*processorListener).run.func1
		k8s.io/[email protected]/tools/cache/shared_informer.go:979
	k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	...
```

**Documentation:** N/A - it is not clear to me whether or not this should land in the changelog. Its impact on users is marginal.

Signed-off-by: Christian Kruse <[email protected]>
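To make the mechanics concrete, here is a minimal sketch of the client-go pattern the fix describes: unwrap a DeletedFinalStateUnknown tombstone, then proceed as with a normal delete. The handler and variable names below are illustrative only and are not the processor's actual code.

```go
package main

import (
	"fmt"

	api_v1 "k8s.io/api/core/v1"
	meta_v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

// handlePodDelete unwraps a DeletedFinalStateUnknown tombstone before the
// usual type assertion, so a delete the watch missed is handled like any
// other delete instead of being logged as an error and dropped.
func handlePodDelete(obj any) {
	if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
		obj = tombstone.Obj // last known state of the deleted object
	}
	pod, ok := obj.(*api_v1.Pod)
	if !ok {
		fmt.Printf("object received was not of type api_v1.Pod (%T)\n", obj)
		return
	}
	fmt.Printf("evicting %s/%s from the internal cache\n", pod.Namespace, pod.Name)
	// ... remove the pod from the processor's internal cache here ...
}

func main() {
	pod := &api_v1.Pod{ObjectMeta: meta_v1.ObjectMeta{Name: "telemetrygen", Namespace: "src1"}}

	// A delete the watch missed arrives wrapped in a tombstone; the handler
	// now treats it the same as a normal delete.
	handlePodDelete(cache.DeletedFinalStateUnknown{Key: "src1/telemetrygen", Obj: pod})
}
```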
Fixed by #27847
Component(s)
processor/k8sattributes
What happened?
Description
We develop a similar component in the Sumo Logic collector distribution, where this bug was discovered. It appears to exist here as well, although it is comparatively benign: it just leaks a bit of memory rather than panicking, as was the case in our processor.
Related issue: SumoLogic/sumologic-otel-collector#1277
Relevant doc snippet from pkg.go.dev/k8s.io/client-go:
Steps to Reproduce
With the collector running in a k8s cluster, create a network partition between the kube apiserver and the collector. Wait long enough for the open connections to break, then re-establish connectivity. You should see log entries from this and similar code paths: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/k8sattributesprocessor/internal/kube/client.go#L236
Happy to share info on the setup I used to reproduce this on our distro, but it likely won't be worth much more than inspiration for y'all.
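To illustrate the failure mode: a delete handler that type-asserts straight to the concrete type works for normal deletes, but when the watch misses a delete, client-go hands OnDelete a cache.DeletedFinalStateUnknown tombstone wrapping the last known state; the assertion fails and the entry is never removed. The sketch below uses illustrative names and is not the processor's actual code.

```go
package main

import (
	"fmt"

	api_v1 "k8s.io/api/core/v1"
	meta_v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

// naiveOnDelete mirrors the pre-fix handler shape: it assumes OnDelete always
// receives a *api_v1.Pod.
func naiveOnDelete(obj any) {
	if _, ok := obj.(*api_v1.Pod); !ok {
		fmt.Printf("object received was not of type api_v1.Pod (%T)\n", obj)
		return // the deleted pod stays in the processor's internal cache
	}
	fmt.Println("pod removed from cache")
}

func main() {
	pod := &api_v1.Pod{ObjectMeta: meta_v1.ObjectMeta{Name: "telemetrygen", Namespace: "src1"}}

	// Normal delete: the watch saw it, so OnDelete receives the pod itself.
	naiveOnDelete(pod)

	// Missed delete: a later re-list reports the deletion as a
	// DeletedFinalStateUnknown tombstone wrapping the last known state.
	// The naive handler logs an error and drops the event.
	naiveOnDelete(cache.DeletedFinalStateUnknown{Key: "src1/telemetrygen", Obj: pod})
}
```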
Expected Result
I think the resource should be deleted as usual.
Actual Result
An error is logged and the resource is left behind in the processor's cache.
Collector version
v0.86.0, v0.85.0
Environment information
N/A
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response