From d37372c3636031f93a7dc8f1cb56ae1ec1adfca6 Mon Sep 17 00:00:00 2001
From: Karl Cardenas <29551334+karl-cardenas-coding@users.noreply.github.com>
Date: Thu, 27 Jun 2024 11:57:33 -0700
Subject: [PATCH] docs: DOC-1261 cilium issue (#3171)
* docs: DOC-1261 cilium issue
* docs: updated known issue
* docs: Apply suggestions from code review
Co-authored-by: Adelina Simion <43963729+addetz@users.noreply.github.com>
* chore: fix format
---------
Co-authored-by: Adelina Simion <43963729+addetz@users.noreply.github.com>
---
docs/docs-content/integrations/cilium.md | 56 +++++++++++++++++--
.../release-notes/known-issues.md | 53 +++++++++---------
2 files changed, 77 insertions(+), 32 deletions(-)
diff --git a/docs/docs-content/integrations/cilium.md b/docs/docs-content/integrations/cilium.md
index 68c0b05bc5..f77a25ff0d 100644
--- a/docs/docs-content/integrations/cilium.md
+++ b/docs/docs-content/integrations/cilium.md
@@ -22,19 +22,25 @@ policies are applied and updated independent of the application code or containe
The Cilium agent runs on all clusters and servers to provide networking, security and observability to the workload
running on that node.
-## Prerequisite
+## Versions Supported
-- If the user is going for the BYO (Bring your own) Operating system use case then, HWE (Hardware Enabled) Kernel or a
- Kernel that supports [eBPF](https://ebpf.io/) modules needs to be provisioned.
+
+
-**Palette OS images are by default provisioned with the above pre-requisite.**
+## Prerequisite
-## Versions Supported
+- If you are using Bring Your Own Operating System (BYOOS), then a Hardware Enablement (HWE) kernel or a kernel that
+  supports [eBPF](https://ebpf.io/) modules must be provisioned. A quick check for eBPF support is shown below.
-
+
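+You can quickly verify whether a kernel was built with eBPF support before using it with BYOOS. The following check is
+a minimal sketch that assumes the kernel build configuration is available at `/boot/config-$(uname -r)`, as it is on
+most Ubuntu-based images.
+
+```shell
+# Print the eBPF-related kernel build options. Both should report "=y".
+grep -E "CONFIG_BPF=|CONFIG_BPF_SYSCALL=" /boot/config-$(uname -r)
+```
+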
@@ -44,6 +50,44 @@ All versions below version 1.14.x are deprecated. We recommend you to upgrade to
+## Troubleshooting
+
+Review the following common issues and solutions when using the Cilium network pack.
+
+### I/O Timeout Error on VMware
+
+If you are deploying a cluster to a VMware environment using the VXLAN tunnel protocol, you may encounter I/O timeout
+errors. This is due to a known bug in the VMXNET3 adapter that results in VXLAN traffic being dropped. You can learn
+more about this issue in Cilium's [GitHub issue #21801](https://github.com/cilium/cilium/issues/21801).
+
+You can work around the issue by using one of the following two methods. A quick way to verify either workaround is
+shown after the options.
+
+- Option 1: Set a different tunnel protocol in the Cilium configuration. You can set the tunnel protocol to `geneve`.
+
+ ```yaml
+ charts:
+ cilium:
+ tunnelProtocol: "geneve"
+ ```
+
+- Option 2: Modify the Operating System (OS) layer of your cluster profile to automatically disable UDP Segmentation
+ Offloading (USO).
+
+  ```yaml
+  kubeadmconfig:
+    preKubeadmCommands:
+      # Disable hardware segmentation offloading due to VMXNET3 issue
+      - |
+        install -m 0755 /dev/null /usr/lib/networkd-dispatcher/routable.d/10-disable-offloading
+        cat <<EOF > /usr/lib/networkd-dispatcher/routable.d/10-disable-offloading
+        #!/bin/sh
+        ethtool -K eth0 tx-udp_tnl-segmentation off
+        ethtool -K eth0 tx-udp_tnl-csum-segmentation off
+        ethtool --offload eth0 rx off tx off
+        EOF
+        systemctl restart systemd-networkd
+  ```
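+
+Whichever option you choose, you can verify the result after the cluster nodes are up. The commands below are a minimal
+sketch: they assume you have `kubectl` access to the cluster and shell access to a node whose primary interface is named
+`eth0`, and the exact ConfigMap key names can vary between Cilium versions.
+
+```shell
+# Option 1: confirm the tunnel protocol Cilium is configured with (run where kubectl is available).
+kubectl --namespace kube-system get configmap cilium-config --output yaml | grep -i tunnel
+
+# Option 2: confirm that UDP tunnel segmentation offloads are disabled (run on the node itself).
+ethtool -k eth0 | grep tx-udp
+```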
+
## References
- [Cilium Documentation](https://docs.cilium.io/en/stable)
diff --git a/docs/docs-content/release-notes/known-issues.md b/docs/docs-content/release-notes/known-issues.md
index 88deee3f86..0f659fe4ed 100644
--- a/docs/docs-content/release-notes/known-issues.md
+++ b/docs/docs-content/release-notes/known-issues.md
@@ -14,32 +14,33 @@ to review and stay informed about the status of known issues in Palette. As issu
The following table lists all known issues that are currently active and affecting users.
-| Description | Workaround | Publish Date | Product Component |
-| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------- | ----------------- |
-| When you upgrade VerteX from version 4.3.x to 4.4.x, a few system pods may remain unhealthy and experience _CrashLoopBackOff_ errors. This issue only impacts VMware vSphere-based installations and occurs because the internal Mongo DNS is incorrectly configured in the configserver ConfigMap. | Refer to the [Mongo DNS Configmap Value is Incorrect](../troubleshooting/palette-upgrade.md#mongo-dns-configmap-value-is-incorrect) troubleshooting guide for detailed workaround steps. This issue may also impact Enterprise Cluster backup operations. | June 24, 2024 | VerteX |
-| [Sonobuoy](../clusters/cluster-management/compliance-scan.md#conformance-testing) scans fail to generate reports on airgapped Palette Edge clusters. | No workaround is available. | June 24, 2024 | Edge |
-| Clusters configured with OpenID Connect (OIDC) at the Kubernetes layer encounter issues when authenticating with the [non-admin Kubeconfig file](../clusters/cluster-management/kubeconfig.md#cluster-admin). Kubeconfig files using OIDC to authenticate will not work if the SSL certificate is set at the OIDC provider level. | Use the admin Kubeconfig file to authenticate with the cluster, as it does not use OIDC to authenticate. | June 21, 2024 | Clusters |
-| During the platform upgrade from Palette 4.3 to 4.4, Virtual Clusters may encounter a scenario where the pod `palette-controller-manager` is not upgraded to the newer version of Palette. The virtual cluster will continue to be operational, and this does not impact its functionality. | Refer to the [Controller Manager Pod Not Upgraded](../troubleshooting/palette-dev-engine.md#scenario---controller-manager-pod-not-upgraded) troubleshooting guide. | June 15, 2024 | Virtual Clusters |
-| The VerteX enterprise cluster is unable to complete backup operations. | No workaround is available. | June 15, 2024 | VerteX |
-| Edge hosts with FIPS-compliant RHEL Operating System (OS) distribution may encounter the error where the `systemd-resolved.service` service enters the **failed** state. This prevents the nameserver from being configured, which will result in cluster deployment failure. | Refer to [TroubleShooting](../troubleshooting/edge.md#scenario---systemd-resolvedservice-enters-failed-state) for a workaround. | June 15, 2024 | Edge |
-| The GKE cluster's Kubernetes pods are failing to start because the Kubernetes patch version is unavailable. This is encountered during pod restarts or node scaling operations. | Deploy a new cluster and use a GKE cluster profile that does not contain a Kubernetes pack layer with a patch version. Migrate the workloads from the existing cluster to the new cluster. This is a breaking change introduced in Palette 4.4.0 | June 15, 2024 | Packs, Clusters |
-| An issue prevents RKE2 and Palette eXtended Kubernetes (PXK) on version 1.29.4 from operating correctly with Canonical MAAS. | A temporary workaround is using a version lower than 1.29.4 when using MAAS. | June 15, 2024 | Packs, Clusters |
-| [MicroK8s](../integrations/microk8s.md) does not support multi-node control plane clusters. The upgrade strategy, `InPlaceUpgrade`, is the only option available for use. | No workaround is available. | June 15, 2024 | Packs |
-| Clusters using [MicroK8s](../integrations/microk8s.md) as the Kubernetes distribution, the control plane node fails to upgrade when using the `InPlaceUpgrade` strategy for sequential upgrades, such as upgrading from version 1.25.x to version 1.26.x and then to version 1.27.x. | Refer to the [Control Plane Node Fails to Upgrade in Sequential MicroK8s Upgrades](../troubleshooting/pack-issues.md) troubleshooting guide for resolution steps. | June 15, 2024 | Packs |
-| Azure IaaS clusters are having issues with deployed load balancers and ingress deployments when using Kubernetes versions 1.29.0 and 1.29.4. Incoming connections time out as a result due to a lack of network path inside the cluster. Azure AKS clusters are not impacted. | Use a Kubernetes version lower than 1.29.0 | June 12, 2024 | Clusters |
-| OIDC integration with Virtual Clusters is not functional. All other operations related to Virtual Clusters are operational. | No workaround is available. | Jun 11, 2024 | Virtual Clusters |
-| The VerteX enterprise cluster is unable to complete backup operations. | No workaround is available. | June 6, 2024 | VerteX |
-| Deploying self-hosted Palette or VerteX to a vSphere environment fails if vCenter has standalone hosts directly under a Datacenter. Persistent Volume (PV) provisioning fails due to an upstream issue with the vSphere Container Storage Interface (CSI) for all versions before v3.2.0. Palette and VerteX use the vSphere CSI version 3.1.2 internally. The issue may also occur in workload clusters deployed on vSphere using the same vSphere CSI for storage volume provisioning. | If you encounter the following error message when deploying self-hosted Palette or VerteX: `'ProvisioningFailed failed to provision volume with StorageClass "spectro-storage-class". Error: failed to fetch hosts from entity ComputeResource:domain-xyz` then use the following workaround. Remove standalone hosts directly under the Datacenter from vCenter and allow the volume provisioning to complete. After the volume is provisioned, you can add the standalone hosts back. You can also use a service account that does not have access to the standalone hosts as the user that deployed Palette. | May 21, 2024 | Self-Hosted |
-| Conducting cluster node scaling operations on a cluster undergoing a backup can lead to issues and potential unresponsiveness. | To avoid this, ensure no backup operations are in progress before scaling nodes or performing other cluster operations that change the cluster state | April 14, 2024 | Clusters |
-| Palette automatically creates an AWS security group for worker nodes using the format `-node`. If a security group with the same name already exists in the VPC, the cluster creation process fails. | To avoid this, ensure that no security group with the same name exists in the VPC before creating a cluster. | April 14, 2024 | Clusters |
-| K3s version 1.27.7 has been marked as _Deprecated_. This version has a known issue that causes clusters to crash. | Upgrade to a newer version of K3s to avoid the issue, such as versions 1.26.12, 1.28.5, and 1.27.11. You can learn more about the issue in the [K3s GitHub issue](https://github.com/k3s-io/k3s/issues/9047) page. | April 14, 2024 | Packs, Clusters |
-| When deploying a multi-node AWS EKS cluster with the Container Network Interface (CNI) [Calico](../integrations/calico.md), the cluster deployment fails. | A workaround is to use the AWS VPC CNI in the interim while the issue is resolved. | April 14, 2024 | Packs, Clusters |
-| If a Kubernetes cluster deployed onto VMware is deleted, and later re-created with the same name, the cluster creation process fails. The issue is caused by existing resources remaining inside the PCG, or the System PCG, that are not cleaned up during the cluster deletion process. | Refer to the [VMware Resources Remain After Cluster Deletion](../troubleshooting/pcg.md#scenario---vmware-resources-remain-after-cluster-deletion) troubleshooting guide for resolution steps. | April 14, 2024 | Clusters |
-| In a VMware environment, self-hosted Palette instances do not receive a unique cluster ID when deployed, which can cause issues during a node repave event, such as a Kubernetes version upgrade. Specifically, Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) will experience start problems due to the lack of a unique cluster ID. | To resolve this issue, refer to the [Volume Attachment Errors Volume in VMware Environment](../troubleshooting/palette-upgrade.md#volume-attachment-errors-volume-in-vmware-environment) troubleshooting guide. | April 14, 2024 | Self-Hosted |
-| Day-2 operations related to infrastructure changes, such as modifying the node size and count, when using MicroK8s are not taking effect. | No workaround is available. | April 14, 2024 | Packs, Clusters |
-| If a cluster that uses the Rook-Ceph pack experiences network issues, it's possible for the file mount to become and remain unavailable even after the network is restored. | This a known issue disclosed in the [Rook GitHub repository](https://github.com/rook/rook/issues/13818). To resolve this issue, refer to [Rook-Ceph](../integrations/rook-ceph.md#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) pack documentation. | April 14, 2024 | Packs, Edge |
-| Edge clusters on Edge hosts with ARM64 processors may experience instability issues that cause cluster failures. | ARM64 support is limited to a specific set of Edge devices. Currently, Nvidia Jetson devices are supported. | April 14, 2024 | Edge |
-| During the cluster provisioning process of new edge clusters, the Palette webhook pods may not always deploy successfully, causing the cluster to be stuck in the provisioning phase. This issue does not impact deployed clusters. | Review the [Palette Webhook Pods Fail to Start](../troubleshooting/edge.md#scenario---palette-webhook-pods-fail-to-start) troubleshooting guide for resolution steps. | April 14, 2024 | Edge |
+| Description | Workaround | Publish Date | Product Component |
+| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------- | --------------------- |
+| Clusters using Cilium and deployed to VMware environments with the VXLAN tunnel protocol may encounter I/O timeout errors. This issue is caused by a known bug in the VMXNET3 adapter that results in VXLAN traffic being dropped. You can learn more about this issue in Cilium's [GitHub issue #21801](https://github.com/cilium/cilium/issues/21801). | Review the [Cilium Troubleshooting](../integrations/cilium.md#io-timeout-error-on-vmware) section for workarounds. | June 27, 2024 | Packs, Clusters, Edge |
+| When you upgrade VerteX from version 4.3.x to 4.4.x, a few system pods may remain unhealthy and experience _CrashLoopBackOff_ errors. This issue only impacts VMware vSphere-based installations and occurs because the internal Mongo DNS is incorrectly configured in the configserver ConfigMap. | Refer to the [Mongo DNS Configmap Value is Incorrect](../troubleshooting/palette-upgrade.md#mongo-dns-configmap-value-is-incorrect) troubleshooting guide for detailed workaround steps. This issue may also impact Enterprise Cluster backup operations. | June 24, 2024 | VerteX |
+| [Sonobuoy](../clusters/cluster-management/compliance-scan.md#conformance-testing) scans fail to generate reports on airgapped Palette Edge clusters. | No workaround is available. | June 24, 2024 | Edge |
+| Clusters configured with OpenID Connect (OIDC) at the Kubernetes layer encounter issues when authenticating with the [non-admin Kubeconfig file](../clusters/cluster-management/kubeconfig.md#cluster-admin). Kubeconfig files using OIDC to authenticate will not work if the SSL certificate is set at the OIDC provider level. | Use the admin Kubeconfig file to authenticate with the cluster, as it does not use OIDC to authenticate. | June 21, 2024 | Clusters |
+| During the platform upgrade from Palette 4.3 to 4.4, Virtual Clusters may encounter a scenario where the pod `palette-controller-manager` is not upgraded to the newer version of Palette. The virtual cluster will continue to be operational, and this does not impact its functionality. | Refer to the [Controller Manager Pod Not Upgraded](../troubleshooting/palette-dev-engine.md#scenario---controller-manager-pod-not-upgraded) troubleshooting guide. | June 15, 2024 | Virtual Clusters |
+| The VerteX enterprise cluster is unable to complete backup operations. | No workaround is available. | June 15, 2024 | VerteX |
+| Edge hosts with a FIPS-compliant RHEL Operating System (OS) distribution may encounter an error where the `systemd-resolved.service` service enters the **failed** state. This prevents the nameserver from being configured, which will result in cluster deployment failure. | Refer to the [Troubleshooting](../troubleshooting/edge.md#scenario---systemd-resolvedservice-enters-failed-state) guide for a workaround. | June 15, 2024 | Edge |
+| The GKE cluster's Kubernetes pods are failing to start because the Kubernetes patch version is unavailable. This is encountered during pod restarts or node scaling operations. | Deploy a new cluster and use a GKE cluster profile that does not contain a Kubernetes pack layer with a patch version. Migrate the workloads from the existing cluster to the new cluster. This is a breaking change introduced in Palette 4.4.0 | June 15, 2024 | Packs, Clusters |
+| An issue prevents RKE2 and Palette eXtended Kubernetes (PXK) on version 1.29.4 from operating correctly with Canonical MAAS. | A temporary workaround is using a version lower than 1.29.4 when using MAAS. | June 15, 2024 | Packs, Clusters |
+| [MicroK8s](../integrations/microk8s.md) does not support multi-node control plane clusters. The upgrade strategy, `InPlaceUpgrade`, is the only option available for use. | No workaround is available. | June 15, 2024 | Packs |
+| In clusters that use [MicroK8s](../integrations/microk8s.md) as the Kubernetes distribution, the control plane node fails to upgrade when using the `InPlaceUpgrade` strategy for sequential upgrades, such as upgrading from version 1.25.x to version 1.26.x and then to version 1.27.x. | Refer to the [Control Plane Node Fails to Upgrade in Sequential MicroK8s Upgrades](../troubleshooting/pack-issues.md) troubleshooting guide for resolution steps. | June 15, 2024 | Packs |
+| Azure IaaS clusters encounter issues with deployed load balancers and ingress deployments when using Kubernetes versions 1.29.0 and 1.29.4. As a result, incoming connections time out because there is no network path inside the cluster. Azure AKS clusters are not impacted. | Use a Kubernetes version lower than 1.29.0. | June 12, 2024 | Clusters |
+| OIDC integration with Virtual Clusters is not functional. All other operations related to Virtual Clusters are operational. | No workaround is available. | Jun 11, 2024 | Virtual Clusters |
+| The VerteX enterprise cluster is unable to complete backup operations. | No workaround is available. | June 6, 2024 | VerteX |
+| Deploying self-hosted Palette or VerteX to a vSphere environment fails if vCenter has standalone hosts directly under a Datacenter. Persistent Volume (PV) provisioning fails due to an upstream issue with the vSphere Container Storage Interface (CSI) for all versions before v3.2.0. Palette and VerteX use the vSphere CSI version 3.1.2 internally. The issue may also occur in workload clusters deployed on vSphere using the same vSphere CSI for storage volume provisioning. | If you encounter the following error message when deploying self-hosted Palette or VerteX: `'ProvisioningFailed failed to provision volume with StorageClass "spectro-storage-class". Error: failed to fetch hosts from entity ComputeResource:domain-xyz` then use the following workaround. Remove standalone hosts directly under the Datacenter from vCenter and allow the volume provisioning to complete. After the volume is provisioned, you can add the standalone hosts back. You can also use a service account that does not have access to the standalone hosts as the user that deployed Palette. | May 21, 2024 | Self-Hosted |
+| Conducting cluster node scaling operations on a cluster undergoing a backup can lead to issues and potential unresponsiveness. | To avoid this, ensure no backup operations are in progress before scaling nodes or performing other cluster operations that change the cluster state | April 14, 2024 | Clusters |
+| Palette automatically creates an AWS security group for worker nodes using the format `-node`. If a security group with the same name already exists in the VPC, the cluster creation process fails. | To avoid this, ensure that no security group with the same name exists in the VPC before creating a cluster. | April 14, 2024 | Clusters |
+| K3s version 1.27.7 has been marked as _Deprecated_. This version has a known issue that causes clusters to crash. | Upgrade to a newer version of K3s to avoid the issue, such as versions 1.26.12, 1.28.5, and 1.27.11. You can learn more about the issue in the [K3s GitHub issue](https://github.com/k3s-io/k3s/issues/9047) page. | April 14, 2024 | Packs, Clusters |
+| When deploying a multi-node AWS EKS cluster with the Container Network Interface (CNI) [Calico](../integrations/calico.md), the cluster deployment fails. | A workaround is to use the AWS VPC CNI in the interim while the issue is resolved. | April 14, 2024 | Packs, Clusters |
+| If a Kubernetes cluster deployed onto VMware is deleted, and later re-created with the same name, the cluster creation process fails. The issue is caused by existing resources remaining inside the PCG, or the System PCG, that are not cleaned up during the cluster deletion process. | Refer to the [VMware Resources Remain After Cluster Deletion](../troubleshooting/pcg.md#scenario---vmware-resources-remain-after-cluster-deletion) troubleshooting guide for resolution steps. | April 14, 2024 | Clusters |
+| In a VMware environment, self-hosted Palette instances do not receive a unique cluster ID when deployed, which can cause issues during a node repave event, such as a Kubernetes version upgrade. Specifically, Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) will experience start problems due to the lack of a unique cluster ID. | To resolve this issue, refer to the [Volume Attachment Errors Volume in VMware Environment](../troubleshooting/palette-upgrade.md#volume-attachment-errors-volume-in-vmware-environment) troubleshooting guide. | April 14, 2024 | Self-Hosted |
+| Day-2 operations related to infrastructure changes, such as modifying the node size and count, when using MicroK8s are not taking effect. | No workaround is available. | April 14, 2024 | Packs, Clusters |
+| If a cluster that uses the Rook-Ceph pack experiences network issues, it's possible for the file mount to become and remain unavailable even after the network is restored. | This is a known issue disclosed in the [Rook GitHub repository](https://github.com/rook/rook/issues/13818). To resolve this issue, refer to the [Rook-Ceph](../integrations/rook-ceph.md#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) pack documentation. | April 14, 2024 | Packs, Edge |
+| Edge clusters on Edge hosts with ARM64 processors may experience instability issues that cause cluster failures. | ARM64 support is limited to a specific set of Edge devices. Currently, Nvidia Jetson devices are supported. | April 14, 2024 | Edge |
+| During the cluster provisioning process of new edge clusters, the Palette webhook pods may not always deploy successfully, causing the cluster to be stuck in the provisioning phase. This issue does not impact deployed clusters. | Review the [Palette Webhook Pods Fail to Start](../troubleshooting/edge.md#scenario---palette-webhook-pods-fail-to-start) troubleshooting guide for resolution steps. | April 14, 2024 | Edge |
## Resolved Known Issues