docs: DOC-1500: "Too many open files" Troubleshooting (#4967)
* Initial commit for too many open files error

* Additional troubleshooting tweaks; fixed incorrect CI/CD reference

* Fixed broken links due to updated heading

* Adjusted PCG verbiage and added x-ref

* Fixed code block indentation

* ci: auto-formatting prettier issues

* Incorporating suggestions from Ben

* Incorporates additional suggestions per Carolina

* ci: auto-formatting prettier issues

---------

Co-authored-by: achuribooks <[email protected]>
achuribooks and achuribooks authored Dec 11, 2024
1 parent 99a65d0 commit 620b471
Showing 9 changed files with 64 additions and 10 deletions.
2 changes: 1 addition & 1 deletion docs/docs-content/automation/palette-cli/palette-cli.md
@@ -7,7 +7,7 @@ tags: ["palette-cli"]
---

The Palette CLI contains various functionalities that you can use to interact with Palette and manage resources. The
Palette CLI is well suited for Continuous Delivery/Continuous Deployment (CI/CD) pipelines and recommended for
Palette CLI is well suited for Continuous Integration/Continuous Deployment (CI/CD) pipelines and recommended for
automation tasks, where Terraform or direct API queries are not ideal.

To get started with the Palette CLI, check out the [Install](install-palette-cli.md) guide.
@@ -57,7 +57,7 @@ Palette 4.0 includes the following major enhancements that require user interven
A known issue impacts all self-hosted Palette instances older than 4.4.14. Before upgrading a Palette instance with
version older than 4.4.14, ensure that you execute a utility script to make all your cluster IDs unique in your
Persistent Volume Claim (PVC) metadata. For more information, refer to the
[Troubleshooting Guide](../../troubleshooting/enterprise-install.md#non-unique-vsphere-cns-mapping).
[Troubleshooting Guide](../../troubleshooting/enterprise-install.md#scenario---non-unique-vsphere-cns-mapping).

:::

@@ -17,7 +17,7 @@ details.

If you are upgrading from a Palette version that is older than 4.4.14, ensure that you have executed the utility script
to make the CNS mapping unique for the associated PVC. For more information, refer to the
[Troubleshooting guide](../../../troubleshooting/enterprise-install.md#non-unique-vsphere-cns-mapping).
[Troubleshooting guide](../../../troubleshooting/enterprise-install.md#scenario---non-unique-vsphere-cns-mapping).

:::

@@ -16,7 +16,7 @@ version available. Refer to the [Supported Upgrade Paths](../upgrade.md#supporte

If you are upgrading from a Palette version that is older than 4.4.14, ensure that you have executed the utility script
to make the CNS mapping unique for the associated PVC. For more information, refer to the
[Troubleshooting guide](../../../troubleshooting/enterprise-install.md#non-unique-vsphere-cns-mapping).
[Troubleshooting guide](../../../troubleshooting/enterprise-install.md#scenario---non-unique-vsphere-cns-mapping).

:::

2 changes: 1 addition & 1 deletion docs/docs-content/release-notes/known-issues.md
@@ -33,7 +33,7 @@ The following table lists all known issues that are currently active and affecti
| If an Edge host operating a cluster in connected mode loses connection to Palette, the cluster will not auto-renew its Public Key Infrastructure (PKI) certificates. When it re-establishes the connection to Palette, the Edge host will renew the certificates if the existing certificates have less than 30 days before expiry. | No workaround available. | September 14, 2024 | Edge |
| Using the Flannel Container Network Interface (CNI) pack together with a Red Hat Enterprise Linux (RHEL)-based provider image may lead to a pod becoming stuck during deployment. This is caused by an upstream issue with Flannel that was discovered in a K3s GitHub issue. Refer to [the K3s issue page](https://github.com/k3s-io/k3s/issues/5013) for more information. | No workaround is available | September 14, 2024 | Edge |
| Palette OVA import operations fail if the VMO cluster is using a storageClass with the volume bind method `WaitForFirstConsumer`. | Refer to the [OVA Imports Fail Due To Storage Class Attribute](../troubleshooting/vmo-issues.md#scenario---ova-imports-fail-due-to-storage-class-attribute) troubleshooting guide for workaround steps. | September 13, 2024 | Palette CLI, VMO |
| Persistent Volume Claims (PVCs) metadata do not use a unique identifier for self-hosted Palette clusters. This causes incorrect Cloud Native Storage (CNS) mappings in vSphere, potentially leading to issues during node operations and cluster upgrades. | Refer to the [Troubleshooting section](../troubleshooting/enterprise-install.md#non-unique-vsphere-cns-mapping) for guidance. | September 13, 2024 | Self-hosted |
| Persistent Volume Claims (PVCs) metadata do not use a unique identifier for self-hosted Palette clusters. This causes incorrect Cloud Native Storage (CNS) mappings in vSphere, potentially leading to issues during node operations and cluster upgrades. | Refer to the [Troubleshooting section](../troubleshooting/enterprise-install.md#scenario---non-unique-vsphere-cns-mapping) for guidance. | September 13, 2024 | Self-hosted |
| Third-party binaries downloaded and used by the Palette CLI may become stale and incompatible with the CLI. | Refer to the [Incompatible Stale Palette CLI Binaries](../troubleshooting/automation.md#scenario---incompatible-stale-palette-cli-binaries) troubleshooting guide for workaround guidance. | September 11, 2024 | CLI |
| An issue with Edge hosts using [Trusted Boot](../clusters/edge/trusted-boot/trusted-boot.md) and encrypted drives occurs when TRIM is not enabled. As a result, Solid-State Drive and Nonvolatile Memory Express drives experience degraded performance and potentially cause cluster failures. This [issue](https://github.com/kairos-io/kairos/issues/2693) stems from [Kairos](https://kairos.io/) not passing through the `--allow-discards` flag to the `systemd-cryptsetup attach` command. | Check out the [Degraded Performance on Disk Drives](../troubleshooting/edge.md#scenario---degreated-performance-on-disk-drives) troubleshooting guide for guidance on workaround. | September 4, 2024 | Edge |
| The AWS CSI pack has a [Pod Disruption Budget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) (PDB) that allows for a maximum of one unavailable pod. This behavior causes an issue for single-node clusters as well as clusters with a single control plane node and a single worker node where the control plane lacks worker capability. [Operating System (OS) patch](../clusters/cluster-management/os-patching.md) updates may attempt to evict the CSI controller without success, resulting in the node remaining in the un-schedulable state. | If OS patching is enabled, allow the control plane nodes to have worker capability. For single-node clusters, turn off the OS patching feature. | September 4, 2024 | Cluster, Packs |
58 changes: 56 additions & 2 deletions docs/docs-content/troubleshooting/enterprise-install.md
@@ -10,7 +10,7 @@ tags: ["troubleshooting", "self-hosted", "palette", "vertex"]

Refer to the following sections to troubleshoot errors encountered when installing an Enterprise Cluster.

## Scenario - Self-linking Error
## Scenario - Self-Linking Error

When installing an Enterprise Cluster, you may encounter an error stating that the enterprise cluster is unable to
self-link. Self-linking is the process of Palette or VerteX becoming aware of the Kubernetes cluster it is installed on.
@@ -78,7 +78,7 @@ following steps to restart the management pod.
pod "mgmt-f7f97f4fd-lds69" deleted
```

## Non-unique vSphere CNS Mapping
## Scenario - Non-Unique vSphere CNS Mapping

In Palette and VerteX releases 4.4.8 and earlier, Persistent Volume Claims (PVCs) metadata do not use a unique
identifier for self-hosted Palette clusters. This causes incorrect Cloud Native Storage (CNS) mappings in vSphere,
@@ -156,3 +156,57 @@ automatically resolve this issue. If you have self-hosted instances of Palette i
Events: <none>
```

## Scenario - "Too Many Open Files" in Cluster

When viewing logs for Enterprise or [Private Cloud Gateway](../clusters/pcg/pcg.md) clusters, you may encounter a "too
many open files" error, which stops log tailing after a certain point. To resolve this issue, you must increase the
maximum number of file descriptors on each node in your cluster.

### Debug Steps

Repeat the following process for each node in your cluster.

1. Log in to a node in your cluster.

```bash
ssh -i <key-name> <spectro@hostname>
```

2. Switch to `sudo` mode using the command that best fits your system and preferences.

```bash
sudo --login
```

3. Increase the maximum number of file descriptors that the kernel can allocate system-wide.

```bash
echo "fs.file-max = 1000000" > /etc/sysctl.d/99-maxfiles.conf
```

4. Apply the updated `sysctl` settings. The command prints the new limit.

```bash
sysctl -p /etc/sysctl.d/99-maxfiles.conf
```

```bash hideClipboard
fs.file-max = 1000000
```

5. Restart the `kubelet` and `containerd` services.

```bash
systemctl restart kubelet containerd
```

6. Confirm that the change was applied.

```bash
sysctl fs.file-max
```

```bash hideClipboard
fs.file-max = 1000000
```
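
Before or after applying the steps above, it can help to see how close a node is to its system-wide file handle limit.
The following is a minimal sketch, not part of the original guide. It assumes a Linux node that exposes
`/proc/sys/fs/file-nr`, which reports the allocated handles, the unused handles, and the current `fs.file-max` value.
The `fd_usage_percent` function name is hypothetical and used only for illustration.

```bash
#!/usr/bin/env bash
# Sketch: report system-wide file handle usage as a percentage of fs.file-max.
# fd_usage_percent is a hypothetical helper name, not a Palette command.

# Takes one line in the format of /proc/sys/fs/file-nr:
#   "<allocated> <unused> <max>"
# and prints allocated/max as an integer percentage.
fd_usage_percent() {
  # Intentionally unquoted: split the line on whitespace into $1 $2 $3.
  set -- $1
  if [ "$#" -ne 3 ] || [ "$3" -eq 0 ]; then
    echo "expected: <allocated> <unused> <max>" >&2
    return 1
  fi
  echo $(( $1 * 100 / $3 ))
}

# On a Linux node, read the live counters when they are available.
if [ -r /proc/sys/fs/file-nr ]; then
  echo "File handle usage: $(fd_usage_percent "$(cat /proc/sys/fs/file-nr)")%"
fi
```

For example, `fd_usage_percent "500000 0 1000000"` prints `50`. A value that stays near 100 after raising the limit
may suggest that the configured maximum is still too low for the workload.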
2 changes: 1 addition & 1 deletion docs/docs-content/vertex/upgrade/upgrade-notes.md
@@ -27,4 +27,4 @@ troubleshooting guide for resolution steps.
A known issue impacts all self-hosted Palette instances older than 4.4.14. Before upgrading a Palette instance with
version older than 4.4.14, ensure that you execute a utility script to make all your cluster IDs unique in your
Persistent Volume Claim (PVC) metadata. For more information, refer to the
[Troubleshooting Guide](../../troubleshooting/enterprise-install.md#non-unique-vsphere-cns-mapping).
[Troubleshooting Guide](../../troubleshooting/enterprise-install.md#scenario---non-unique-vsphere-cns-mapping).
2 changes: 1 addition & 1 deletion docs/docs-content/vertex/upgrade/upgrade-vmware/airgap.md
@@ -17,7 +17,7 @@ section for details.

If you are upgrading from a Palette VerteX version that is older than 4.4.14, ensure that you have executed the utility
script to make the CNS mapping unique for the associated PVC. For more information, refer to the
[Troubleshooting guide](../../../troubleshooting/enterprise-install.md#non-unique-vsphere-cns-mapping).
[Troubleshooting guide](../../../troubleshooting/enterprise-install.md#scenario---non-unique-vsphere-cns-mapping).

:::

@@ -17,7 +17,7 @@ for details.

If you are upgrading from a Palette VerteX version that is older than 4.4.14, ensure that you have executed the utility
script to make the CNS mapping unique for the associated PVC. For more information, refer to the
[Troubleshooting guide](../../../troubleshooting/enterprise-install.md#non-unique-vsphere-cns-mapping).
[Troubleshooting guide](../../../troubleshooting/enterprise-install.md#scenario---non-unique-vsphere-cns-mapping).

:::
