
xfsprogs in csi-driver container and on host do not match #2588

Closed
kaitimmer opened this issue Nov 7, 2024 · 11 comments

@kaitimmer

What happened:
When we try to mount a new XFS volume to a pod (via volumeclaimtemplate) we see the following error:

 AttachVolume.Attach succeeded for volume 'pvc-83d914c5-8359-4a5a-b659-ca6d46344792'
  Warning  FailedMount             7s (x7 over 51s)  kubelet                  MountVolume.MountDevice failed for volume 'pvc-83d914c5-8359-4a5a-b659-ca6d46344792' : rpc error: code = Internal desc = could not format /dev/disk/azure/scsi1/lun0(lun: 0), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/51acc19d37a450db470a345ce2ef9a54140278ef8cef648ba3d28f187913efbb/globalmount, failed with mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t xfs -o noatime,defaults,nouuid,defaults /dev/disk/azure/scsi1/lun0 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/51acc19d37a450db470a345ce2ef9a54140278ef8cef648ba3d28f187913efbb/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/51acc19d37a450db470a345ce2ef9a54140278ef8cef648ba3d28f187913efbb/globalmount: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.

What you expected to happen:

Mounting new volumes should just work.

How to reproduce it:

Create a new disk with the following StorageClass:

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  labels:
    kustomize.toolkit.fluxcd.io/name: kube-system
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: default-v2-xfs-noatime
  resourceVersion: "4669108598"
  uid: 7e0be630-073f-461c-9513-9dd3131f578c
mountOptions:
- noatime
- defaults
parameters:
  DiskIOPSReadWrite: "3000"
  DiskMBpsReadWrite: "125"
  cachingMode: None
  fstype: xfs
  storageaccounttype: PremiumV2_LRS
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

And mount it to a pod.
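
For completeness, a minimal consumer sketch (names and size are placeholders, not from our setup; any pod or StatefulSet volumeClaimTemplate requesting this StorageClass triggers the format):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xfs-test            # placeholder name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: default-v2-xfs-noatime
  resources:
    requests:
      storage: 16Gi         # placeholder size
---
apiVersion: v1
kind: Pod
metadata:
  name: xfs-test
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: xfs-test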

Anything else we need to know?:
When we log in to the aks node and execute dmesg we get the following information:

[42449.910213] XFS (sdb): Superblock has unknown incompatible features (0x20) enabled.
[42449.910217] XFS (sdb): Filesystem cannot be safely mounted by this kernel.
[42449.910228] XFS (sdb): SB validate failed with error -22.

This is the output of the xfs disk on the node:

root@aks-zone1node-35463436-vmss000011:/# xfs_db -r /dev/disk/azure/scsi1/lun8
xfs_db version
versionnum [0xbca5+0x18a] = V5,NLINK,DIRV2,ALIGN,LOGV2,EXTFLG,SECTOR,MOREBITS,ATTR2,LAZYSBCOUNT,PROJID32BIT,CRC,FTYPE,FINOBT,SPARSE_INODES,RMAPBT,REFLINK,INOBTCNT,BIGTIME

On other nodes with older xfs volumes, the mount still works. The differences of the xfs format are:

xfs_db version
versionnum [0xbca5+0x18a] = V5,NLINK,DIRV2,ALIGN,LOGV2,EXTFLG,SECTOR,MOREBITS,ATTR2,LAZYSBCOUNT,PROJID32BIT,CRC,FTYPE,FINOBT,SPARSE_INODES,REFLINK,INOBTCNT,BIGTIME

So, the new XFS volumes get the RMAPBT attribute, which apparently cannot be handled by the Ubuntu AKS node image.

Our workaround now is to log in to the Azure node and reformat the volume with 'mkfs.xfs -f /dev/sdX'.
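
For reference, a hedged version of that workaround; the -m rmapbt=0 only makes explicit that the reverse-mapping feature stays off, assuming the node's mkfs.xfs supports that option (this destroys any data on the disk, and /dev/sdX is a placeholder):

# WARNING: reformats the disk; only for volumes that hold no data yet
mkfs.xfs -f -m rmapbt=0 /dev/sdX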

Also, I assume that ffbeb55 might already be the hotfix for this. I'm raising this issue mainly for awareness, so that others do not spend hours tracking it down on their end.

I do not think this is the best way to fix this, but it'll do as a quick solution.

A proper solution might be to mount the tools directly from the host into the container so that this version mismatch cannot happen again, or to do it like the GCP driver does: https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/Dockerfile, which also prevents this.
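
A very rough sketch of the host-tools idea for the node DaemonSet (hypothetical snippet, names invented; the usual catch is that host binaries also need the host's shared libraries, which is why drivers that go this route typically chroot or nsenter into the host root rather than calling the binaries directly):

# hypothetical DaemonSet fragment: expose the host root so mkfs/mount can run via chroot/nsenter
containers:
  - name: azuredisk
    volumeMounts:
      - name: host-root
        mountPath: /host
volumes:
  - name: host-root
    hostPath:
      path: /
      type: Directory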

Can you please cut a new release that includes the fix?

Environment:

  • CSI Driver version: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.30.5
  • Kubernetes version (use kubectl version): 1.30.3
  • OS (e.g. from /etc/os-release): AKSUbuntu-2204gen2containerd-202410.15.0
@andyzhangx
Member

andyzhangx commented Nov 7, 2024

Thanks for providing the info, @kaitimmer. The problem is that the AKS node is still using the Ubuntu 5.15 kernel, which does not support the XFS RMAPBT attribute (kernel 6.x does), while the xfsprogs in the new Alpine 3.20.2 image formats new volumes with this RMAPBT attribute, which causes the incompatibility. I would rather revert to the old Alpine image.

If you have the AKS-managed CSI driver and want to revert to v1.30.4, just mail me, thanks.

Note that this bug only affects new XFS disks created with Azure Disk CSI driver v1.30.5.
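
A quick way to confirm the mismatch on an affected cluster (the DaemonSet and container names below are the usual AKS defaults and may differ in your deployment):

# host kernel on the node (5.15.x on the current AKS Ubuntu 22.04 images)
uname -r
# xfsprogs version shipped inside the node plugin container
kubectl -n kube-system exec ds/csi-azuredisk-node -c azuredisk -- mkfs.xfs -V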

@kaitimmer
Author

If you have the AKS-managed CSI driver and want to revert to v1.30.4, just mail me, thanks.

I reached out to you via email regarding our specific clusters.
Thanks for your help!

@monotek
Member

monotek commented Nov 7, 2024

@andyzhangx
Can you estimate how long we would need to stay on the old version of the CSI driver Docker image?
I guess that means CSI driver container updates would be disabled for the time being?
Would you advise using Azure Linux instead of Ubuntu?

@andyzhangx
Member

@andyzhangx Can you estimate how long we would need to stay on the old version of the CSI driver Docker image? I guess that means CSI driver container updates would be disabled for the time being? Would you advise using Azure Linux instead of Ubuntu?

@monotek we will upgrade to the Alpine base image 3.18.9, which also fixes the CVE; here is the PR: #2590
btw, Azure Linux or Ubuntu makes no difference: neither of these two AKS-supported node images supports the XFS RMAPBT attribute, since both are on kernel 5.15.

@monotek
Member

monotek commented Nov 7, 2024

Ah, ok, Thanks! :)

We were not sure about that as we saw Kernel 6.6 here too: https://github.com/microsoft/azurelinux/releases/tag/3.0.20240824-3.0

So I guess the AKS nodes would still use Azure Linux 2.x?

@andyzhangx
Member

andyzhangx commented Nov 7, 2024

Ah, ok, Thanks! :)

We were not sure about that as we saw Kernel 6.6 here too: https://github.com/microsoft/azurelinux/releases/tag/3.0.20240824-3.0

So I guess the AKS nodes would still use Azure Linux 2.x?

@monotek Azure Linux 3.x (preview) is on kernel 6.6, while Azure Linux 2.x is on kernel 5.15

  • non-working versions (Alpine 3.20):
/ # apk list xfsprogs
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.20/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.20/community: No such file or directory
xfsprogs-6.8.0-r0 x86_64 {xfsprogs} (LGPL-2.1-or-later) [installed]
/ # apk list xfsprogs-extra
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.20/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.20/community: No such file or directory
xfsprogs-extra-6.8.0-r0 x86_64 {xfsprogs} (LGPL-2.1-or-later) [installed]
  • working versions (Alpine 3.18):
/ # apk list xfsprogs
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/community: No such file or directory
xfsprogs-6.2.0-r2 x86_64 {xfsprogs} (LGPL-2.1-or-later) [installed]
/ # apk list xfsprogs-extra
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/community: No such file or directory
xfsprogs-extra-6.2.0-r2 x86_64 {xfsprogs} (LGPL-2.1-or-later) [installed]

and unfortunately we cannot downgrade xfsprogs to the older version on the newer Alpine base image:

#0 1.476 ERROR: unable to select packages:
#0 1.478   xfsprogs-6.8.0-r0:
#0 1.478     breaks: world[xfsprogs=6.2.0-r2]
#0 1.478     satisfies: xfsprogs-extra-6.8.0-r0[xfsprogs]
#0 1.479   xfsprogs-extra-6.8.0-r0:
#0 1.479     breaks: world[xfsprogs-extra=6.2.0-r2]
------
Dockerfile:16
--------------------
  15 |     FROM alpine:3.20.3
  16 | >>> RUN apk upgrade --available --no-cache && \
  17 | >>>     apk add --no-cache util-linux e2fsprogs e2fsprogs-extra ca-certificates udev xfsprogs=6.2.0-r2 xfsprogs-extra==6.2.0-r2 btrfs-progs btrfs-progs-extra
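
So the practical route is to go back to a base image where xfsprogs 6.2.x is still the default, roughly like this (a sketch only; the actual change is in PR #2590):

FROM alpine:3.18.9
RUN apk upgrade --available --no-cache && \
    apk add --no-cache util-linux e2fsprogs e2fsprogs-extra ca-certificates udev \
        xfsprogs xfsprogs-extra btrfs-progs btrfs-progs-extra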

@ctrmcubed

I appreciate the help from @andyzhangx in resetting azuredisk-csi to v1.30.4 on our cluster. However, we restart the cluster daily, and each restart upgrades it to v1.30.5 again.

Is there a permanent solution?

@andyzhangx
Member

@ctrmcubed the hotfix has now been rolled out completely in the northeurope and westeurope regions, please check. We will roll it out to the other regions next.

@andyzhangx
Member

andyzhangx commented Nov 20, 2024

btw, the issue only occurs when formatting a new XFS PVC disk (there is no data-loss risk here). Once your cluster has the fix (CSI driver v1.30.6 or v1.30.4), you need to delete the existing broken XFS PVCs and then create new PVCs with the fixed CSI driver version. (Only Azure Disk CSI driver v1.30.5 is broken here.)
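
For example (placeholder names; if the PVC comes from a StatefulSet volumeClaimTemplate, scale the workload down first so the PVC can be released and deleted):

kubectl scale statefulset <name> -n <namespace> --replicas=0
kubectl delete pvc <broken-xfs-pvc> -n <namespace>
kubectl scale statefulset <name> -n <namespace> --replicas=1   # the fixed driver formats a fresh volume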

@ctrmcubed

@ctrmcubed the hotfix has now been rolled out completely in the northeurope and westeurope regions, please check. We will roll it out to the other regions next.

Confirmed this is now working in my region using v1.30.6.

@andyzhangx
Member

This issue has now been fixed in all regions.
