
"unable to open block device" random error #13

tscolari opened this issue May 13, 2018 · 6 comments

@tscolari

This error seems to be random: some of my PVCs work, but eventually one pod will get this error. It can happen either on a fresh helm deployment or on one that has been running for a few days.

In the pod using the volume/pvc:

Warning  FailedMount            34s               kubelet, k8s-node-2  Unable to mount volumes for pod "registry-6786df68c4-gxgp2_services(653d5612-56c5-11e8-bd87-de2b040ce014)": timeout expired waiting for volumes to attach or mount for pod "services"/"registry-6786df68c4-gxgp2". list of unmounted volumes=[registry-storage]. list of unattached volumes=[registry-storage credentials-volume default-token-bsjcf]
  Warning  FailedMount            26s (x9 over 2m)  kubelet, k8s-node-2  MountVolume.SetUp failed for volume "pvc-652a3868-56c5-11e8-bd87-de2b040ce014" : stat /var/lib/storageos/volumes/04cc0bb7-0934-3f78-96f4-b1882888abcc: no such file or directory

Log from the storageos pod:

time="2018-05-13T15:50:55Z" level=info msg="[[email protected]:5703]:: attempt to establish channel" category=clconmon module=supervisor
time="2018-05-13T15:50:56Z" level=warning msg="unable to open block device /var/lib/storageos/volumes/04cc0bb7-0934-3f78-96f4-b1882888abcc-sys: No such device or address" category=clconmon module=supervisor
time="2018-05-13T15:50:56Z" level=info msg="[[email protected]:5703]: connection established (fd=6)" category=clconn module=supervisor

I'm not sure how to debug this further to give you more information, but let me know if I can help, or if you suspect that I misconfigured something.
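
For reference, the missing path can be confirmed on the node named in the kubelet event (a sketch; run on k8s-node-2, with the volume ID taken from the error above):

# Does the path the kubelet is stat-ing exist on the node?
ls -l /var/lib/storageos/volumes/
ls -l /var/lib/storageos/volumes/04cc0bb7-0934-3f78-96f4-b1882888abcc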

@greenaar

I have also witnessed this. Three nodes, same hardware. Two of them have the volume path, the third does not. This is also accompanied by an increase in system load (one system reached a load average of 85 without ever creating the required folder structure). This occurs semi-regularly when autoprovisioning, but the node that has the failure is random. Ubuntu 18.04, docker 17.12.1-0ubuntu1, kubernetes v1.10.4, latest helm, overlay2 mode, weave CNI.

I don't know exactly, but I suspect it's something to do with this:

Jun 18 15:23:20 k8s2 kernel: [763686.657279] sd 6:0:1:8: [sdc] 10485760 512-byte logical blocks: (5.37 GB/5.00 GiB)
Jun 18 15:23:20 k8s2 kernel: [763686.657290] sd 6:0:1:8: [sdc] Write Protect is off
Jun 18 15:23:20 k8s2 kernel: [763686.657292] sd 6:0:1:8: [sdc] Mode Sense: 43 00 00 08
Jun 18 15:23:20 k8s2 kernel: [763686.657311] sd 6:0:1:8: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Jun 18 15:23:20 k8s2 kernel: [763686.657315] loopback/naa.6001405005050001: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION.
Jun 18 15:23:50 k8s2 kernel: [763717.002495] ABORT_TASK: Found referenced loopback task_tag: 0
Jun 18 15:24:44 k8s2 kernel: [763770.697823] fd_do_rw() returned -5 for non S_ISBLK

I've had to roll back the install for the moment, but I'd be happy to re-test this in the future. It almost works for me, and I'm optimistic.

@oicheryl
Contributor

oicheryl commented Jun 20, 2018

Thanks for the report @greenaar, we'll look into this. Do you have repro steps?

Can you run lsmod | grep tcm and see if you have the tcm_loop module loaded?
Can you run docker run -it --rm -v /mnt:/mnt:shared busybox sh -c /bin/date to check if mount propagation is enabled in Docker?
Can you share the full logs from storageos? (One way to pull them is sketched below.)
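
A sketch for pulling the logs (the storageos namespace and the pod naming are assumptions here; adjust to however the chart deployed it):

# Find the storageos pod running on the affected node
kubectl -n storageos get pods -o wide

# Dump that pod's logs
kubectl -n storageos logs <storageos-pod-on-the-affected-node>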

@greenaar

I will re-deploy it in the next day or so, and I should be able to get the dumps out for you.

My repro steps were really easy. Install via helm (the only change was an entry for join:).
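
Roughly, that install step looks like this (a sketch; the chart reference and the cluster.join value key are assumptions, and the node addresses are placeholders):

helm install storageos/storageos --name storageos \
  --set cluster.join="10.1.0.1,10.1.0.2,10.1.0.3"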

Create this volume claim:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shell-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: fast
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

and attach it to this deployment:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: shell
  namespace: default
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        io.kompose.service: shell
    spec:
      containers:
      - image: ubuntu:16.04
        command:
        - tail
        - "-f"
        - "/etc/hostname"
        name: shell
        volumeMounts:
        - name: shell
          mountPath: '/data'
      restartPolicy: Always
      volumes:
      - name: shell
        persistentVolumeClaim:
          claimName: shell-pvc
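
Then apply both and watch the pod's events (a sketch; the file names are placeholders):

kubectl apply -f shell-pvc.yaml
kubectl apply -f shell-deployment.yaml

# Watch the mount attempt on whichever node the pod lands on
kubectl describe pod -l io.kompose.service=shell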

Result: the mount point is created on 2 of my 3 nodes, but not on the node the pod is provisioning on. That one gets the fd_do_rw() log message and the ABORT_TASK, and its load gradually climbs up to near 100. Note that the host in question varied, but failure was almost assured.

I will add those other items you asked for as soon as I can. Oh, I should mention: these machines are bare metal, not virtual.

@Arau
Contributor

Arau commented Jun 22, 2018

Hi @greenaar,

Could you provide us with the logs of the pod running on the instance that has issues?

@greenaar

Sorry for the delay:

root@c1:/share/projects/kubernetes/personal/testing/storageos-charts# lsmod | grep tcm
tcm_loop 24576 3
target_core_mod 360448 8 target_core_iblock,tcm_loop,target_core_user,target_core_file,target_core_pscsi

Module is loaded

root@c1:/share/projects/kubernetes/personal/testing/storageos-charts# docker run -it --rm -v /mnt:/mnt:shared busybox sh -c /bin/date
Unable to find image 'busybox:latest' locally
latest: Pulling from library/busybox
07a152489297: Already exists
Digest: sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
Status: Downloaded newer image for busybox:latest
Sat Jun 23 01:34:20 UTC 2018

Propagation is enabled.

The moment before this system went off the rails, this appeared in syslog:

Jun 22 18:31:37 c1 kernel: [ 1745.323064] scsi host4: TCM_Loopback
Jun 22 18:31:37 c1 kernel: [ 1745.817046] scsi 4:0:1:1: Direct-Access LIO-ORG FILEIO 4.0 PQ: 0 ANSI: 5
Jun 22 18:31:37 c1 kernel: [ 1745.817182] sd 4:0:1:1: [sdb] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
Jun 22 18:31:37 c1 kernel: [ 1745.817198] sd 4:0:1:1: [sdb] Write Protect is off
Jun 22 18:31:37 c1 kernel: [ 1745.817199] sd 4:0:1:1: [sdb] Mode Sense: 43 00 00 08
Jun 22 18:31:37 c1 kernel: [ 1745.817211] sd 4:0:1:1: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Jun 22 18:31:37 c1 kernel: [ 1745.817214] loopback/naa.6001405005050001: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION.
Jun 22 18:31:37 c1 kernel: [ 1745.817332] sd 4:0:1:1: Attached scsi generic sg2 type 0
Jun 22 18:31:37 c1 kernel: [ 1745.817714] sd 4:0:1:1: [sdb] Attached SCSI disk
Jun 22 18:31:37 c1 kernel: [ 1745.914684] scsi host3: TCM_Loopback
Jun 22 18:31:44 c1 systemd-udevd[23920]: inotify_add_watch(9, /dev/sdb, 10) failed: No such file or directory

The load average climbed to about 15 or so, which interrupted both weave networking and the connection to the API.

Interestingly, I then deleted the PVC and re-applied it. The load average hit 5, and it created the volume on all 3 nodes.

tl;dr: it works... sometimes. When it doesn't, it blocks on a node; that includes blocking on removal when using helm delete --purge, requiring a system reboot to clear.

I realize this is not a lot of useful information, but I hope it helps in some way. I look forward to your successes.

@Arau
Contributor

Arau commented Jun 25, 2018

Hi @greenaar, thank you very much for all this information. We are working on troubleshooting this issue. We will keep you posted!
