"unable to open block device" random error #13
Comments
I have also witnessed this. Three nodes, same hardware: two of them have the volume path, the third does not. This is also accompanied by an increase in system load (one system reached a load average of 85 without ever creating the required folder structure). It occurs semi-regularly when autoprovisioning, but the node that hits the failure is random. Ubuntu 18.04, docker 17.12.1-0ubuntu1, Kubernetes v1.10.4, latest Helm, overlay2 mode, Weave CNI. I don't know for sure, but I suspect it has something to do with this:

Jun 18 15:23:20 k8s2 kernel: [763686.657279] sd 6:0:1:8: [sdc] 10485760 512-byte logical blocks: (5.37 GB/5.00 GiB)

I had to roll back the install for the moment, but I'd be happy to re-test this in the future. It almost works for me, and I'm optimistic.
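To compare a healthy node with the failing one, checks along these lines can be run on each node; the StorageOS volume directory path here is an assumption (the usual default), not something confirmed in this thread:

```sh
# Run on each node to compare a healthy node with the failing one.
ls -l /var/lib/storageos/volumes/    # volume path: present on two nodes, missing on the third (assumed default path)
uptime                               # load average spike on the failing node
dmesg | grep 'logical blocks'        # the sd/[sdc] kernel message quoted above
```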
Thanks for the report @greenaar, we'll look into this. Do you have repro steps? Can you do a
I will re-deploy it in the next day or so, and I should be able to get the dumps out for you. My repro steps were really easy: install via Helm (the only change was an entry for join:), create this volume claim (apiVersion: v1), and attach it to this deployment (apiVersion: extensions/v1beta1); see the sketch below.

Result: the mount point is created on 2 of my 3 nodes, but not on the node the pod is provisioning on. That one gets the fw_do_rw() log message and the ABORT_TASK, and its load gradually climbs to near 100. Note that the host in question varied, but failure was almost assured. I will add those other items you asked for as soon as I can. Oh, I should mention, these machines are bare metal, not virtual.
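A minimal sketch of a PVC and deployment along the lines described above; the StorageClass name (`fast`), the PVC size, and the nginx image are assumptions, not the original manifests:

```sh
# Hypothetical repro: a StorageOS-backed PVC plus a deployment that mounts it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast          # assumed StorageOS StorageClass name
  resources:
    requests:
      storage: 5Gi
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: test-deploy
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: test-deploy
    spec:
      containers:
        - name: app
          image: nginx              # assumed test image
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: test-pvc
EOF
```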
Hi @greenaar, is it feasible for you to provide us with the logs of the pod running on the instance that has issues?
Sorry for the delay:

root@c1:/share/projects/kubernetes/personal/testing/storageos-charts# lsmod | grep tcm
The module is loaded.

root@c1:/share/projects/kubernetes/personal/testing/storageos-charts# docker run -it --rm -v /mnt:/mnt:shared busybox sh -c /bin/date
Propagation is enabled.

The moment before this system went off the rails, this appeared in syslog:

Jun 22 18:31:37 c1 kernel: [ 1745.323064] scsi host4: TCM_Loopback

The load average climbed to about 15 or so, and it interrupted both Weave networking and the connection to the API. Interestingly, I then deleted the PVC and re-applied it: load average of 5, and it created the volume on all 3 nodes.

tl;dr: it works... sometimes. When it doesn't, it blocks on a node, including blocking on removal when using helm delete --purge, which requires a system reboot to clear. I realize this is not a lot of useful information, but I hope it helps in some way. I look forward to your successes.
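For readers following along, the two checks run above, in runnable form (output will differ per system):

```sh
# 1. Confirm the TCM (LIO target) kernel modules are loaded; an empty result
#    means the tcm_loop / target modules are not present.
lsmod | grep tcm

# 2. Confirm mount propagation into containers works by bind-mounting a shared
#    host path; docker reports an error if the host mount is not shared.
docker run -it --rm -v /mnt:/mnt:shared busybox sh -c /bin/date
```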
Hi @greenaar, thank you very much for all this information. We are working on troubleshooting this issue. We'll keep you posted!
This error seems to be random: some of my PVCs work, but eventually one pod will get this error:
It can happen either on a fresh Helm deployment or on one that has been running for a few days.
In the pod using the volume/pvc:
Log from the storageos pod:
Not sure how to further debug this to give you more information, but let me know if I can help, or if you suspect that I misconfigured something.
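A sketch of commands that could gather this kind of information; the namespace and the placeholder names are assumptions:

```sh
kubectl -n storageos get pods -o wide       # locate the StorageOS pod on the failing node (assumed namespace)
kubectl -n storageos logs <storageos-pod>   # StorageOS daemonset logs
kubectl describe pvc <pvc-name>             # provisioning events for the claim
kubectl describe pod <app-pod>              # attach/mount errors on the consuming pod
dmesg | tail -n 50                          # recent kernel messages (TCM_Loopback, ABORT_TASK, sd errors)
```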