Cannot start any VM/qube with PCI device passed through to it after Dynamic Launch #18

Open
miczyg1 opened this issue Sep 17, 2024 · 5 comments

Comments

@miczyg1

miczyg1 commented Sep 17, 2024

Domains with PCI devices passed through to them crash:

(XEN) Pagetable walk from ffffc90000207ff8:
(XEN)  L4[0x192] = 00000003183b9067 0000000000008fb9
(XEN)  L3[0x000] = 00000003183b8067 0000000000008fb8
(XEN)  L2[0x001] = 0000000321051067 0000000000000051
(XEN)  L1[0x007] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S: fault at ffff82d040357250 x86_64/entry.S#domain_crash_page_fault_6x8+0/0x4
(XEN) Domain 2 (vcpu#0) crashed on cpu#8:
(XEN) ----[ Xen-4.17.4  x86_64  debug=n  Tainted:  M     ]----
(XEN) CPU:    8
(XEN) RIP:    e033:[<ffffffff8124998b>]
(XEN) RFLAGS: 0000000000010202   EM: 1   CONTEXT: pv guest (d2v0)
(XEN) rax: 0000000000000000   rbx: ffffc90000208070   rcx: ffffc90000208070
(XEN) rdx: ffffffff818428a8   rsi: ffffc90000208018   rdi: ffffffff818428a8
(XEN) rbp: ffffffff81ab30e0   rsp: ffffc90000208000   r8:  ffffffff81242998
(XEN) r9:  0000000000000000   r10: ffffffff818428a8   r11: 0000000000000000
(XEN) r12: ffffffff818428a8   r13: 0000000000001fe0   r14: ffffc90000208070
(XEN) r15: ffffffff81ab1100   cr0: 0000000080050033   cr4: 0000000000b526e0
(XEN) cr3: 000000031f61d000   cr2: ffffc90000207ff8
(XEN) fsb: 0000000000000000   gsb: ffffffff81a37000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffc90000208000:
(XEN)   Stack empty.

/var/log/libvirt/libxl/libxl-driver.log shows:

2024-08-25 18:45:33.363+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.0
2024-08-25 18:45:33.375+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.2
2024-08-25 18:45:33.379+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:14.0
2024-08-25 18:50:00.305+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.0
2024-08-25 18:50:00.419+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.2
2024-08-25 18:50:00.533+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:14.0
2024-08-25 18:50:01.069+0000: libxl: libxl_pci.c:2098:do_pci_remove: Domain 6:xc_physdev_unmap_pirq irq=17: Invalid argument
2024-08-25 18:50:52.445+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.0
2024-08-25 18:50:52.465+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.2
2024-08-25 18:50:52.470+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:14.0
2024-09-13 19:06:21.778+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.0
2024-09-13 19:06:21.891+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.2
2024-09-13 19:06:21.999+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:14.0
2024-09-13 19:06:22.554+0000: libxl: libxl_pci.c:2098:do_pci_remove: Domain 2:xc_physdev_unmap_pirq irq=17: Invalid argument
2024-09-13 19:10:47.668+0000: libxl: libxl_dm.c:2857:stubdom_xswait_cb: Domain 1:Stubdom 2 for 1 startup: startup timed out
2024-09-13 19:10:47.677+0000: libxl: libxl_create.c:1975:domcreate_devmodel_started: Domain 1:device model did not start: -9
2024-09-13 19:10:57.827+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/2/0 not ready
2024-09-13 19:10:57.830+0000: libxl: libxl_pci.c:2098:do_pci_remove: Domain 2:xc_physdev_unmap_pirq irq=17: Invalid argument
2024-09-13 19:11:07.994+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/2/0 not ready
2024-09-13 19:11:10.129+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.0
2024-09-13 19:11:10.141+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.2
2024-09-13 19:11:10.145+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:14.0
2024-09-13 19:11:40.177+0000: libxl: libxl_dm.c:2857:stubdom_xswait_cb: Domain 3:Stubdom 4 for 3 startup: startup timed out
2024-09-13 19:11:40.177+0000: libxl: libxl_create.c:1975:domcreate_devmodel_started: Domain 3:device model did not start: -9
2024-09-13 19:11:40.182+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.0
2024-09-13 19:11:50.217+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/4/0 not ready
2024-09-13 19:11:50.220+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.2
2024-09-13 19:12:00.254+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/4/0 not ready
2024-09-13 19:12:00.257+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:14.0
2024-09-13 19:12:10.290+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/4/0 not ready
2024-09-13 19:12:42.880+0000: libxl: libxl_dm.c:2857:stubdom_xswait_cb: Domain 5:Stubdom 6 for 5 startup: startup timed out
2024-09-13 19:12:42.880+0000: libxl: libxl_create.c:1975:domcreate_devmodel_started: Domain 5:device model did not start: -9
2024-09-13 19:12:53.032+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/6/0 not ready
2024-09-13 19:12:53.034+0000: libxl: libxl_pci.c:2098:do_pci_remove: Domain 6:xc_physdev_unmap_pirq irq=17: Invalid argument
2024-09-13 19:13:03.207+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/6/0 not ready
2024-09-13 19:13:36.313+0000: libxl: libxl_dm.c:2857:stubdom_xswait_cb: Domain 7:Stubdom 8 for 7 startup: startup timed out
2024-09-13 19:13:36.313+0000: libxl: libxl_create.c:1975:domcreate_devmodel_started: Domain 7:device model did not start: -9
2024-09-13 19:13:46.468+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/8/0 not ready
2024-09-13 19:13:46.471+0000: libxl: libxl_pci.c:2098:do_pci_remove: Domain 8:xc_physdev_unmap_pirq irq=17: Invalid argument
2024-09-13 19:13:56.640+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/8/0 not ready
2024-09-13 19:15:20.921+0000: libxl: libxl_dm.c:2857:stubdom_xswait_cb: Domain 9:Stubdom 10 for 9 startup: startup timed out
2024-09-13 19:15:20.921+0000: libxl: libxl_create.c:1975:domcreate_devmodel_started: Domain 9:device model did not start: -9
2024-09-13 19:15:31.071+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/10/0 not ready
2024-09-13 19:15:31.073+0000: libxl: libxl_pci.c:2098:do_pci_remove: Domain 10:xc_physdev_unmap_pirq irq=17: Invalid argument
2024-09-13 19:15:41.245+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/10/0 not ready
2024-09-13 19:18:26.836+0000: libxl: libxl_dm.c:2857:stubdom_xswait_cb: Domain 11:Stubdom 12 for 11 startup: startup timed out
2024-09-13 19:18:26.836+0000: libxl: libxl_create.c:1975:domcreate_devmodel_started: Domain 11:device model did not start: -9
2024-09-13 19:18:36.988+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/12/0 not ready
2024-09-13 19:18:36.990+0000: libxl: libxl_pci.c:2098:do_pci_remove: Domain 12:xc_physdev_unmap_pirq irq=17: Invalid argument
2024-09-13 19:18:47.160+0000: libxl: libxl_device.c:1489:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/12/0 not ready
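Since the log above is long and repetitive, a quick way to see which libxl functions are reporting problems is to count entries per function. The following is only a minimal sketch: the regular expression is an assumption based on the line layout shown above, and `summarize` is a hypothetical helper, not part of libxl or Qubes tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from the lines quoted above:
# <date> <time>: libxl: <file>:<line>:<function>: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+): libxl: (?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+): (?P<msg>.*)$"
)

def summarize(lines):
    """Count log entries per libxl function that emitted them."""
    counts = Counter()
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if m:
            counts[m.group("func")] += 1
    return counts

# A few lines copied verbatim from the log in this issue.
sample = [
    "2024-09-13 19:06:21.778+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.0",
    "2024-09-13 19:06:21.891+0000: libxl: libxl_pci.c:1587:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:0d.2",
    "2024-09-13 19:06:22.554+0000: libxl: libxl_pci.c:2098:do_pci_remove: Domain 2:xc_physdev_unmap_pirq irq=17: Invalid argument",
    "2024-09-13 19:10:47.668+0000: libxl: libxl_dm.c:2857:stubdom_xswait_cb: Domain 1:Stubdom 2 for 1 startup: startup timed out",
]

for func, n in summarize(sample).most_common():
    print(f"{n:3d}  {func}")
```

To run it against the real file, pass `open("/var/log/libvirt/libxl/libxl-driver.log")` to `summarize` instead of `sample`.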

VMs/qubes without PCI devices passed through to them (and not depending on any such VM/qube) start correctly.

Qubes OS R4.2.2 on NovaCustom NV4XPZ with coreboot firmware.

@miczyg1
Author

miczyg1 commented Sep 22, 2024

A couple of facts/conclusions from the Qubes OS Summit hackathon:

  1. It is not related to PCI passthrough.
  2. The faulting VMs are the HVM ones. Tested by switching the vault VM to HVM without any PCI devices passed through to it. PV/PVH work well.
  3. Debugging the crashes didn't lead to anything meaningful.
  4. Booting with maxcpus=1 in Xen cmdline causes all the VMs to work properly. That said, the AP initialization with slaunch enabled is probably broken and should be investigated.

@andyhhp
Contributor

andyhhp commented Sep 22, 2024

To be more specific, it is the PV stub(ISH) qemu for the associated HVM domain which is crashing.

@krystian-hebel
Member

> 4. Booting with maxcpus=1 in Xen cmdline causes all the VMs to work properly. That said, the AP initialization with slaunch enabled is probably broken and should be investigated.

What about enabling SMP, does it change anything?

@krystian-hebel
Member

a19bd1b - I'd have to check, but I think Xen makes a copy of IA32_MISC_ENABLE later and, based on that copy, assumes which features are enabled on all cores, not just the BSP, and that MSR is modified by TXT.
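To make the concern concrete: if one per-CPU value of IA32_MISC_ENABLE (MSR 0x1A0) were available for each core, the bits that differ from the BSP could be listed as below. This is only an illustrative sketch. `asymmetric_bits` and `bit_list` are hypothetical helpers, the values in `example` are made up, and how the per-CPU values are collected is left open.

```python
IA32_MISC_ENABLE = 0x1A0  # MSR address of IA32_MISC_ENABLE

def asymmetric_bits(values):
    """Return {cpu: XOR mask against CPU 0} for every CPU whose value differs.

    `values` maps a CPU number to the 64-bit MSR value read on that CPU.
    CPU 0 is assumed to be the BSP.
    """
    bsp = values[0]
    return {cpu: v ^ bsp for cpu, v in sorted(values.items()) if v != bsp}

def bit_list(mask):
    """List the bit positions set in a 64-bit mask."""
    return [b for b in range(64) if (mask >> b) & 1]

# Made-up values for illustration only, NOT taken from the affected machine.
example = {0: 0x850089, 1: 0x850089, 2: 0x850081}

for cpu, mask in asymmetric_bits(example).items():
    print(f"cpu{cpu} differs from BSP in bits {bit_list(mask)}")
```

An empty result would mean every CPU agrees with the BSP; any entry marks a core where an assumption derived from the BSP's copy would not hold.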

@andyhhp
Contributor

andyhhp commented Sep 22, 2024

We should fix assumptions about MISC_ENABLES. After all, we know that firmware has a perfect track record of setting everything up coherently on APs...

That said, I'm still struggling to see how there's any connection between asymmetric misc enables and an entirely deterministic kernel stack overflow in the DM domain.
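For readers following along, the stack overflow reading can be cross-checked against the crash dump in the first comment. The sketch below uses only numbers from that dump (cr2 and rsp) and assumes standard 4-level x86-64 paging with 4 KiB pages. `walk_indices` is a hypothetical helper name.

```python
PAGE_SHIFT = 12   # 4 KiB pages
INDEX_BITS = 9    # 512 entries per page-table level

def walk_indices(vaddr):
    """Split a 4-level x86-64 virtual address into (L4, L3, L2, L1) indices."""
    return tuple(
        (vaddr >> (PAGE_SHIFT + INDEX_BITS * level)) & 0x1FF
        for level in (3, 2, 1, 0)
    )

cr2 = 0xFFFFC90000207FF8  # faulting address from the dump
rsp = 0xFFFFC90000208000  # guest stack pointer from the dump

print([hex(i) for i in walk_indices(cr2)])  # ['0x192', '0x0', '0x1', '0x7']
print(rsp - cr2)                            # 8
print(rsp % 4096 == 0)                      # True
```

The indices match the `L4[0x192]`, `L3[0x000]`, `L2[0x001]`, `L1[0x007]` lines of the pagetable walk, where the L1 entry is zero (not present). The faulting address is 8 bytes below a page-aligned rsp, which is consistent with a write just below the stack pointer landing in an unmapped page.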
