Hi, I have Proxmox as my host hypervisor, which uses QEMU. I have IOMMU working perfectly and have passed the PCIe device through to the guest OS.
When the driver loads in a VM (or at least as the VM boots), the VM causes the physical server to reset instantly.
In the BMC error log on the server I see a PCIe SERR event. There are no events in the guest VM logs or the host logs because the reset is so instant.
I am using an EPYC 9115 CPU on an ASRock Rack motherboard.
The Hailo is plugged into a PCIe bifurcation card that also holds 3 NVMe drives, which work perfectly when passed through.
Any ideas?
omria
May 20, 2025, 9:17pm
2
Hey @Alex_Balcanquall,
Welcome to the Hailo Community!
Can you test with just the Hailo card plugged into the PCIe bifurcation card and run the following:
lspci -nn
dmesg | grep -i iommu
to double-check the IOMMU group assignments.
If this works, you can see the Hailo, and your PCIe device shares an IOMMU group, try adding this to your GRUB config:
GRUB_CMDLINE_LINUX="intel_iommu=on pcie_acs_override=downstream"
You can also try forcing the descriptor page size for the Hailo driver:
options hailo_pci force_desc_page_size=4096
It doesn't share an IOMMU group; that was one of the first things I checked. Also, this is an EPYC platform, so I don't need to specify intel_iommu=on. Is the options hailo_pci line something I should put on the kernel command line or in a modprobe.d file?
omria:
dmesg | grep -i iommu
Interesting, I can't edit my other reply. Is it because I am new?
Thanks for welcoming me, it was appreciated. Just to validate for you:
01:00.0 Co-processor [0b40]: Hailo Technologies Ltd. Hailo-8 AI Processor [1e60:2864] (rev 01)
and you can see it's the only device in group 66:
[ 0.562515] pci 0000:01:00.0: Adding to iommu group 66
[ 0.562544] pci 0000:02:00.0: Adding to iommu group 67
sigh, full paste
[ 0.562485] pci 0000:00:18.7: Adding to iommu group 65
[ 0.562515] pci 0000:01:00.0: Adding to iommu group 66
[ 0.562544] pci 0000:02:00.0: Adding to iommu group 67
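For anyone who wants to run the same check without reading the output by eye, here is a minimal Python sketch that groups "Adding to iommu group" lines by group number. The function names `iommu_groups` and `shared_groups` are made up for this example, and the sample text is just the three lines pasted above, so it says nothing about devices that were not in the excerpt.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   [ 0.562515] pci 0000:01:00.0: Adding to iommu group 66
GROUP_RE = re.compile(r"pci (\S+): Adding to iommu group (\d+)")


def iommu_groups(dmesg_text):
    """Map each IOMMU group number to the PCI addresses assigned to it."""
    groups = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = GROUP_RE.search(line)
        if match:
            groups[int(match.group(2))].append(match.group(1))
    return dict(groups)


def shared_groups(groups):
    """Return only the groups that contain more than one device."""
    return {g: devs for g, devs in groups.items() if len(devs) > 1}


# The three lines pasted in this thread (hypothetical variable name).
SAMPLE = """\
[ 0.562485] pci 0000:00:18.7: Adding to iommu group 65
[ 0.562515] pci 0000:01:00.0: Adding to iommu group 66
[ 0.562544] pci 0000:02:00.0: Adding to iommu group 67
"""

if __name__ == "__main__":
    groups = iommu_groups(SAMPLE)
    print("group 66:", groups[66])
    print("shared groups:", shared_groups(groups))
```

On a real host you would feed it the full output of `dmesg | grep -i iommu` rather than a short excerpt.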
Hi, I am pleased to report I have stopped the crashing that was occurring when the Hailo was plugged into the bifurcation board with the other NVMe drives.
I moved it onto a different PCIe board that doesn't require bifurcation (though the slot is still set to x8x4x4).
The device doesn't reset, but it does disappear in the VM as a removed card.
This is on the host:
root@pve-nas1:/var/lib/vz/snippets# dmesg -T | grep c0:01.1
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: [1022:153e] type 01 class 0x060400 PCIe Root Port
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: PCI bridge to [bus c1-c2]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: bridge window [mem 0xd8000000-0xd80fffff]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: bridge window [mem 0x10befff00000-0x10beffffffff 64bit pref]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: PME# supported from D0 D3hot D3cold
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: PCI bridge to [bus c1-c2]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: bridge window [io 0x1000-0x0fff] to [bus c1-c2] add_size 1000
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: bridge window [io 0xd000-0xdfff]: assigned
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: PCI bridge to [bus c1-c2]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: bridge window [io 0xd000-0xdfff]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: bridge window [mem 0xd8000000-0xd80fffff]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: bridge window [mem 0x10befff00000-0x10beffffffff 64bit pref]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: Adding to iommu group 48
[Wed May 21 12:05:44 2025] pcieport 0000:c0:01.1: PME: Signaling with IRQ 68
[Wed May 21 12:05:44 2025] pcieport 0000:c0:01.1: pciehp: Slot #21 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise- Interlock- NoCompl+ IbPresDis- LLActRep+
[Wed May 21 12:15:13 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Down
[Wed May 21 12:15:13 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Card not present
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Card present
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Up
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: ASPM: current common clock configuration is inconsistent, reconfiguring
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: PCI bridge to [bus c1-c2]
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: bridge window [io 0xd000-0xdfff]
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: bridge window [mem 0xd8000000-0xd80fffff]
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: bridge window [mem 0x10befff00000-0x10beffffffff 64bit pref]
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Down
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Card not present
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Card present
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Up
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: ASPM: current common clock configuration is inconsistent, reconfiguring
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: PCI bridge to [bus c1-c2]
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: bridge window [io 0xd000-0xdfff]
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: bridge window [mem 0xd8000000-0xd80fffff]
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: bridge window [mem 0x10befff00000-0x10beffffffff 64bit pref]
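The link flaps in a log like the one above are easy to miss among the bridge-window lines. A rough Python sketch for pulling out only the pciehp slot events follows; `pciehp_events` and `link_down_times` are names invented here, and the regular expression only assumes the line layout visible in this paste.

```python
import re

# Matches host lines such as:
#   [Wed May 21 12:15:13 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Down
EVENT_RE = re.compile(
    r"\[(?P<when>[^\]]+)\] pcieport (?P<port>\S+): pciehp: "
    r"Slot\((?P<slot>\d+)\): "
    r"(?P<event>Link Down|Link Up|Card present|Card not present)"
)


def pciehp_events(dmesg_text):
    """Return (timestamp, port, event) tuples for pciehp slot events."""
    events = []
    for line in dmesg_text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            events.append(
                (match.group("when"), match.group("port"), match.group("event"))
            )
    return events


def link_down_times(dmesg_text):
    """Return the timestamps of every 'Link Down' event."""
    return [when for when, _, event in pciehp_events(dmesg_text)
            if event == "Link Down"]


# Slot events copied from the host log in this thread.
HOST_SAMPLE = """\
[Wed May 21 12:15:13 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Down
[Wed May 21 12:15:13 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Card not present
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Card present
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Up
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Down
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Up
"""

if __name__ == "__main__":
    for when in link_down_times(HOST_SAMPLE):
        print("link down at", when)
```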
This is on the guest:
root@debian-dev-628:~# dmesg -T | grep hailo
[Wed May 21 12:19:13 2025] hailo_pci: loading out-of-tree module taints kernel.
[Wed May 21 12:19:13 2025] hailo_pci: module verification failed: signature and/or required key missing - tainting kernel
[Wed May 21 12:19:13 2025] hailo: Init module. driver version 4.19.0
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing on: 1e60:2864...
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: Allocate memory for device extension, 11632
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: Device enabled
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: mapped bar 0 - (____ptrval____) 16384
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: mapped bar 2 - (____ptrval____) 4096
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: mapped bar 4 - (____ptrval____) 16384
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: Setting max_desc_page_size to 4096, (page_size=4096)
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: Enabled 64 bit dma
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: Using userspace allocated vdma buffers
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Disabling ASPM L0s
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Successfully disabled ASPM L0s
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Writing file hailo/hailo8_fw.bin
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: firmware: direct-loading firmware hailo/hailo8_fw.bin
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: File hailo/hailo8_fw.bin written successfully
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Writing file hailo/hailo8_board_cfg.bin
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: firmware: failed to load hailo/hailo8_board_cfg.bin (-2)
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: firmware: failed to load hailo/hailo8_board_cfg.bin (-2)
[Wed May 21 12:19:13 2025] Failed to write file hailo/hailo8_board_cfg.bin
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: File hailo/hailo8_board_cfg.bin written successfully
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Writing file hailo/hailo8_fw_cfg.bin
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: firmware: failed to load hailo/hailo8_fw_cfg.bin (-2)
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: firmware: failed to load hailo/hailo8_fw_cfg.bin (-2)
[Wed May 21 12:19:13 2025] Failed to write file hailo/hailo8_fw_cfg.bin
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: File hailo/hailo8_fw_cfg.bin written successfully
[Wed May 21 12:19:14 2025] hailo 0000:01:00.0: Firmware loaded successfully
[Wed May 21 12:19:14 2025] hailo 0000:01:00.0: Probing: Added board 1e60-2864, /dev/hailo0
[Wed May 21 12:19:16 2025] hailo 0000:01:00.0: Remove: Releasing board
[Wed May 21 12:19:16 2025] hailo 0000:01:00.0: Remove: Freed board, /dev/hailo0
I should note I have a hookscript that runs each time the VM starts and issues the following command (writing an empty string clears the device's reset methods), so I know that's not the issue:
echo > /sys/bus/pci/devices/0000:c1:00.0/reset_method
It seems the board is showing as link down on the host when it resets due to the firmware being loaded?
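One way to sanity-check that suspicion is to line up the guest's "Firmware loaded successfully" timestamp with the next "Link Down" on the host. The sketch below does that in Python; the helper names are hypothetical, and it assumes host and guest clocks are roughly in sync and that `dmesg -T` timestamps are accurate to about a second, neither of which is guaranteed.

```python
import re
from datetime import datetime

STAMP_RE = re.compile(r"^\[([^\]]+)\]")
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Wed May 21 12:19:14 2025"


def stamp(line):
    """Parse the leading `dmesg -T` timestamp of a line, or return None."""
    match = STAMP_RE.match(line)
    if not match:
        return None
    return datetime.strptime(match.group(1), STAMP_FORMAT)


def seconds_from_firmware_to_link_down(guest_lines, host_lines):
    """Seconds between the guest's firmware load and the next host Link Down."""
    loaded = None
    for line in guest_lines:
        if "Firmware loaded successfully" in line:
            loaded = stamp(line)
            break
    if loaded is None:
        return None
    for line in host_lines:
        when = stamp(line)
        if "Link Down" in line and when is not None and when >= loaded:
            return (when - loaded).total_seconds()
    return None


# Lines copied from the guest and host logs in this thread.
GUEST = [
    "[Wed May 21 12:19:14 2025] hailo 0000:01:00.0: Firmware loaded successfully",
    "[Wed May 21 12:19:16 2025] hailo 0000:01:00.0: Remove: Releasing board",
]
HOST = [
    "[Wed May 21 12:15:13 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Down",
    "[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Down",
]

if __name__ == "__main__":
    print(seconds_from_firmware_to_link_down(GUEST, HOST))
```

A small gap between the two is consistent with the link dropping right after the firmware load, though it does not prove cause and effect on its own.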
I have found the root cause of the resets: the card triggers a PCIe hotplug event when the driver loads.
If the card is on a bifurcation card or a simple PCIe adapter, this causes a removal event.
If the BIOS on the motherboard has hotplug disabled, this is treated as a fault, and on an EPYC server motherboard it causes a reset because of the NMI that is generated.
Fix = turn on hotplug on the PCIe slots.
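The host log earlier in this thread contains a pciehp line listing the slot's capability flags (`HotPlug+`, `Surprise-` and so on). As a rough aid, this Python sketch decodes such a line into booleans. The name `slot_flags` is invented for the example, and whether these flags track the BIOS hotplug setting on a given board is an assumption you would need to verify yourself.

```python
def slot_flags(pciehp_line):
    """Decode `Name+` / `Name-` tokens from a pciehp 'Slot #N ...' line."""
    _, _, tail = pciehp_line.partition("pciehp: Slot #")
    flags = {}
    for token in tail.split()[1:]:  # first token is the slot number itself
        if token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return flags


# Line copied from the host log in this thread.
SLOT_LINE = (
    "[Wed May 21 12:05:44 2025] pcieport 0000:c0:01.1: pciehp: Slot #21 "
    "AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise- Interlock- "
    "NoCompl+ IbPresDis- LLActRep+"
)

if __name__ == "__main__":
    flags = slot_flags(SLOT_LINE)
    print("hotplug capable:", flags.get("HotPlug"))
    print("surprise removal:", flags.get("Surprise"))
```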
Final solution to all my issues.
The first step was to create a modprobe.d file that contained:
options vfio-pci ids=1e60:2864 disable_vfio_pci_flr=1
The second was to stop mapping the device through the GUI and instead use this set of args in the vmid.conf file (the hotplug=off was key):
args: -device pcie-root-port,id=pcie_hailo,slot=10,bus=pcie.0,chassis=10,hotplug=off -device vfio-pci,host=c1:00.0,bus=pcie_hailo,addr=0x0
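If you need the same line for a card at a different host address, a small helper can assemble it. This is only a sketch: `hailo_args` is a made-up name, the defaults simply mirror the values in the line above, and it does not check that the slot or chassis numbers are free in your VM.

```python
def hailo_args(host_addr, port_id="pcie_hailo", slot=10, chassis=10):
    """Build the `args:` line with a non-hotplug root port for a vfio-pci device."""
    root_port = (
        f"-device pcie-root-port,id={port_id},slot={slot},"
        f"bus=pcie.0,chassis={chassis},hotplug=off"
    )
    vfio = f"-device vfio-pci,host={host_addr},bus={port_id},addr=0x0"
    return f"args: {root_port} {vfio}"


if __name__ == "__main__":
    # c1:00.0 is the host address used in this thread.
    print(hailo_args("c1:00.0"))
```

Called with `c1:00.0` and the defaults, it reproduces the line from the post; the result still has to be pasted into the VM's config file by hand.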