Hailo8 driver hard resets server when run in a VM with passthrough.

Hi, I have Proxmox as my host hypervisor, which uses QEMU. I have IOMMU working perfectly and have passed the PCIe device through to the guest OS.

When the driver loads in the VM (or at least as the VM boots), the VM causes the physical server to reset instantly.

In the BMC error log on the server I see a PCIe SERR event. There are no events in the guest VM logs or the host logs, as the reset is so instant.

I am using an EPYC 9115 CPU on an ASRock Rack motherboard.

The Hailo is plugged into a PCIe bifurcation card that also holds 3 NVMe drives, which work perfectly when passed through.

Any ideas?

Hey @Alex_Balcanquall,

Welcome to the Hailo Community!

Can you try testing with just the Hailo card plugged into the PCIe bifurcation card, and run the following: lspci -nn and dmesg | grep -i iommu, to double-check the IOMMU group assignments?
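
As a cross-check on the group assignments, here is a minimal sketch that prints each PCI device together with its IOMMU group, assuming the standard sysfs layout under /sys/kernel/iommu_groups. The function name list_iommu_groups and its optional root argument are made up for illustration; the argument only exists so the function can be pointed at a test directory.

```shell
#!/bin/sh
# Print "<group> <pci-address>" for every device under an IOMMU groups tree.
# The root defaults to the standard sysfs location; the optional argument is
# an illustrative extra, not part of any existing tool.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # glob did not match: no IOMMU groups found
        group="${dev%/devices/*}"      # strip the trailing "/devices/<address>"
        group="${group##*/}"           # keep only the group number
        printf '%s %s\n' "$group" "${dev##*/}"
    done
}

# Sort numerically by group so shared groups are easy to spot.
list_iommu_groups | sort -n
```

If the Hailo is isolated, its address should be the only entry carrying its group number.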

If this works and you can see the Hailo, and if your PCIe device shares an IOMMU group, try adding this to your GRUB config:

GRUB_CMDLINE_LINUX="intel_iommu=on pcie_acs_override=downstream"

You can also try forcing the descriptor page size for the Hailo driver:

options hailo_pci force_desc_page_size=4096

It doesn't share an IOMMU group; that was one of the first things I checked. Also, this is an EPYC platform, so I don't need to specify intel_iommu=on. Is the options hailo_pci line something I should put on the kernel command line or in a modprobe.d file?

Interesting, I can't edit my other reply. Is it because I am new?

Thanks for welcoming me, it was appreciated. Just to validate for you:

01:00.0 Co-processor [0b40]: Hailo Technologies Ltd. Hailo-8 AI Processor [1e60:2864] (rev 01)
and you can see it's the only device in group 66:

[    0.562515] pci 0000:01:00.0: Adding to iommu group 66
[    0.562544] pci 0000:02:00.0: Adding to iommu group 67

Sigh, full paste:

[    0.562485] pci 0000:00:18.7: Adding to iommu group 65
[    0.562515] pci 0000:01:00.0: Adding to iommu group 66
[    0.562544] pci 0000:02:00.0: Adding to iommu group 67

Hi, I am pleased to report I have stopped the crashing that was occurring when the Hailo was plugged into the bifurcation board with the other NVMe drives.

I moved it to a different PCIe board that doesn't require bifurcation (though the slot is still set to x8x4x4).

The device doesn't reset, but it does disappear in the VM as a removed card.

This is on the host:

root@pve-nas1:/var/lib/vz/snippets# dmesg -T | grep c0:01.1
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: [1022:153e] type 01 class 0x060400 PCIe Root Port
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: PCI bridge to [bus c1-c2]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1:   bridge window [mem 0xd8000000-0xd80fffff]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1:   bridge window [mem 0x10befff00000-0x10beffffffff 64bit pref]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: PME# supported from D0 D3hot D3cold
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: PCI bridge to [bus c1-c2]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: bridge window [io  0x1000-0x0fff] to [bus c1-c2] add_size 1000
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: bridge window [io  0xd000-0xdfff]: assigned
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: PCI bridge to [bus c1-c2]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1:   bridge window [io  0xd000-0xdfff]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1:   bridge window [mem 0xd8000000-0xd80fffff]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1:   bridge window [mem 0x10befff00000-0x10beffffffff 64bit pref]
[Wed May 21 12:05:44 2025] pci 0000:c0:01.1: Adding to iommu group 48
[Wed May 21 12:05:44 2025] pcieport 0000:c0:01.1: PME: Signaling with IRQ 68
[Wed May 21 12:05:44 2025] pcieport 0000:c0:01.1: pciehp: Slot #21 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise- Interlock- NoCompl+ IbPresDis- LLActRep+
[Wed May 21 12:15:13 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Down
[Wed May 21 12:15:13 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Card not present
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Card present
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Up
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: ASPM: current common clock configuration is inconsistent, reconfiguring
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1: PCI bridge to [bus c1-c2]
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1:   bridge window [io  0xd000-0xdfff]
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1:   bridge window [mem 0xd8000000-0xd80fffff]
[Wed May 21 12:15:14 2025] pcieport 0000:c0:01.1:   bridge window [mem 0x10befff00000-0x10beffffffff 64bit pref]
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Down
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Card not present
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Card present
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: pciehp: Slot(21): Link Up
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: ASPM: current common clock configuration is inconsistent, reconfiguring
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1: PCI bridge to [bus c1-c2]
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1:   bridge window [io  0xd000-0xdfff]
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1:   bridge window [mem 0xd8000000-0xd80fffff]
[Wed May 21 12:19:16 2025] pcieport 0000:c0:01.1:   bridge window [mem 0x10befff00000-0x10beffffffff 64bit pref]

This is on the guest:

root@debian-dev-628:~# dmesg -T | grep hailo
[Wed May 21 12:19:13 2025] hailo_pci: loading out-of-tree module taints kernel.
[Wed May 21 12:19:13 2025] hailo_pci: module verification failed: signature and/or required key missing - tainting kernel
[Wed May 21 12:19:13 2025] hailo: Init module. driver version 4.19.0
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing on: 1e60:2864...
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: Allocate memory for device extension, 11632
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: Device enabled
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: mapped bar 0 - (____ptrval____) 16384
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: mapped bar 2 - (____ptrval____) 4096
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: mapped bar 4 - (____ptrval____) 16384
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: Setting max_desc_page_size to 4096, (page_size=4096)
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: Enabled 64 bit dma
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Probing: Using userspace allocated vdma buffers
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Disabling ASPM L0s 
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Successfully disabled ASPM L0s 
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Writing file hailo/hailo8_fw.bin
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: firmware: direct-loading firmware hailo/hailo8_fw.bin
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: File hailo/hailo8_fw.bin written successfully
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Writing file hailo/hailo8_board_cfg.bin
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: firmware: failed to load hailo/hailo8_board_cfg.bin (-2)
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: firmware: failed to load hailo/hailo8_board_cfg.bin (-2)
[Wed May 21 12:19:13 2025] Failed to write file hailo/hailo8_board_cfg.bin
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: File hailo/hailo8_board_cfg.bin written successfully
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: Writing file hailo/hailo8_fw_cfg.bin
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: firmware: failed to load hailo/hailo8_fw_cfg.bin (-2)
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: firmware: failed to load hailo/hailo8_fw_cfg.bin (-2)
[Wed May 21 12:19:13 2025] Failed to write file hailo/hailo8_fw_cfg.bin
[Wed May 21 12:19:13 2025] hailo 0000:01:00.0: File hailo/hailo8_fw_cfg.bin written successfully
[Wed May 21 12:19:14 2025] hailo 0000:01:00.0: Firmware loaded successfully
[Wed May 21 12:19:14 2025] hailo 0000:01:00.0: Probing: Added board 1e60-2864, /dev/hailo0
[Wed May 21 12:19:16 2025] hailo 0000:01:00.0: Remove: Releasing board
[Wed May 21 12:19:16 2025] hailo 0000:01:00.0: Remove: Freed board, /dev/hailo0

I should note that I have a hookscript that runs each time the VM starts and issues the following command, so I know that's not the issue:

echo > /sys/bus/pci/devices/0000:c1:00.0/reset_method
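
For anyone wanting to reproduce this, a hookscript along these lines could look like the sketch below. Proxmox calls a hookscript with the VM ID and the phase as its two arguments. The choice of the pre-start phase, the function name clear_reset_method and its optional sysfs root argument are assumptions for illustration, not the script actually in use here.

```shell
#!/bin/sh
# Clear the list of reset methods for one PCI device by writing an empty
# line to its sysfs reset_method file (same effect as the echo above).
# The second argument (sysfs root) is a hypothetical extra for testing.
clear_reset_method() {
    dev="$1"
    sysfs="${2:-/sys}"
    file="$sysfs/bus/pci/devices/$dev/reset_method"
    if [ ! -w "$file" ]; then
        echo "no writable reset_method for $dev" >&2
        return 1
    fi
    echo > "$file"
}

# Proxmox invokes the hookscript as: <script> <vmid> <phase>
# Assumption: pre-start is the phase in which the reset methods are cleared.
if [ "${2:-}" = "pre-start" ]; then
    clear_reset_method 0000:c1:00.0
fi
```

Reading the file back afterwards (cat on the same path) should print an empty line if the write took effect.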

It seems the board shows as link down on the host when it resets due to the firmware being loaded?
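
One way to check this from the host is to count how often the root port reports a link drop and compare the timestamps with the guest's firmware load. A minimal sketch, assuming the kernel log format shown in the host output above; the function name count_link_down is hypothetical.

```shell
#!/bin/sh
# Count pciehp "Link Down" events for one root port.
# Reads kernel log text on stdin; $1 is the port address as it appears
# in the log, e.g. 0000:c0:01.1.
count_link_down() {
    port="$1"
    # First keep only pciehp messages from this port, then count link drops.
    grep -F "pcieport $port: pciehp:" | grep -c 'Link Down'
}

# Usage on the host:
#   dmesg -T | count_link_down 0000:c0:01.1
```

In the host log above this would report 2, one drop per VM start, each within a few seconds of the guest loading the firmware.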

I have found the root cause of the resets: the card triggers a PCIe hotplug event when the driver loads.

If the card is on a bifurcation card or a simple PCIe adapter, this causes a removal event.

If the BIOS on the motherboard has hotplug disabled, this is treated as a fault, and on an EPYC server motherboard it causes a reset due to the NMI that is generated.

Fix = turn on hotplug on the PCIe slots.
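
After changing the BIOS setting, one way to check that the root port now advertises hotplug is to look at the Slot Capabilities (SltCap) line in lspci -vv output. A rough sketch, assuming lspci is run as root on the host and that the Hailo sits behind root port c0:01.1 as in the host log above; slot_hotplug_capable is a made-up helper name.

```shell
#!/bin/sh
# Succeed if `lspci -vv` output on stdin contains a slot capability line
# with the hotplug-capable flag set ("HotPlug+").
slot_hotplug_capable() {
    grep 'SltCap:' | grep -q 'HotPlug+'
}

# Usage on the host (root is needed to read the capability registers):
#   lspci -vv -s c0:01.1 | slot_hotplug_capable && echo "hotplug capable"
```

The same HotPlug+ flag also appears in the pciehp line the kernel prints for the slot at boot.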

Final solution to all my issues.

The first step was to create a file in modprobe.d that contained:
options vfio-pci ids=1e60:2864 disable_vfio_pci_flr=1

The second was to stop mapping the device through the GUI and instead use this set of args in the vmid.conf file (the hotplug=off was key):

args: -device pcie-root-port,id=pcie_hailo,slot=10,bus=pcie.0,chassis=10,hotplug=off -device vfio-pci,host=c1:00.0,bus=pcie_hailo,addr=0x0