Hi everyone,
I am experiencing an issue where my new Raspberry Pi AI HAT+ 2 (featuring the Hailo-10H with 8GB RAM) is failing to load multi-context models (such as Qwen2 and Whisper-Base) using the Python/C++ APIs, even though running individual network groups via hailortcli run2 set-net works successfully.
Here are the details of my setup and the diagnostic logs.
Environment & Specs
- Host Platform: Raspberry Pi 5 (8GB)
- OS: Raspberry Pi OS 64-bit (Debian Trixie base)
- Kernel: 6.12.34+rpt-rpi-v8 #1 SMP PREEMPT
- NPU Hardware: Raspberry Pi AI HAT+ 2 (Hailo-10H, 8GB dedicated memory)
- Power Supply: Official Raspberry Pi 27W USB-C PSU
- HailoRT Version: 5.3.0
- PCIe Driver Version: 5.3.0 (from hailort-pcie-driver)
- Firmware Version: 5.3.0 (programmed successfully by driver during boot)
What Works:
Running individual groups of a compiled HEF independently with hailortcli run2 set-net works perfectly at Gen 2 speed. For example, benchmarking Whisper-Base or Qwen2 groups:
- hailortcli run2 set-net Whisper-Base.hef --name whisper_base_10s_encoder → Runs at 19.78 FPS (Succeeded)
- hailortcli run2 set-net qwen2_1.5b.hef --name base_model__prefill → Runs at 3.40 FPS (Succeeded)
The Problem:
When attempting to initialize the model using high-level APIs that load the full pipeline (e.g. C++ LLM class in hailo-ollama, or Python LLM(vdevice, model_path) and Speech2Text(vdevice, model_path) classes), the initialization fails immediately on the first launch transfer:
Application Log:
[HailoRT] [error] Ioctl HAILO_VDMA_LAUNCH_TRANSFER failed with 5. Read dmesg log for more info
[HailoRT] [error] CHECK_SUCCESS failed with status=HAILO_DRIVER_OPERATION_FAILED(36) - Failed launch transfer
[HailoRT] [error] Ioctl HAILO_SOC_CLOSE failed due to timeout
[HailoRT] [error] CHECK_SUCCESS failed with status=HAILO_DRIVER_TIMEOUT(87) - Failed soc_close
Host dmesg Output during failure:
[55.963169] hailo1x 0001:01:00.0: Timeout waiting for soc control (timeout_ms=1000)
[55.963175] hailo1x 0001:01:00.0: soc_close failed with err=-110
[55.978742] Channel 0 num-available HW mismatch (20d!=65535d)
[55.978752] hailo1x 0001:01:00.0: Failed to launch transfer
[55.978865] Channel 0 num-available HW mismatch (20d!=65535d)
[55.978868] hailo1x 0001:01:00.0: Failed to launch transfer
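As a rough aid for reading the dmesg lines above, here is a minimal Python sketch that extracts the two counters from each "num-available HW mismatch" line. The function name and field names are hypothetical, and the log does not say which number is the driver's value and which is the hardware's, so they are simply called first and second. The second value, 65535, equals 0xFFFF. An all-ones value is commonly what a read returns when a PCIe device has stopped responding, but whether that is what happened here is an assumption.

```python
import re

# Matches driver lines such as:
#   "Channel 0 num-available HW mismatch (20d!=65535d)"
MISMATCH_RE = re.compile(
    r"Channel (\d+) num-available HW mismatch \((\d+)d!=(\d+)d\)"
)


def parse_mismatches(dmesg_text):
    """Return one dict per mismatch line found in dmesg_text."""
    results = []
    for match in MISMATCH_RE.finditer(dmesg_text):
        channel, first, second = (int(g) for g in match.groups())
        results.append({
            "channel": channel,
            "first": first,
            "second": second,
            # 65535 == 0xFFFF, i.e. every bit of a 16-bit field is set.
            "second_is_all_ones": second == 0xFFFF,
        })
    return results


# Two lines copied from the dmesg output in this post.
sample = """\
[55.978742] Channel 0 num-available HW mismatch (20d!=65535d)
[55.978752] hailo1x 0001:01:00.0: Failed to launch transfer
"""

for entry in parse_mismatches(sample):
    print(entry)
```

Feeding the full output of dmesg into parse_mismatches shows whether every failing transfer reports the same all-ones value or whether the counters vary between attempts.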
Once this timeout occurs, the PCIe NPU interface is wedged. If I attempt to programmatically remove and rescan the device to reload firmware, the driver fails to re-activate the board over vDMA:
[417.173000] hailo1x 0001:01:00.0: Timeout waiting for vDMA boot data completion
[417.173050] hailo1x 0001:01:00.0: Failed writing firmware files over vDMA. err -110
[417.173100] hailo1x 0001:01:00.0: Failed writing SOC firmware on stage 3
[421.199000] hailo1x 0001:01:00.0: SCU log could not be read from device
[421.199100] hailo1x 0001:01:00.0: Firmware load failed
[422.020000] hailo1x 0001:01:00.0: Failed activating board -110
[422.020100] hailo1x 0001:01:00.0: probe with driver hailo1x failed with error -110
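For reference, one common way to remove and rescan a PCI device on Linux is through the sysfs files /sys/bus/pci/devices/<address>/remove and /sys/bus/pci/rescan. The sketch below assumes that method and uses the device address 0001:01:00.0 from the logs; the function name and its parameters are hypothetical, and the exact procedure used to produce the log above may have differed.

```python
from pathlib import Path
import time


def remove_and_rescan(bdf="0001:01:00.0", sysfs_root="/sys", settle_s=1.0):
    """Detach a PCI device, then ask the kernel to re-enumerate the bus.

    Needs root on a real system. Returns True if the device node exists
    after the rescan. sysfs_root is a parameter only so the function can
    be exercised against a fake directory tree.
    """
    root = Path(sysfs_root)
    device = root / "bus" / "pci" / "devices" / bdf

    if device.exists():
        # Writing "1" to remove detaches the device from the PCI core.
        (device / "remove").write_text("1")
        time.sleep(settle_s)

    # Writing "1" to rescan re-enumerates the bus, which re-probes the driver.
    (root / "bus" / "pci" / "rescan").write_text("1")
    time.sleep(settle_s)

    return device.exists()


# Example (run as root on the Pi):
#   print(remove_and_rescan())
```

Note that a True result only means the device was enumerated again. In the log above the probe itself fails with -110, so dmesg still has to be checked for the firmware load result.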
Only a full hardware power cycle brings the board back to a detectable state.
Troubleshooting Executed So Far:
- Physical Connections: Re-seated the FFC (ribbon) cable completely on both the Pi 5 and the AI HAT+ 2 connector slots. The cable is completely flat and locked down.
- PCIe Gen Settings: Tested at both dtparam=pciex1_gen=2 and dtparam=pciex1_gen=3. The behavior is identical.
- PCIe Bus Integrity Audit: Polled the PCIe Advanced Error Reporting (AER) registers immediately after running the successful hailortcli run2 benchmarks. The error status registers show exactly zero receiver errors (RxErr-, BadTLP-, BadDLLP-), proving the physical cable and PCIe link layer are not experiencing transmission drops.
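The AER check in the last item can be made repeatable with a small parser. This is a minimal sketch that assumes the registers were read with lspci -vvv, which prints AER status lines such as CESta (correctable error status) with a + or - suffix on each flag. The function name is hypothetical, and the sample line is illustrative rather than output captured from this board.

```python
def parse_aer_flags(lspci_text, register="CESta"):
    """Parse a line like 'CESta: RxErr- BadTLP-' into {flag_name: is_set}."""
    for line in lspci_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(register + ":"):
            tokens = stripped.split(":", 1)[1].split()
            # A trailing '+' means the flag is set, '-' means it is clear.
            return {t[:-1]: t.endswith("+") for t in tokens if t[-1] in "+-"}
    return {}


# Illustrative line in the format printed by lspci -vvv.
sample = "\t\tCESta:\tRxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-\n"

flags = parse_aer_flags(sample)
print([name for name, is_set in flags.items() if is_set])  # prints [] for this sample
```

The same function can be run on output captured right after a failing LLM or Speech2Text initialization, provided the device still answers config-space reads, to compare the flags against the ones recorded after the successful benchmarks.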
Is this a known issue/bug with the 5.3.0 firmware and PCIe driver when allocating memory channels for multi-context pipelines, or is it likely a defect with the LPDDR4X memory/silicon on the AI HAT+ 2 board itself?
Thanks in advance for the help!