Hi Michael,
Thanks for the follow-up. I pulled the update and did some testing:
Setup
Raspberry Pi 5 (2GB), Raspbian Lite Trixie, kernel 6.12.75, Hailo-10H AI HAT+ 2 via PCIe. HailoRT 5.2.0, hailort-pcie-driver 5.2.0, hailo-apps-infra 26.3.0, Python 3.12.
YOLOv8m HEF from v5.2.0/hailo10h/yolov8m.hef.
pcie_aspm=off in cmdline.txt.
PCIe link up at 8.0 GT/s x1. No dtparam=pciex1_gen=3 forcing; it auto-negotiates Gen 3.
USB camera on /dev/video0 at 1280x720 MJPG 30fps.
No display, no pygame.
Prior state on HailoRT 5.1.1
Running the bare inference stress with dummy data (no camera, no display) held up for 5 minutes at thousands of frames with zero errors. Adding the USB camera, also without a display, held up for 5 minutes with zero errors. The crash only appeared when camera + inference + display ran together, between 30 seconds and 4 minutes. Upgrading to HailoRT 5.2.0 fixed the standalone infinite-loop crash but did not fix the combined workload.
Current state on HailoRT 5.2.0
Moving to 5.2.0 has changed the picture. Camera + inference without any display now also crashes, which was not the case on 5.1.1 in the same configuration.
To eliminate our code as a factor, I installed hailo-apps-infra 26.3.0 and ran the reference app hailo-detect-simple with a USB camera:
~/hailo-apps-infra/venv_hailo_apps/bin/hailo-detect-simple \
--input usb --arch hailo10h --show-fps --disable-sync
Crash at frame 567, about 30 seconds in, at 19 FPS:
Frame count: 566
INFO | gstreamer.gstreamer_app | FPS measurement: 19.71, drop=0.00, avg=19.32
[HailoRT] [error] Failed to send request, status = HAILO_COMMUNICATION_CLOSED(62)
(Hailo Simple Detection App:14007): ERROR: Failed to run async inference, status = 62
This was pure GStreamer + hailo_apps_infra on HailoRT 5.2.0. No display, no pygame, no custom PROX code. It should be straightforward to reproduce on your side using the same HEF and a USB camera.
Minimal Python async reproducer
I wrote a short standalone script to get precise timing and to vary parameters. It does V4L2 grab, letterbox preprocess, wait_for_async_ready + run_async + job.wait(5000) in a tight loop, and prints the frame count and elapsed time at every crash.
import time, sys
import numpy as np
import cv2
from hailo_platform import VDevice, FormatType

# V4L2 capture: 1280x720 MJPG at 30 fps, matching the GStreamer setup
cap = cv2.VideoCapture(0, cv2.CAP_V4L2)
cap.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*'MJPG'))
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
cap.set(cv2.CAP_PROP_FPS, 30)

v = VDevice()
m = v.create_infer_model('/usr/share/hailo-models/yolov8m_h10.hef')
m.input().set_format_type(FormatType.UINT8)
ctx = m.configure()
cfg = ctx.__enter__()  # keep the configured model alive for the whole loop
ob = {o.name: np.empty(o.shape, dtype=np.float32) for o in m.outputs}
b = cfg.create_bindings(output_buffers=ob)
mh, mw = m.input().shape[:2]

start = time.monotonic(); n = 0
while True:
    ok, frame = cap.read()
    if not ok:
        continue
    # Letterbox to the model input resolution (gray 114 padding)
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    sc = min(mw / rgb.shape[1], mh / rgb.shape[0])
    nh, nw = int(rgb.shape[0] * sc), int(rgb.shape[1] * sc)
    rs = cv2.resize(rgb, (nw, nh))
    pad = np.full((mh, mw, 3), 114, dtype=np.uint8)
    yo, xo = (mh - nh) // 2, (mw - nw) // 2
    pad[yo:yo + nh, xo:xo + nw] = rs
    b.input().set_buffer(pad)
    try:
        cfg.wait_for_async_ready(1000)
        job = cfg.run_async([b])
        job.wait(5000)
    except Exception as e:
        print(f'CRASH frame={n} elapsed={time.monotonic()-start:.1f}s: {e}', flush=True)
        sys.exit(1)
    n += 1
    if n % 30 == 0:
        el = time.monotonic() - start
        print(f'frame={n} elapsed={el:.1f}s fps={n/el:.1f}', flush=True)
Variance study
I wrapped the reproducer in a harness that power-cycles the Pi between trials and records the crash time. Five trials at each of two CMA sizes:
cma=256M: crash at 57.2s, 37.8s, 280.3s, 23.1s, 13.5s. Median 37.8s.
cma=320M: crash at 9.2s, 43.6s, 18.5s, no-crash-in-300s, 23.5s. Median 23.5s.
Within each config the spread is 20x or more. Each batch had one trial that ran for several minutes cleanly, while others crashed in under 20 seconds. Differences between the two CMA sizes are smaller than the within-config variance. The distribution looks bimodal: about 80 percent of runs crash under 60 seconds, the rest run for multiple minutes.
This pattern is consistent with a timing-sensitive race rather than steady resource exhaustion. Raising CMA from 64M (the default, where CmaFree drops to double-digit kB under load) to 320M did not eliminate the crash. gpu_mem=128 did not change the CMA ceiling, which is about 320M on this 2GB board because of how firmware lays out reserved regions. CMA pressure does not appear to be the root cause.
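For completeness, the harness's crash-time extraction is nothing clever: it runs the reproducer as a subprocess and regex-parses its CRASH line. This is a sketch; `reproducer.py` is a placeholder filename, and the power-cycle step between trials (a network-switched PDU in my setup) is hardware-specific and omitted:

```python
import re
import statistics
import subprocess

# The reproducer prints: CRASH frame=<n> elapsed=<t>s: <error>
CRASH_RE = re.compile(r'CRASH frame=(\d+) elapsed=([\d.]+)s')

def parse_crash(line):
    """Return (frame, elapsed_seconds) for a CRASH line, else None."""
    m = CRASH_RE.search(line)
    return (int(m.group(1)), float(m.group(2))) if m else None

def run_trial(timeout_s=300):
    """Run one reproducer trial; return seconds-to-crash, or None if it survives."""
    try:
        p = subprocess.run(['python3', 'reproducer.py'],  # placeholder filename
                           capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None  # counted as no-crash-in-300s
    for line in p.stdout.splitlines():
        hit = parse_crash(line)
        if hit:
            return hit[1]
    return None

if __name__ == '__main__':
    # Between run_trial() calls the harness toggles the PDU outlet and
    # waits for the Pi to boot. Median over the recorded cma=256M trials:
    print(f'median: {statistics.median([57.2, 37.8, 280.3, 23.1, 13.5]):.1f}s')
```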
Two failure signatures
Nine out of ten async trials failed with this signature:
[HailoRT] [error] CHECK failed - Waiting for async job to finish has failed with timeout (5000ms)
[HailoRT] [error] Ioctl HAILO_VDMA_LAUNCH_TRANSFER failed with 5
[HailoRT] [error] CHECK_SUCCESS failed with status=HAILO_DRIVER_OPERATION_FAILED(36)
kernel log at crash:
hailo1x 0001:01:00.0: Timeout waiting for soc control (timeout_ms=1000)
hailo1x 0001:01:00.0: soc_close failed with err=-110
One out of ten trials (and the hailo-detect-simple run) failed with HAILO_COMMUNICATION_CLOSED(62) instead. These appear to be two surface symptoms of the same underlying event, depending on which codepath hits first.
Driver source analysis
I read through the hailort-pcie-driver 5.2.0 DKMS source to localise the error paths.
HAILO_VDMA_LAUNCH_TRANSFER errno=5 (EIO) originates in validate_channel_state in common/vdma_common.c:
if (channel->state.num_avail != hw_num_avail) {
pr_err("Channel %d num-available HW mismatch (%ud!=%ud)\n", ...);
return -EIO;
}
if (channel->state.num_avail != desc_list->num_launched) {
pr_err("Channel %d num-available desc-list mismatch (%ud!=%ud)\n", ...);
return -EIO;
}
Either the driver’s software num_avail counter diverges from the hardware register, or two internal software counters diverge from each other. I have not yet captured these pr_err lines in dmesg during a crash.
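To catch those pr_err lines if they do fire, I will run a kernel-log watcher alongside the reproducer on the next round. A minimal sketch using `dmesg --follow`; the match substrings are taken from the driver source quoted above:

```python
import subprocess
import sys

# Substrings of the driver's error messages, taken from common/vdma_common.c
# and linux/pcie/src/soc.c in the 5.2.0 DKMS source
FAULT_PATTERNS = (
    'num-available HW mismatch',
    'num-available desc-list mismatch',
    'Timeout waiting for soc control',
    'soc_close failed',
)

def is_hailo_fault(line):
    """True if a kernel log line matches one of the known Hailo fault messages."""
    return any(p in line for p in FAULT_PATTERNS)

def watch():
    """Stream the kernel log and echo only Hailo fault lines (run as root)."""
    with subprocess.Popen(['dmesg', '--follow'],
                          stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            if is_hailo_fault(line):
                print(line, end='', flush=True)

if __name__ == '__main__' and '--watch' in sys.argv:
    watch()
```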
Timeout waiting for soc control comes from linux/pcie/src/soc.c:66 where wait_for_completion_timeout expires after 1 second. At that point the SOC firmware has stopped responding on the PCIe control channel. soc_close then also fails, so the firmware is not processing control messages at teardown either.
The sequence that fits all observations:
- SOC firmware stops servicing the inference channel (no completion IRQ fires for the current transfer).
- job.wait(5000) in hailort times out.
- Subsequent VDMA ioctls fail with EIO because descriptor state has drifted from the hardware.
- The Python VDevice destructor calls soc_close, which also times out since the firmware is unresponsive.
- pyhailort surfaces either HAILO_DRIVER_OPERATION_FAILED or HAILO_COMMUNICATION_CLOSED depending on which ioctl caught the firmware stall first.
This points to a firmware-side root cause. The host driver and HailoRT are correctly detecting that the firmware has gone unresponsive. The high variance in crash time is consistent with a race or an external-event-triggered firmware fault that only fires under specific timing conditions.
Other observations
Firmware boot lottery: even on a clean power cycle, the SOC firmware load fails roughly 40 to 50 percent of the time with Failed writing SOC firmware on stage 2 or stage 3, followed by Firmware load failed and Failed activating board -110. This is independent of the runtime crash but may share a common cause, since both involve the SOC side going unresponsive. The EEPROM is up to date; cold-power delays of up to 45 seconds help but do not eliminate the lottery.
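To put a number on the lottery I classify each boot programmatically; a small sketch that checks the dmesg text for the failure strings quoted above:

```python
import subprocess

# Failure strings observed in dmesg on a bad boot (quoted from my logs)
BOOT_FAILURE_PATTERNS = (
    'Failed writing SOC firmware',
    'Firmware load failed',
    'Failed activating board',
)

def boot_failed(dmesg_text):
    """True if this boot hit the SOC firmware-load failure."""
    return any(p in dmesg_text for p in BOOT_FAILURE_PATTERNS)

if __name__ == '__main__':
    try:
        text = subprocess.run(['dmesg'], capture_output=True, text=True).stdout
    except FileNotFoundError:
        text = ''  # dmesg unavailable on this host
    print('BOOT FAILED' if boot_failed(text) else 'boot ok')
```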
Sometimes after a crash, rmmod hailo1x_pci && modprobe hailo1x_pci fails with "Probing: Failed reading device BARs, device may be disconnected", requiring a full mains power cycle of the Pi.
Questions
- Is there a verbose driver or HailoRT logging mode that would surface the descriptor state and any prior warnings before the wedge? Capturing the pr_err from validate_channel_state and the inference-side trace would help localise further.
- Does the SOC firmware emit any log that can still be read over PCIe after the control channel stops responding? An HRT or SCU log dump path that works post-wedge would be valuable.
- Is there a firmware watchdog that could auto-reset the SoC without a full power cycle? Currently my only recovery path after a wedge is mains power off and on, which also trips the firmware-load boot lottery roughly half the time.
- Any targeted tests you want me to run? The reproducer above is short, and I can add instrumentation (capture dmesg at crash, try a CSI camera to change the DMA path, run a smaller model like yolov8n, add kernel-level tracing for VDMA ioctls), or vary API paths as you suggest.
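For the dmesg-at-crash item, the change to the reproducer is small; a sketch (the output path is arbitrary), called from the except block right before sys.exit(1):

```python
import subprocess
import time

def dump_dmesg(tag='crash'):
    """Snapshot the kernel log to a timestamped file to attach to the report."""
    try:
        out = subprocess.run(['dmesg', '-T'], capture_output=True,
                             text=True).stdout
    except FileNotFoundError:
        out = ''  # dmesg unavailable; still write the (empty) snapshot
    path = f'/tmp/hailo-{tag}-{int(time.time())}.dmesg'
    with open(path, 'w') as f:
        f.write(out)
    return path
```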
Thanks,
Sam