Hailo-10H throughput degrades irreversibly within minutes of continuous use (125 → 41 fps), only host reboot recovers

RGaufman · April 30, 2026, 6:51am

A Hailo-10H module on a Raspberry Pi 5 with the official AI Hat 2 starts at the
published 125 fps (yolo26s_hailo10h.hef, hailortcli benchmark) immediately
after host reboot, but throughput degrades monotonically and irreversibly
during continuous use. After ~3.4 minutes of back-to-back benchmarks (12 runs
× 15 s), fps starts dropping at ~3 fps/run, reaching ~82 fps at run 25. With
extended use we have observed it drop as low as 41 fps. The chip never
recovers within the same boot: not after cooldown, driver reload, PCIe
rescan, or `hailortcli fw-control reset --reset-type chip`.
Only a full host reboot restores the published 125 fps.

The degradation is accompanied by
`cma_alloc: linux,cma: alloc failed, req-size: 256 pages, ret: -16` (-EBUSY)
errors in the chip’s onboard Yocto runtime log, suggesting CMA exhaustion or
fragmentation inside the chip-side Linux.

Affected device

Field	Value
Chip	Hailo-10H
Carrier	Raspberry Pi 5 + official AI Hat 2
Active cooling	Yes (official Pi 5 active cooler)
Chip serial	`FBFBBDDFC4DEF768D6CEF1F9`
SoC ID	`8CD33CE2AA4ABFFE8EAD7E1FCF27CEC6D819C55DADEFEB9FF8C9184DAE93ECCB`
Board SKU-ID	6
LCS	5

Software versions

Component	Version
HailoRT-CLI	5.3.0
Firmware	5.3.0 (release, app)
Driver	`hailo1x_pci` 5.3.0 (`srcversion 91195A5A35A8DAAA5717B7D`)
Host kernel	`6.12.75+rpt-rpi-2712` (Debian 13 Trixie)
Chip-side OS	`Linux 5.15.325.15.32-yocto-standard-01685-g4cd9cfd0e6e7`
HEF	`yolo26s_hailo10h.hef`, md5 `07f0dd9d2f44834f75c123af243e7ec3`, 13,815,808 bytes

PCIe state

LnkCap:  Speed 8GT/s, Width x4
LnkSta:  Speed 8GT/s, Width x1 (downgraded — RPi 5 M.2 slot limitation)
DevCtl:  MaxPayload 256 bytes, MaxReadReq 512 bytes

The Hailo-10H reaches its published 125 fps over this same x1 link when fresh,
so PCIe x1 bandwidth is not the bottleneck.
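
A rough back-of-the-envelope check points the same way. The sketch below assumes a 640×640×3 uint8 input tensor, which is a common YOLO input size and is not confirmed for this HEF. It counts input traffic only and ignores output tensors and PCIe protocol overhead.

```python
# Rough PCIe headroom estimate.
# ASSUMPTION: 640x640x3 uint8 input frames (typical for YOLO models);
# the real input shape of yolo26s_hailo10h.hef is not stated in this thread.

GT_PER_S = 8e9          # LnkSta: 8 GT/s per lane (PCIe Gen3 signalling rate)
ENCODING = 128 / 130    # Gen3 uses 128b/130b line encoding
LANES = 1               # LnkSta: Width x1


def link_bytes_per_s(gt_per_s=GT_PER_S, lanes=LANES, encoding=ENCODING):
    """Line rate in bytes/s, before TLP/DLLP protocol overhead."""
    return gt_per_s * encoding * lanes / 8


def input_bytes_per_s(fps, width=640, height=640, channels=3):
    """Host-to-device input traffic for uint8 frames of the assumed shape."""
    return fps * width * height * channels


link = link_bytes_per_s()
traffic = input_bytes_per_s(125)
utilisation = traffic / link
print(f"link ~{link / 1e6:.0f} MB/s, input ~{traffic / 1e6:.0f} MB/s, "
      f"utilisation ~{utilisation:.0%}")
```

Under these assumptions the input stream at 125 fps uses well under a quarter of the x1 line rate, which is consistent with the fresh-boot measurement.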

Reproduction

sudo systemctl stop tetherbox tetherbox-ai     # stop our app, no other Hailo clients
sudo reboot
# wait for boot, then immediately:
for i in $(seq 1 25); do
  sudo hailortcli benchmark /path/to/yolo26s_hailo10h.hef
done
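
To record one fps value per run instead of reading the console by eye, the loop can be wrapped in a small script. This is a minimal sketch: the `FPS: <number>` pattern is an assumption about what the benchmark prints, and `run_once` / `run_series` are illustrative helper names, not part of HailoRT.

```python
import re
import subprocess

# ASSUMPTION: the benchmark prints a line containing "FPS: <number>".
# Adjust the pattern to whatever your hailortcli version actually prints.
FPS_PATTERN = re.compile(r"FPS:\s*([0-9]+(?:\.[0-9]+)?)")


def run_once(cmd, pattern=FPS_PATTERN):
    """Run one benchmark command; return the last FPS value printed, or None."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture both streams in one text blob
        text=True,
        check=False,
    )
    matches = pattern.findall(result.stdout)
    return float(matches[-1]) if matches else None


def run_series(cmd, runs):
    """Run the command back to back (no sleep) and collect one FPS per run."""
    return [run_once(cmd) for _ in range(runs)]
```

Calling `run_series(["sudo", "hailortcli", "benchmark", "/path/to/yolo26s_hailo10h.hef"], 25)` would mirror the shell loop above and return a list of 25 values (or `None` for any run whose output did not match the pattern).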

Measured data — degradation curve (single boot, no other workload)

=== H10 degradation repro: 25 runs at default 15s, no sleep between ===
started at: Thu 30 Apr 05:08:01 BST 2026
run  1: fps=124.78 temp=60.70
run  2: fps=124.76 temp=62.82
run  3: fps=124.90 temp=64.36
run  4: fps=124.81 temp=65.70
run  5: fps=124.89 temp=66.90
run  6: fps=124.81 temp=67.96
run  7: fps=124.80 temp=68.94
run  8: fps=124.89 temp=69.83
run  9: fps=124.99 temp=70.69
run 10: fps=124.81 temp=71.29
run 11: fps=124.89 temp=71.96
run 12: fps=124.87 temp=72.60       ← last stable run
run 13: fps=122.80 temp=73.13       ← onset of degradation
run 14: fps=121.38 temp=73.61
run 15: fps=118.53 temp=73.96
run 16: fps=108.49 temp=73.99       ← temperature plateaus here
run 17: fps=101.51 temp=73.88
run 18: fps=98.34 temp=73.82
run 19: fps=96.64 temp=73.87
run 20: fps=94.04 temp=73.90
run 21: fps=91.14 temp=73.82
run 22: fps=88.55 temp=73.68
run 23: fps=85.65 temp=73.58
run 24: fps=83.99 temp=73.53
run 25: fps=81.60 temp=73.43
done at: Thu 30 Apr 05:14:42 BST 2026

Key observation: after run 16, chip temperature plateaus at 73-74 °C and
then falls slightly, while throughput continues to drop monotonically.
Once thermal steady state is reached, throughput keeps falling even though
temperature is no longer rising.
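
The onset run and the plateau behaviour can be read straight out of the log above. The following is a sketch that parses the `run N: fps=… temp=…` lines; the baseline window (first 5 runs) and the 1 fps tolerance are arbitrary illustrative thresholds, not values from Hailo.

```python
import re

LOG = """\
run  1: fps=124.78 temp=60.70
run  2: fps=124.76 temp=62.82
run  3: fps=124.90 temp=64.36
run  4: fps=124.81 temp=65.70
run  5: fps=124.89 temp=66.90
run  6: fps=124.81 temp=67.96
run  7: fps=124.80 temp=68.94
run  8: fps=124.89 temp=69.83
run  9: fps=124.99 temp=70.69
run 10: fps=124.81 temp=71.29
run 11: fps=124.89 temp=71.96
run 12: fps=124.87 temp=72.60
run 13: fps=122.80 temp=73.13
run 14: fps=121.38 temp=73.61
run 15: fps=118.53 temp=73.96
run 16: fps=108.49 temp=73.99
run 17: fps=101.51 temp=73.88
run 18: fps=98.34 temp=73.82
run 19: fps=96.64 temp=73.87
run 20: fps=94.04 temp=73.90
run 21: fps=91.14 temp=73.82
run 22: fps=88.55 temp=73.68
run 23: fps=85.65 temp=73.58
run 24: fps=83.99 temp=73.53
run 25: fps=81.60 temp=73.43
"""

RUN_RE = re.compile(r"run\s+(\d+): fps=([\d.]+) temp=([\d.]+)")


def parse_runs(text):
    """Return a list of (run, fps, temp) tuples from the repro log."""
    return [(int(n), float(f), float(t)) for n, f, t in RUN_RE.findall(text)]


def onset(runs, baseline_n=5, tol=1.0):
    """First run whose fps is more than `tol` below the early-run mean."""
    baseline = sum(fps for _, fps, _ in runs[:baseline_n]) / baseline_n
    for n, fps, _ in runs:
        if fps < baseline - tol:
            return n
    return None


def plateau_stats(runs, first_run):
    """(fps drop, temperature span) from `first_run` to the last run."""
    tail = [(fps, temp) for n, fps, temp in runs if n >= first_run]
    fps_values = [fps for fps, _ in tail]
    temps = [temp for _, temp in tail]
    return fps_values[0] - fps_values[-1], max(temps) - min(temps)


runs = parse_runs(LOG)
drop, span = plateau_stats(runs, 16)
print("onset run:", onset(runs))
print(f"runs 16-25: fps fell by {drop:.2f}, temperature varied by {span:.2f} °C")
```

On this data the onset lands on run 13, and from run 16 onwards fps falls by about 27 while temperature stays within roughly half a degree.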

In a separate session with extended use the chip degraded as far as 41 fps
(33 % of published) and stayed there until reboot.

Falsified theories — what the cause is NOT

All of the following were measured on this device and ruled out:

Theory	Evidence against
Thermal throttling	Chip plateaus at 73-74 °C from run 16 onwards; throughput continues to drop. After 4 min idle the chip cools, but throughput stays degraded (see cooldown experiment below).
Host RPi 5 SoC throttling	`vcgencmd get_throttled = 0x0`
PCIe ASPM L1 / D3hot sleep	`policy=performance`, mid-bench device state confirmed `D0`
PCIe x1 link cap	Same x1 link delivers 125 fps when chip is fresh
Chip warming up (cache cold)	Runs 1-12 at 124-125 fps with rising temp 60→72 °C; degradation begins after warmup
`--power-mode ultra_performance`	Same degraded throughput
`--batch-size 8`	Same degraded throughput
`--time-to-run 60`	Same degraded throughput
HailoRT 4.x driver issues	Reproduces on HailoRT 5.3.0 + driver 5.3.0 (latest), Debian 13 Trixie (the officially supported OS for HailoRT 4.23+)
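
For readers unfamiliar with `vcgencmd get_throttled`, the value is a bit field; `0x0` means no under-voltage, frequency capping, or throttling is active or has occurred since boot. A small decoder sketch follows, with bit meanings taken from the Raspberry Pi documentation; `decode_throttled` is an illustrative helper name.

```python
# Bit meanings of `vcgencmd get_throttled` as documented by Raspberry Pi.
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "arm frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "arm frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(value):
    """Decode 'throttled=0x50005' or a bare '0x50005' into a list of flags."""
    raw = value.strip().split("=")[-1]
    bits = int(raw, 16)
    return [name for bit, name in THROTTLE_FLAGS.items() if bits & (1 << bit)]


print(decode_throttled("throttled=0x0"))
```

An empty list corresponds to the `0x0` reading reported in the table.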

Smoking gun — chip-side runtime log

`sudo hailortcli logs runtime` returns logs from the chip’s onboard Yocto Linux
(hostname `hailo10`, kernel 5.15.325-yocto). Every benchmark invocation produces:

hailo10 user.info HailoRT-Server: [device.cpp:54] [Device] OS Version: Linux 5.15.325.15.32-yocto-standard-01685-g4cd9cfd0e6e7 #1 SMP PREEMPT Wed Feb 18 10:02:32 UTC 2026 aarch64
hailo10 user.info HailoRT-Server: [control.cpp:89] [control__parse_identify_results] firmware_version is: 5.3.0
hailo10 kern.err kernel: cma: cma_alloc: linux,cma: alloc failed, req-size: 256 pages, ret: -16
hailo10 user.info HailoRT-Server: [vdevice.cpp:474] [configure] Configuring HEF on VDevice took 45.79156 milliseconds

`ret: -16` is -EBUSY. Each `hailortcli benchmark` invocation does an
open → configure → run → close cycle. Every configure attempts a 256-page
contiguous DMA buffer allocation (1 MiB, assuming 4 KiB pages), and these
allocations fail. The runtime appears to fall back to non-contiguous
allocations (so inference continues) but seems to build up state that
throttles throughput as the chip-side CMA region fragments further.
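
The numbers in that kernel line can be decoded mechanically. This is a minimal sketch: the 4 KiB page size is an assumption about the chip-side kernel, and the errno table only lists the Linux values relevant to this thread.

```python
import re

# Linux errno values (asm-generic numbering) that appear in this report.
LINUX_ERRNO = {12: "ENOMEM", 16: "EBUSY", 110: "ETIMEDOUT"}

# ASSUMPTION: the chip-side kernel uses 4 KiB pages.
PAGE_SIZE = 4096

CMA_RE = re.compile(
    r"cma_alloc: (\S+): alloc failed, req-size: (\d+) pages, ret: (-?\d+)"
)


def decode_cma_failure(line, page_size=PAGE_SIZE):
    """Extract region, request size in bytes, and errno name from a CMA error."""
    match = CMA_RE.search(line)
    if match is None:
        return None
    region, pages, ret = match.group(1), int(match.group(2)), int(match.group(3))
    return {
        "region": region,
        "bytes": pages * page_size,
        "errno": LINUX_ERRNO.get(-ret, f"errno {-ret}"),
    }


line = ("hailo10 kern.err kernel: cma: cma_alloc: linux,cma: alloc failed, "
        "req-size: 256 pages, ret: -16")
print(decode_cma_failure(line))
```

For the line from the runtime log this yields the `linux,cma` region, a 1 MiB request, and `EBUSY`.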

The trigger correlates with the number of configure cycles, not just
inference time — fresh boot to first degradation = 12 benchmark invocations
(every hailortcli benchmark is a fresh open + configure).
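
One way to test this correlation is to count CMA failures per configure cycle in the runtime log and see whether the count grows over the session. The sketch below matches the two message strings from the excerpt above. It assumes the failure line is logged before the corresponding "Configuring HEF" line, which is based on that single excerpt only.

```python
def failures_per_configure(log_text):
    """Return the number of cma_alloc failures seen before each configure line."""
    counts = []
    pending = 0
    for line in log_text.splitlines():
        if "cma_alloc" in line and "alloc failed" in line:
            pending += 1
        elif "Configuring HEF on VDevice took" in line:
            counts.append(pending)
            pending = 0
    return counts


FAIL = ("hailo10 kern.err kernel: cma: cma_alloc: linux,cma: alloc failed, "
        "req-size: 256 pages, ret: -16")
CONFIGURE = ("hailo10 user.info HailoRT-Server: [vdevice.cpp:474] [configure] "
             "Configuring HEF on VDevice took 45.79156 milliseconds")

# Synthetic two-cycle example built from the excerpt lines, for illustration.
sample = "\n".join([FAIL, CONFIGURE, FAIL, FAIL, CONFIGURE])
print(failures_per_configure(sample))
```

Feeding it the output of `sudo hailortcli logs runtime` captured after each benchmark would show whether failures per cycle stay constant or increase as throughput drops.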

Recovery experiments — what does NOT recover throughput

All of the following were tried on the degraded device with no recovery:

Action	Result	Evidence
Wait for cooldown (4 min idle)	No recovery	See cooldown experiment below
`sudo modprobe -r hailo1x_pci && sudo modprobe hailo1x_pci`	No recovery	dmesg: `SOC Firmware batch was already loaded`, `Firmware loaded in 0 ms` — chip retains state
`echo 1 > /sys/bus/pci/devices/0001:01:00.0/remove` + `echo 1 > /sys/bus/pci/rescan`	No recovery	Same `Firmware loaded in 0 ms`
`hailortcli fw-control reset --reset-type chip`	Bricks the chip until host reboot	`Timeout waiting for vDMA boot data completion`, `Failed writing firmware files over vDMA. err -110`, `Failed activating board -110`

`hailortcli fw-control reset --reset-type chip` — separate bug

The chip reset disconnects the device successfully, but the driver then
fails to re-upload firmware over vDMA with error -110 (-ETIMEDOUT). Full dmesg:

hailo1x 0001:01:00.0: Firmware batch programming completed for stage 3
hailo1x 0001:01:00.0: Timeout waiting for vDMA boot data completion
hailo1x 0001:01:00.0: Failed writing firmware files over vDMA. err -110
hailo1x 0001:01:00.0: Failed writing SOC firmware on stage 3
hailo1x 0001:01:00.0: SCU log could not be read from device
hailo1x 0001:01:00.0: Firmware load failed
hailo1x 0001:01:00.0: Failed activating board -110
hailo1x 0001:01:00.0: probe with driver hailo1x failed with error -110

The device is unrecoverable in this state until a full host reboot.

Cooldown experiment — proves degradation is not thermal

After the 25-run degradation test (chip at ~52 fps, temp ~58 °C):

[FILLED FROM /tmp/h10_cooldown.log AFTER COMPLETION]

Recovery — only host reboot restores throughput

After sudo reboot, the very first benchmark returns to the published
performance and stays stable for at least 12 runs:

=== uptime ===
 04:59:52 up 1 min,  2 users,  load average: 0.79, 0.48, 0.19
=== fresh boot bench 1 ===  yolo26s: FPS: 124.90  temp mean=59.73
=== fresh boot bench 2 ===  yolo26s: FPS: 125.01  temp mean=61.34
=== fresh boot bench 3 ===  yolo26s: FPS: 124.79  temp mean=62.38
=== fresh boot bench 4 ===  yolo26s: FPS: 124.30  temp mean=63.26
=== fresh boot bench 5 ===  yolo26s: FPS: 124.64  temp mean=64.16
=== fresh boot bench 6 ===  yolo26s: FPS: 124.90  temp mean=64.93

Reboot is the only known recovery path. We have observed the same chip in the
same boot deliver everything from 41 fps to 125 fps depending on cumulative
use.

Asks

1. Identify and fix the chip-side CMA leak. Each open/configure cycle
   appears to leak (or fragment) the chip’s contiguous-memory pool. After
   ~12 cycles the leak is large enough to throttle inference throughput
   monotonically. Cooldown does not free the leaked memory; only a full chip
   power cycle does.
2. Provide a working in-band recovery path that does not require a host
   reboot. `hailortcli fw-control reset --reset-type chip` looks like the
   intended path but currently leaves the device in an unrecoverable state
   (vDMA boot timeout err -110). Either fix the post-reset firmware
   re-upload, or document an alternative supported procedure.
3. Document any chip-side configuration knobs for the onboard CMA region
   size that we could tune from the host (e.g., via the SCU firmware files
   uploaded on probe). If none exist, please consider adding them.

Reproducer scripts

The exact reproducer is included verbatim above (for i in $(seq 1 25)). The
chip-side runtime log evidence is captured via sudo hailortcli logs runtime.

Happy to run any further diagnostics Hailo would like. We have permanent
access to the device and can capture additional logs (full dmesg, SCU logs,
runtime logs at any point in the degradation curve) on request.

Michael · April 30, 2026, 2:47pm

Hi @user281,

Thanks for sharing - we will look into it.

Thanks,
Michael.

RGaufman · May 7, 2026, 7:22am

Please keep me posted. We currently have to reboot after a few runs of our test suite, which slows down testing and iteration.

Michael · May 7, 2026, 10:05am

Hi @user281,

We are still investigating. I’ll keep you updated here.

Michael · May 25, 2026, 11:21am

Hi @user281,

Unfortunately we can’t reproduce this - we just see the expected degradation to ~111 FPS around thermal throttling.

We would suggest cleaning all Hailo installations and reinstalling the entire runtime SW suite: Raspberry Trixie Error with Guide (Pi 5, AI Hat 2) - #18 by Michael

Thanks,
Michael.

RGaufman · June 14, 2026, 11:04pm

Hi Michael,

Thank you for your response. If I’m understanding correctly, the guide in that post would downgrade from 5.3.0 to 5.1.1. Is this the official recommendation?

RGaufman · June 14, 2026, 11:17pm

OK, 5.1.1 does NOT reproduce the degradation. Flat at 124.6 fps across 25 runs. The problem only appears when upgrading to 5.3.0. Is there a fix that lets us stay on 5.3.0, or is the recommendation to use 5.1.1?
