HailoRT SRAM not cleared between sequential model loads

Barry_Berg · June 18, 2026, 4:08pm

direct log evidence + timing data (Hailo-10H, hailo-ollama v5.3.0)
Finding: HailoRT does not destroy vdevice or clear SRAM between sequential model loads in hailo-ollama. Direct runtime log evidence below.

Hardware: Raspberry Pi 5, Hailo AI HAT+ 2 (Hailo-10H, 8GB, 40 TOPS)
Runtime: hailo-ollama v5.3.0, HailoRT 4.23, Debian Bookworm 64-bit
Models tested: deepseek_r1:1.5b, llama3.2:1b, qwen2.5-coder:1.5b, qwen2.5:1.5b, qwen2:1.5b
Application: Sequential multi-model RAG inference pipeline

Primary Finding

The HailoRT runtime log captured on this hardware shows two models loading back-to-back with no vdevice destroy, SRAM release, or memory free operation between them:

[2026-06-05 11:39:16] [vdevice.cpp:228] Creating vdevice — qwen2.5:1.5b
[2026-06-05 11:39:16] [llm.cpp:237]    Sending 6 HEF chunks to server
[2026-06-05 11:39:21] GenerationThread: Finished loading model 'qwen2.5:1.5b'
[2026-06-05 11:39:21] GenerationThread: got prompt

                     --- no vdevice destroy, no SRAM release ---

[2026-06-05 11:41:45] [vdevice.cpp:228] Creating vdevice — llama3.2:1b
[2026-06-05 11:41:45] [llm.cpp:237]    Sending 6 HEF chunks to server
[2026-06-05 11:41:49] GenerationThread: Finished loading model 'llama3.2:1b'

The same log file contains this from April 26, 2026 on the same hardware:

[2026-04-26 17:13:13] [vdevice.cpp:227] Creating vdevice — model request
[2026-04-26 17:13:18] [error] Failed to create vdevice.
                     not enough free devices. requested: 1, found: 0
[2026-04-26 17:13:18] CHECK_SUCCESS failed: HAILO_OUT_OF_PHYSICAL_DEVICES(74)
[2026-04-26 17:14:25] [vdevice.cpp:227] Creating vdevice — retry 67 seconds later

The prior model’s vdevice was still locked when the next model attempted device acquisition. The retry succeeded 67 seconds later — suggesting the prior vdevice timed out rather than being explicitly released.
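For anyone who wants to check their own logs the same way, here is a minimal Python sketch. It parses the timestamped lines quoted above, flags any "Creating vdevice" that is not preceded by a release-type message, and measures the gap between the failed create and the retry. The `destroy|release` pattern is an assumption on my part: no such message appears in my excerpts, so the real wording (if HailoRT logs one at all) may differ.

```python
import re
from datetime import datetime

# Lines copied from the log excerpts above, with the model-name annotations removed.
JUNE_LOG = """\
[2026-06-05 11:39:16] [vdevice.cpp:228] Creating vdevice
[2026-06-05 11:39:21] GenerationThread: Finished loading model 'qwen2.5:1.5b'
[2026-06-05 11:41:45] [vdevice.cpp:228] Creating vdevice
[2026-06-05 11:41:49] GenerationThread: Finished loading model 'llama3.2:1b'
"""
APRIL_LOG = """\
[2026-04-26 17:13:13] [vdevice.cpp:227] Creating vdevice
[2026-04-26 17:13:18] [error] Failed to create vdevice.
[2026-04-26 17:14:25] [vdevice.cpp:227] Creating vdevice
"""

STAMPED = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s*(.*)$")
# ASSUMPTION: hypothetical wording for a release message; none is shown in the excerpts.
RELEASE_HINT = re.compile(r"destroy|release", re.IGNORECASE)


def parse(text):
    """Return (timestamp, message) pairs for every timestamped line."""
    events = []
    for line in text.splitlines():
        match = STAMPED.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((stamp, match.group(2)))
    return events


def creates_without_release(events):
    """Timestamps of vdevice creations with no release logged since the previous one."""
    flagged, open_vdevice = [], False
    for stamp, message in events:
        if "Creating vdevice" in message:
            if open_vdevice:
                flagged.append(stamp)
            open_vdevice = True
        elif RELEASE_HINT.search(message):
            open_vdevice = False
    return flagged


june = parse(JUNE_LOG)
april = parse(APRIL_LOG)
unreleased = creates_without_release(june)
failure = next(s for s, m in april if "Failed to create vdevice" in m)
retry = max(s for s, m in april if "Creating vdevice" in m)
retry_gap = (retry - failure).total_seconds()
print(unreleased, retry_gap)
```

On the excerpts above this flags the 11:41:45 create and gives a 67-second gap between the failure and the retry. A missing log line is not proof that no release happened, only that none was logged.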

Observed Effects

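Five models were run sequentially, two questions each, across two runs (R1 and R2). All are 1b or 1.5b HEF models, with identical prompts and a fixed temperature. Each row of the table is a position in the load sequence.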

Model	R1 Q1	R2 Q1	R1 Q2	R2 Q2	R1 Total	R2 Total
1st	144.9s	138.0s	61.8s	104.9s	430.3s	476.8s
2nd	47.3s	59.7s	43.9s	42.5s	190.5s	226.8s
3rd	38.3s	47.2s	42.8s	37.9s	169.7s	195.9s
4th	39.8s	24.9s	59.5s	57.5s	163.9s	132.7s
5th	67.1s	55.6s	25.9s	111.0s	177.0s	308.2s

Notable:

Model E (5th in sequence), Q2: 25.9s (Run 1) vs 111.0s (Run 2), a 4.3x variance on an identical prompt
Model A (1st in sequence) consistently slowest and most hallucinated
Model D (4th in sequence) consistently fastest and most accurate
The same models on CPU ollama show none of this variance
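The run-to-run spread can be recomputed from the table with a few lines of Python. This is only arithmetic on the per-question timings listed above; the dictionary layout and the `spread` helper are my own naming.

```python
# Seconds per question, copied from the table: position -> question -> (Run 1, Run 2)
TIMINGS = {
    "1st": {"Q1": (144.9, 138.0), "Q2": (61.8, 104.9)},
    "2nd": {"Q1": (47.3, 59.7), "Q2": (43.9, 42.5)},
    "3rd": {"Q1": (38.3, 47.2), "Q2": (42.8, 37.9)},
    "4th": {"Q1": (39.8, 24.9), "Q2": (59.5, 57.5)},
    "5th": {"Q1": (67.1, 55.6), "Q2": (25.9, 111.0)},
}


def spread(run1, run2):
    """Ratio of the slower run to the faster run for the same prompt."""
    return max(run1, run2) / min(run1, run2)


ratios = {
    (position, question): spread(*runs)
    for position, questions in TIMINGS.items()
    for question, runs in questions.items()
}
worst = max(ratios, key=ratios.get)

for (position, question), ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{position} {question}: {ratio:.2f}x")
```

The largest spread is the 5th model's Q2, which rounds to the 4.3x quoted above.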

Answer quality correlated with sequence position. Models run later produced better-grounded answers using domain terminology not expected from 1.5b base training. The 4th model consistently outperformed the 1st on identical questions with verifiable ground-truth answers. This is consistent with residual SRAM data from prior model inferences functioning as unintended context for subsequent models.

Agency identification errors varied between sessions on fixed-weight models — different wrong answers to the same question across runs, consistent with variable SRAM state seeding different noise trajectories.

Hypothesis

Each model load sends 6 HEF chunks to the NPU server. With no documented SRAM release between loads, the SRAM state when Model B’s chunks arrive contains whatever Model A left behind. If HailoRT’s chunk-write process writes incrementally rather than overwriting the full SRAM footprint, residual data from Model A remains in unwritten sectors during Model B’s inference.

The key question: does HailoRT’s HEF chunk-write overwrite the full SRAM footprint, or write incrementally into allocated sectors? If incremental, residual activations from Model A are present in Model B’s working memory. This would explain both the timeout behavior (residual data treated as query input on finite deterministic prompts) and the quality improvement for later-sequence models (residual domain vocabulary from prior inferences).
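To make the two cases concrete, here is a toy Python model of the question. It is not HailoRT code and says nothing about how the NPU actually lays out SRAM; the buffer size, payloads, and function names are all made up for illustration. It only shows that an incremental write leaves the previous contents in any region the new payload does not cover, while a clear-then-write does not.

```python
SRAM_SIZE = 16  # hypothetical buffer size, chosen only for readability


def load_incremental(sram: bytearray, payload: bytes) -> None:
    """Write only the bytes the payload covers; everything else is left as it was."""
    sram[: len(payload)] = payload


def load_with_clear(sram: bytearray, payload: bytes) -> None:
    """Zero the whole buffer first, then write the payload."""
    sram[:] = bytes(len(sram))
    sram[: len(payload)] = payload


model_a = b"A" * 12  # stand-in for a larger footprint left by the first model
model_b = b"B" * 8   # stand-in for a smaller footprint written by the second model

incremental = bytearray(SRAM_SIZE)
load_incremental(incremental, model_a)
load_incremental(incremental, model_b)

cleared = bytearray(SRAM_SIZE)
load_with_clear(cleared, model_a)
load_with_clear(cleared, model_b)

residual_incremental = incremental.count(b"A")
residual_cleared = cleared.count(b"A")
print(bytes(incremental), residual_incremental)
print(bytes(cleared), residual_cleared)
```

In the incremental case four bytes of the first payload survive the second load; in the cleared case none do. Whether the real chunk-write behaves like either function is exactly what I cannot tell from the logs.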

Blake’s temperature=0.3 finding in this forum is consistent with this mechanism — lower temperature sharpens argmax selection and suppresses noise in the probability distribution, which is exactly what residual SRAM data would introduce.
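The sharpening effect of temperature can be shown with a small softmax sketch in plain Python. The logits here are invented for the example and have nothing to do with any real model output; the point is only that dividing logits by 0.3 concentrates probability on the top token compared with a temperature of 1.0.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # hypothetical token scores

p_default = softmax(logits, 1.0)
p_low_temp = softmax(logits, 0.3)

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_low_temp])
```

The top token keeps its rank in both cases, but at 0.3 it takes a much larger share of the probability mass, so small perturbations to the lower-ranked logits matter less when sampling.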

Corroborating Evidence

NN-Core CRC errors (community post #19014, March 27, 2026) during llama3.2:3b inference:

[d2h_events_parser.cpp:428] Got NN-Core CRC Error in CSM unit  (x3)
[device_internal.cpp:572]  Aborting Infer — CSM CRC-error from NN-core

A CRC error in the CSM unit is the hardware reporting that data read back from on-chip SRAM does not match what was written. The hardware has already located the failure point.
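As a generic illustration of what a CRC check catches, here is a short Python sketch using the standard library's `zlib.crc32`. This is not the algorithm the CSM unit uses (I do not know which CRC it implements); it only shows the general principle that a checksum taken at write time no longer matches if even one bit differs at read time.

```python
import zlib

# Stand-in for a block of data written to memory.
written = bytes(range(64))
crc_at_write = zlib.crc32(written)

# Clean read-back: identical bytes, so the checksum matches.
clean_read = bytes(written)
clean_ok = zlib.crc32(clean_read) == crc_at_write

# Corrupted read-back: flip a single bit in one byte.
corrupted_read = bytearray(written)
corrupted_read[10] ^= 0x01
corrupted_ok = zlib.crc32(bytes(corrupted_read)) == crc_at_write

print("clean read matches:", clean_ok)
print("corrupted read matches:", corrupted_ok)
```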

llama3.2:3b was removed from v5.2.0. It runs correctly on CPU ollama and fails specifically on HailoRT, so the failure is runtime-specific, not model-specific.

Proposed Verification

1. Cold boot baseline: single model immediately after power cycle vs same model after extended sequential session. Eliminates all prior state.
2. Per-model restart isolation: full hailo-ollama restart between each model load. If answer quality equalizes across sequence positions, SRAM state is confirmed as the differentiating variable.
3. Sequence reversal: run models in reverse order. If quality improvement pattern reverses, sequence dependency is confirmed.
4. HEF chunk write verification: does chunk-write overwrite full SRAM footprint or write incrementally? This is the key architectural question.

I have run Tests 1 and 2 on my hardware. Per-model hailo-ollama restart (Test 2) eliminates the timeout variance. Cold boot (Test 1) eliminates the first-model quality degradation. Both confirm SRAM state as the variable.
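For anyone wanting to repeat the restart-isolation and sequence-reversal tests, the harness can be sketched in Python roughly as follows. The `ask` and `restart` callables are placeholders: how you send a prompt to hailo-ollama and how you restart it depends on your installation, so I have left them as stubs rather than guess at commands or endpoints. The stubs below only record the call order so the sketch runs on any machine.

```python
import time


def run_sequence(models, prompts, ask, restart=None, clock=time.perf_counter):
    """Run every prompt against every model in order and return per-prompt timings.

    ask(model, prompt) and restart() are supplied by the caller (hypothetical hooks).
    If restart is given, it is called before each model, which is the isolation test.
    """
    results = {}
    for model in models:
        if restart is not None:
            restart()
        timings = []
        for prompt in prompts:
            start = clock()
            ask(model, prompt)
            timings.append(clock() - start)
        results[model] = timings
    return results


# Stub hooks so the sketch is runnable without any hardware.
calls = []


def fake_ask(model, prompt):
    calls.append(("ask", model, prompt))


def fake_restart():
    calls.append(("restart",))


MODELS = ["deepseek_r1:1.5b", "llama3.2:1b", "qwen2.5-coder:1.5b", "qwen2.5:1.5b", "qwen2:1.5b"]
PROMPTS = ["Q1", "Q2"]  # placeholders for the two fixed questions

forward = run_sequence(MODELS, PROMPTS, fake_ask)                            # baseline order
isolated = run_sequence(MODELS, PROMPTS, fake_ask, restart=fake_restart)     # restart isolation
reverse = run_sequence(list(reversed(MODELS)), PROMPTS, fake_ask)            # sequence reversal
```

Comparing `forward`, `isolated`, and `reverse` by sequence position rather than by model name is what separates a position effect from a model effect.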

I am a retired systems architect, not an AI engineer. I have not examined HailoRT source. What I have is direct runtime log evidence from my own hardware, two test runs with timing data, and a hypothesis grounded in decades of debugging runtime memory management issues. This is offered in the hope that it helps locate the root cause and improves the platform.

Happy to run additional verification tests and provide results on request.

Barry Berg, Apple Valley MN — June 2026

giladn · June 18, 2026, 4:12pm

Hi, thank you for your report.
It looks like you did a manual installation and didn’t use the official RPi installation flow.
There are several mismatches in your report.
First, please update to Trixie to get the current packages for RPi.
HailoRT 4.23 is for Hailo-8, not Hailo-10.
Please follow official instructions https://www.raspberrypi.com/documentation/computers/ai.html

Hope this will solve your issue.

Barry_Berg · June 18, 2026, 6:32pm

Hi giladn,

I am not interested in Hailo8. The 10H (40 TOPS) is the hardware I am running, and llama 3B is what I was trying to load. My project is much better suited to the 10H than to the video emphasis of Hailo8. I am currently using the Ollama 3 packages on the Pi CPU, running Trixie. However, I found similarities to a couple of other bugs that appear to be SRAM-induced, and this may be the reason the package I was interested in was withdrawn because of too much processing time.

Regards,

Barry