direct log evidence + timing data (Hailo-10H, hailo-ollama v5.3.0)
Finding: the HailoRT log shows no vdevice destroy and no SRAM clear between sequential model loads in hailo-ollama. Direct runtime log evidence is below.
Hardware: Raspberry Pi 5, Hailo AI HAT+ 2 (Hailo-10H, 8GB, 40 TOPS)
Runtime: hailo-ollama v5.3.0, HailoRT 4.23, Debian Bookworm 64-bit
Models tested: deepseek_r1:1.5b, llama3.2:1b, qwen2.5-coder:1.5b, qwen2.5:1.5b, qwen2:1.5b
Application: Sequential multi-model RAG inference pipeline
Primary Finding
The HailoRT runtime log captured on this hardware shows two models loading back-to-back with no vdevice destroy, SRAM release, or memory free operation between them:
[2026-06-05 11:39:16] [vdevice.cpp:228] Creating vdevice — qwen2.5:1.5b
[2026-06-05 11:39:16] [llm.cpp:237] Sending 6 HEF chunks to server
[2026-06-05 11:39:21] GenerationThread: Finished loading model 'qwen2.5:1.5b'
[2026-06-05 11:39:21] GenerationThread: got prompt
--- no vdevice destroy, no SRAM release ---
[2026-06-05 11:41:45] [vdevice.cpp:228] Creating vdevice — llama3.2:1b
[2026-06-05 11:41:45] [llm.cpp:237] Sending 6 HEF chunks to server
[2026-06-05 11:41:49] GenerationThread: Finished loading model 'llama3.2:1b'
The same log file contains this from April 26, 2026 on the same hardware:
[2026-04-26 17:13:13] [vdevice.cpp:227] Creating vdevice — model request
[2026-04-26 17:13:18] [error] Failed to create vdevice.
not enough free devices. requested: 1, found: 0
[2026-04-26 17:13:18] CHECK_SUCCESS failed: HAILO_OUT_OF_PHYSICAL_DEVICES(74)
[2026-04-26 17:14:25] [vdevice.cpp:227] Creating vdevice — retry 67 seconds later
The prior model's vdevice was still locked when the next model attempted device acquisition. The retry 67 seconds after the failure succeeded, which suggests the prior vdevice timed out rather than being explicitly released.
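For anyone who wants to check their own log for the same pattern, here is a minimal sketch of the audit I did by eye. It assumes the timestamp format shown in the excerpts above. The `RELEASE_HINTS` substrings are my own guesses at what a teardown line might contain; they are not taken from HailoRT documentation or source, so adjust them to whatever your log actually prints.

```python
import re
from datetime import datetime

# Matches the "[YYYY-MM-DD HH:MM:SS] message" prefix seen in the excerpts above.
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s*(.*)$")

# Assumption: guessed substrings for a teardown line, not confirmed HailoRT output.
RELEASE_HINTS = ("destroy", "release")


def parse(lines):
    """Yield (timestamp, message) for every line that carries a timestamp."""
    for line in lines:
        m = TS.match(line.strip())
        if m:
            yield datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"), m.group(2)


def audit(lines):
    """Return (unreleased, retry_gaps).

    unreleased: timestamps of vdevice creations that follow an earlier creation
                with no release-like line in between.
    retry_gaps: seconds between each failed creation and the next attempt.
    """
    unreleased, retry_gaps = [], []
    seen_create = released = False
    failed_at = None
    for ts, msg in parse(lines):
        low = msg.lower()
        if "failed to create vdevice" in low:
            failed_at = ts
        elif "creating vdevice" in low:
            if failed_at is not None:
                retry_gaps.append((ts - failed_at).total_seconds())
                failed_at = None
            elif seen_create and not released:
                unreleased.append(ts)
            seen_create, released = True, False
        elif any(hint in low for hint in RELEASE_HINTS):
            released = True
    return unreleased, retry_gaps


april = [
    "[2026-04-26 17:13:13] [vdevice.cpp:227] Creating vdevice",
    "[2026-04-26 17:13:18] [error] Failed to create vdevice.",
    "[2026-04-26 17:14:25] [vdevice.cpp:227] Creating vdevice",
]
print(audit(april))
```

Run against the April excerpt, the retry gap comes out at 67 seconds, matching the timestamps above. An empty `unreleased` list only means no back-to-back creations were seen; it does not prove a release happened.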
Observed Effects
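Five models were run sequentially, two questions each, over two runs (R1 and R2). All are 1b to 1.5b HEF models, with identical prompts and a fixed temperature. Models are labeled A through E by position in the sequence, and all times are in seconds.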
| Model | R1 Q1 | R2 Q1 | R1 Q2 | R2 Q2 | R1 Total | R2 Total |
|---|---|---|---|---|---|---|
| 1st | 144.9s | 138.0s | 61.8s | 104.9s | 430.3s | 476.8s |
| 2nd | 47.3s | 59.7s | 43.9s | 42.5s | 190.5s | 226.8s |
| 3rd | 38.3s | 47.2s | 42.8s | 37.9s | 169.7s | 195.9s |
| 4th | 39.8s | 24.9s | 59.5s | 57.5s | 163.9s | 132.7s |
| 5th | 67.1s | 55.6s | 25.9s | 111.0s | 177.0s | 308.2s |
Notable:
- Model E Q2: 25.9s (Run 1) vs 111.0s (Run 2), a 4.3x variance on an identical prompt
- Model A (1st in sequence) consistently slowest and most hallucinated
- Model D (4th in sequence) consistently fastest and most accurate
- The same models on CPU ollama show none of this variance
Answer quality correlated with sequence position. Models run later produced better-grounded answers using domain terminology not expected from 1.5b base training. The 4th model consistently outperformed the 1st on identical questions with verifiable ground-truth answers. This is consistent with residual SRAM data from prior model inferences functioning as unintended context for subsequent models.
Agency identification errors varied between sessions on fixed-weight models — different wrong answers to the same question across runs, consistent with variable SRAM state seeding different noise trajectories.
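The run-to-run spread in the table can be recomputed with a few lines of Python. This is only a sketch of the arithmetic: the numbers are copied from the per-question columns of the table above, and the slower-over-faster ratio is my own choice of metric, not something the runtime reports.

```python
# Per-question timings in seconds, copied from the table above:
# sequence position -> {question: (run 1, run 2)}
TIMINGS = {
    "1st": {"Q1": (144.9, 138.0), "Q2": (61.8, 104.9)},
    "2nd": {"Q1": (47.3, 59.7), "Q2": (43.9, 42.5)},
    "3rd": {"Q1": (38.3, 47.2), "Q2": (42.8, 37.9)},
    "4th": {"Q1": (39.8, 24.9), "Q2": (59.5, 57.5)},
    "5th": {"Q1": (67.1, 55.6), "Q2": (25.9, 111.0)},
}


def run_ratio(r1, r2):
    """Slower run divided by faster run for the same model and prompt."""
    return max(r1, r2) / min(r1, r2)


def ratios(timings):
    """Map (position, question) to the run-to-run ratio."""
    return {
        (pos, q): run_ratio(*pair)
        for pos, questions in timings.items()
        for q, pair in questions.items()
    }


# Largest spread first.
for (pos, q), ratio in sorted(ratios(TIMINGS).items(), key=lambda kv: -kv[1]):
    print(f"{pos} {q}: {ratio:.2f}x")
```

The largest ratio is the 5th model's Q2, which is the 4.3x figure quoted in the list above.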
Hypothesis
Each model load sends 6 HEF chunks to the NPU server. With no documented SRAM release between loads, the SRAM state when Model B’s chunks arrive contains whatever Model A left behind. If HailoRT’s chunk-write process writes incrementally rather than overwriting the full SRAM footprint, residual data from Model A remains in unwritten sectors during Model B’s inference.
The key question: does HailoRT’s HEF chunk-write overwrite the full SRAM footprint, or write incrementally into allocated sectors? If incremental, residual activations from Model A are present in Model B’s working memory. This would explain both the timeout behavior (residual data treated as query input on finite deterministic prompts) and the quality improvement for later-sequence models (residual domain vocabulary from prior inferences).
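To make the question concrete, here is a toy model of the two behaviors. It is not HailoRT code and says nothing about how the NPU actually lays out memory: the buffer size, the payload sizes, and both load functions are hypothetical. It only shows why an incremental write of a smaller footprint would leave the previous model's bytes in place.

```python
def load_full_overwrite(sram, payload, fill=0):
    """Hypothetical loader: clear the whole buffer, then write the payload."""
    sram[:] = bytes([fill]) * len(sram)
    sram[:len(payload)] = payload


def load_incremental(sram, payload):
    """Hypothetical loader: write only the sectors the payload occupies."""
    sram[:len(payload)] = payload


def residual_count(sram, payload_len, marker):
    """Count bytes past the new payload that still hold the old marker value."""
    return sum(1 for b in sram[payload_len:] if b == marker)


MODEL_A = b"\xaa" * 48  # stand-in for Model A's footprint
MODEL_B = b"\xbb" * 32  # stand-in for a smaller Model B footprint

incremental = bytearray(64)
load_incremental(incremental, MODEL_A)
load_incremental(incremental, MODEL_B)

overwritten = bytearray(64)
load_full_overwrite(overwritten, MODEL_A)
load_full_overwrite(overwritten, MODEL_B)

print("incremental residual bytes:", residual_count(incremental, len(MODEL_B), 0xAA))
print("full overwrite residual bytes:", residual_count(overwritten, len(MODEL_B), 0xAA))
```

In the incremental case the 16 bytes between offsets 32 and 48 still hold Model A's marker; in the full-overwrite case none do. Which of these, if either, resembles the real chunk-write path is exactly what I am asking.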
Blake’s temperature=0.3 finding in this forum is consistent with this mechanism — lower temperature sharpens argmax selection and suppresses noise in the probability distribution, which is exactly what residual SRAM data would introduce.
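The sharpening effect of a lower temperature is standard softmax behavior and easy to see in isolation. The sketch below uses made-up logits and compares temperature 0.3 against 1.0; the 1.0 baseline is an assumption for illustration, since the default used by hailo-ollama is not stated here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: probabilities proportional to exp(z / T)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens; the first is the intended one.
logits = [2.0, 1.5, 0.5]

p_baseline = softmax(logits, temperature=1.0)
p_low = softmax(logits, temperature=0.3)

print("T=1.0:", [round(p, 3) for p in p_baseline])
print("T=0.3:", [round(p, 3) for p in p_low])
```

At the lower temperature more of the probability mass sits on the top token, so a small perturbation of the logits is less likely to change which token gets sampled. That is the sense in which the temperature finding fits the residual-data hypothesis; it does not by itself show where the perturbation comes from.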
Corroborating Evidence
NN-Core CRC errors (community post #19014, March 27, 2026) during llama3.2:3b inference:
[d2h_events_parser.cpp:428] Got NN-Core CRC Error in CSM unit (x3)
[device_internal.cpp:572] Aborting Infer — CSM CRC-error from NN-core
A CRC error in the CSM unit is the hardware reporting that data read back from on-chip SRAM does not match what was written. The hardware has already located the failure point.
llama3.2:3b was removed from v5.2.0. It runs correctly on CPU ollama and fails specifically on HailoRT, so the failure is runtime-specific, not model-specific.
Proposed Verification
1. Cold boot baseline: single model immediately after power cycle vs same model after extended sequential session. Eliminates all prior state.
2. Per-model restart isolation: full hailo-ollama restart between each model load. If answer quality equalizes across sequence positions, SRAM state is confirmed as the differentiating variable.
3. Sequence reversal: run models in reverse order. If quality improvement pattern reverses, sequence dependency is confirmed.
4. HEF chunk write verification: does chunk-write overwrite full SRAM footprint or write incrementally? This is the key architectural question.
I have run Tests 1 and 2 on my hardware. Per-model hailo-ollama restart (Test 2) eliminates the timeout variance. Cold boot (Test 1) eliminates the first-model quality degradation. Both confirm SRAM state as the variable.
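For anyone who wants to repeat the restart-isolation and sequence-reversal tests, here is a minimal sketch of a timing harness. The `ask` and `restart` callables are placeholders you supply yourself: how you send a prompt to hailo-ollama and how you restart it depend on your setup, and I am deliberately not assuming any endpoint or service name here.

```python
import time


def run_sequence(models, prompts, ask, restart=None):
    """Time every (model, prompt) pair in order.

    ask(model, prompt) -> answer text    (caller-supplied, hypothetical)
    restart()          -> None           (caller-supplied, hypothetical; called
                                          between models when provided)
    """
    results = []
    for position, model in enumerate(models, start=1):
        if restart is not None and position > 1:
            restart()
        for prompt in prompts:
            start = time.perf_counter()
            answer = ask(model, prompt)
            results.append({
                "position": position,
                "model": model,
                "prompt": prompt,
                "seconds": time.perf_counter() - start,
                "answer": answer,
            })
    return results


def reversed_order(models):
    """Model order for the sequence-reversal test."""
    return list(reversed(models))


if __name__ == "__main__":
    # Dry run with stand-in callables so the harness can be checked offline.
    models = ["deepseek_r1:1.5b", "llama3.2:1b", "qwen2.5:1.5b"]
    fake_ask = lambda model, prompt: f"{model} answered"
    for row in run_sequence(reversed_order(models), ["Q1", "Q2"], fake_ask):
        print(row["position"], row["model"], row["prompt"], f"{row['seconds']:.4f}s")
```

Passing `restart=None` gives the plain sequential run; passing a restart function gives the per-model isolation run. Comparing the two result lists position by position is the comparison described in the tests above.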
I am a retired systems architect, not an AI engineer. I have not examined HailoRT source. What I have is direct runtime log evidence from my own hardware, two test runs with timing data, and a hypothesis grounded in decades of debugging runtime memory management issues. This is offered in the hope that it helps locate the root cause and improves the platform.
Happy to run additional verification tests and provide results on request.
Barry Berg, Apple Valley MN — June 2026