Hailo-10H on RPi5: Undocumented API findings + DFC conversion failures with Transformer-based models (SwinV2/ViT/ConvNeXt)

user442 · March 22, 2026, 11:11am

Hailo-10H on Raspberry Pi 5: Undocumented API findings + DFC conversion failures with Transformer-based models

I have been developing a local AI image management application (eauesque / YU AI Manager) integrating Hailo-10H (AI HAT 2 for Raspberry Pi 5) with HailoRT v5.2.0 and DFC v5.2.0. This post shares both undocumented findings from low-level API development and specific DFC conversion failures, in hopes that Hailo engineers can provide guidance.

What I have implemented (all using low-level `hailo_platform` API)

All working features use pre-compiled HEF files from the official Model Zoo. I intentionally avoided hailo-apps and hailo-ollama, instead building directly on hailo_platform wheel:

CLIP semantic search — VDevice.create_infer_model() + uint8 dequantization pipeline
YOLO object detection — same InferModel API
LLM / VLM chat — hailo_platform.genai.LLM / VLM
Whisper speech-to-text — hailo_platform.genai.Speech2Text
VDevice exclusive-access device manager — automatic switching between CLIP / YOLO / LLM / VLM / S2T on a single VDevice (hailo-apps has no equivalent)
Multi-backend fallback — Hailo → CoreML → ONNX Runtime, transparent auto-switching
LAN distributed inference — work-stealing parallel tagging across multiple machines

Undocumented behaviors I had to discover by trial and error

All of the following were resolved through error messages and source code inspection, as no documentation existed:

InferModel API is the correct API — The legacy VStreams API (InferVStreams, ConfigureParams.create_from_hef) returns HAILO_NOT_IMPLEMENTED on Hailo-10H. This is not documented anywhere.
Output buffers must be uint8 — Allocating float32 buffers causes buffer size mismatch. You must allocate uint8 and dequantize afterward.
input() / output() are properties, not methods — Inconsistent with other parts of the API.
quant_info retrieval — infer_model.output().quant_info provides scale / zero_point for dequantization. No documentation exists for this.
hailo-ollama exclusivity — VDevice usage requires stopping hailo-ollama first. The resulting error message does not indicate the cause clearly.

I’m sharing these in case they are useful to other developers or to Hailo for documentation improvements.

DFC conversion failures: Transformer-based models (March 2026, DFC v5.2.0)

I attempted to convert WD-Tagger models (Danbooru tag classification) from ONNX to HEF. All three failed at the parser stage, before reaching optimization:

Model	Size	Error	Stage
wd-swinv2-tagger-v3	446 MB	`IndexError` in `_convert_axes_to_nhwc`	Pre-optimization
wd-vit-tagger-v3	362 MB	Same	Pre-optimization
wd-convnext-tagger-v3	377 MB	`UnsupportedShuffleLayerError`	Pre-optimization

500 calibration images were prepared but never reached the quantization stage.

Root cause (as I understand it): The DFC ONNX parser cannot handle LayerNormalization (multi-dimensional axis conversion) and certain Transpose patterns. These are fundamental building blocks of SwinV2, ViT, and ConvNeXt architectures — the majority of models developed since 2022.

I note that CLIP ViT exists in the Model Zoo as a working HEF, which suggests Hailo may have applied internal graph transformations that are not available to end users through DFC.

Questions / feature requests

Is there any plan to support LayerNormalization and general Transpose patterns in DFC? These are required for essentially all Transformer-based vision models.
Is an ONNX Runtime Execution Provider for Hailo-10H under consideration? This would be the most developer-friendly solution — eliminating the conversion step entirely. For comparison, Ryzen AI (XDNA) requires only ort.InferenceSession("model.onnx", providers=["DmlExecutionProvider"]). The absence of an equivalent for Hailo-10H is a significant barrier.
Is there any workaround or additional tooling for converting SwinV2 / ViT / ConvNeXt models that is not publicly documented?

Any guidance from Hailo engineers would be greatly appreciated.

Environment: Raspberry Pi 5 (aarch64), AI HAT 2, HailoRT v5.2.0, DFC v5.2.0 (x86_64 Linux), Python 3.11
Project: GitHub - eauesque/yu_ai_manager: AI-generated image metadata manager — browse, search, tag and rate your Stable Diffusion / NovelAI / ComfyUI outputs. Quart + SQLite + TypeScript WebUI with Tauri desktop app. · GitHub

user442 · April 7, 2026, 3:09pm

Subject: Follow-up: WD-Tagger DFC Conversion — Results on DFC v5.3.0

Hi all,

In March 2026, I posted a report on DFC conversion failures for WD-Tagger models (SwinV2, ViT, ConvNeXt) under DFC v5.2.0. I have now retested all three models under DFC v5.3.0. This is a follow-up with results, observations, and some additional findings I hope will be useful to the community.

DFC v5.3.0 Re-test Results

Model	Size	v5.2.0	v5.3.0	Change
wd-swinv2-tagger-v3	446 MB	`IndexError` in `_convert_axes_to_nhwc`	Same	None
wd-vit-tagger-v3	362 MB	Same	Same (after onnxsim retry)	Retry flow added
wd-convnext-tagger-v3	377 MB	`UnsupportedShuffleLayerError`	Same + additional `UnsupportedModelError`	Errors increased

All three models still fail at the parser stage. The 500 calibration images prepared for quantization remain unused.

What Changed in v5.3.0 — Encouraging Signs

While the failures persist, v5.3.0 shows clear evidence of active work toward Transformer support:

_create_layer_normalization_layer method added — This method did not exist in v5.2.0. DFC now explicitly attempts to handle LayerNormalization operators. The internal implementation is not yet complete — the call to _convert_axes_to_nhwc still raises IndexError: list index out of range — but the method’s presence is a strong signal that this is actively being worked on.
onnxsim simplification + retry flow added — DFC now automatically simplifies the ONNX model and retries parsing on failure. The simplified model is saved as model.sim.onnx. The retry fails at the same point, but the infrastructure for handling difficult models has improved.
End node recommendations — ConvNeXt now produces specific end node suggestions and prompts an interactive retry. A meaningful step forward in error recovery UX.

I read these changes as Hailo engineering actively working toward the same goal. I hope this follow-up serves as useful signal for prioritization.

What We Built in the Meantime

Since we could not wait for DFC support, we implemented alternatives and documented findings we hope are useful to other Hailo-10H developers:

1. ONNX Runtime multi-backend CLIP encoder DFC conversion being unavailable pushed us to implement a CPU/CUDA/ROCm/DirectML/OpenVINO/CoreML fallback chain using ONNX Runtime. One useful finding: the vector output is compatible with the Hailo HEF-based CLIP encoder — both use the same openai/clip-vit-base-patch16 base, producing 512-dimensional embeddings in the same space. Existing Hailo-built indexes and ONNX-built vectors can coexist.

2. Shared VDevice manager pattern Through trial and error, we documented the undocumented VDevice exclusivity constraint (HAILO_OUT_OF_PHYSICAL_DEVICES(74)) and built a shared singleton manager that allows multiple models (YOLO + CLIP + LLM + VLM + Speech2Text) to coexist on a single VDevice. Verified on HailoRT 5.2.0 and 5.3.0. We have written this up as a reusable pattern document.

3. HailoRT v5.2.0 → v5.3.0 migration notes Key findings for anyone upgrading:

Device node renamed: /dev/hailort0 → /dev/h1x-0 (new driver: hailo1x_pci). Python code via VDevice() is unaffected; Docker device passthrough requires updating.
FormatType.FLOAT32 limitation present in v5.2.0 is resolved in v5.3.0.
All v5.2.0 HEF files load and run correctly under v5.3.0 runtime (7 models verified on Raspberry Pi 5 + AI HAT 2).
SegmentInfo attributes renamed to start_sec / end_sec / text.
numpy < 2 constraint removed in v5.3.0.

These documents are available in our repository: eauesque/yu_ai_manager

Request (unchanged from March)

The core request stands:

Fix _convert_axes_to_nhwc for multi-dimensional LayerNormalization — the method is now being called, the axis mapping just needs to handle non-NCHW inputs correctly.
ONNX Runtime Execution Provider for Hailo-10H — this would make DFC optional and resolve the issue structurally for the entire post-2022 vision model ecosystem.

We will re-test again when the next DFC release is available and post another follow-up. We are rooting for this to land.

Environment: WSL2 Ubuntu, AMD Ryzen 5 5600X, DFC v5.3.0 Models: SmilingWolf/wd-{swinv2,vit,convnext}-tagger-v3 (HuggingFace)

user442 · April 7, 2026, 3:10pm

Subject: Hailo-10H Multi-Model Coexistence: Benchmarks, VDevice Sharing Pattern, and hailo-ollama Integration (HailoRT 5.3.0)

Hi all,

I want to share practical findings from running multiple models concurrently on a single Hailo-10H (Raspberry Pi 5 + AI HAT 2), including measured latency/throughput numbers and a reusable VDevice sharing pattern. All results are from HailoRT 5.3.0 on actual hardware.

Background

When building an application that uses YOLO, CLIP, LLM, VLM, and Speech2Text on the same device, the first obstacle is HAILO_OUT_OF_PHYSICAL_DEVICES(74). The constraint is real — one physical device, one VDevice per process — but it is workable once you understand the rules.

The common failure modes we hit:

Background preloader threads racing to create separate VDevice() instances
is_available() checks that destructively create and abandon a VDevice (GC timing makes this unreliable)
Model switching via del self.vd without calling vd.release() explicitly
Independent modules each calling VDevice() without coordination

The solution is a shared singleton VDevice manager with owner-based access. We have written this up as a reusable pattern with full code: VDEVICE_SHARING_PATTERN.md

Key points:

One VDevice per process, shared by all models
Each consumer registers with an owner name ("yolo", "clip", "llm", etc.)
is_available() must never create a VDevice — check import hailo_platform only
Call vd.release() explicitly on shutdown; del alone is not sufficient
Use VDevice.create_params().group_id to share across processes

Benchmark Results (Pi5 + HailoRT 5.3.0)

Vision-only concurrency

50 iterations each, models: CLIP image encoder + YOLOv8n + CLIP text encoder

Scenario	Model	Median	p95	Throughput	Slowdown
Solo	clip_image	18.7 ms	19.0 ms	53.2/s	×1.00
Solo	yolo	14.8 ms	15.8 ms	66.7/s	×1.00
2-way parallel	clip_image	23.9 ms	24.7 ms	41.9/s	×1.27
2-way parallel	yolo	23.8 ms	24.9 ms	41.7/s	×1.60
3-way parallel	clip_image	46.9 ms	47.1 ms	21.4/s	×2.49
3-way parallel	clip_text	46.9 ms	47.0 ms	21.6/s	—
3-way parallel	yolo	46.8 ms	47.4 ms	21.6/s	×3.09

Combined throughput:

Parallelism	Combined thr	Efficiency
1	60/s	100%
2 (CLIP + YOLO)	82.8/s	69%
3 (CLIP + text + YOLO)	64.0/s	46%

Observation: HailoRT scheduler applies strict equal time-slicing. p95 ≈ median (within 1 ms), confirming deterministic round-robin. The formula latency(N) = N × solo_latency holds reliably for vision-only workloads, which makes router capacity planning straightforward arithmetic.

2-way parallel is practical (69% efficiency). 3-way and beyond shows diminishing returns — combined throughput actually drops below 2-way, so offloading to external CPU/GPU becomes preferable at that point.

GenAI + vision concurrency

hailo-ollama (qwen2.5:1.5b) running LLM generation while yu_ai_manager runs CLIP image encoding concurrently:

Metric	Value
CLIP solo median	18.7 ms
CLIP under LLM load median	152.0 ms
CLIP slowdown	×8.08
LLM throughput (under CLIP load)	~5.6 tok/s

The equal time-slicing model breaks down when GenAI is involved. LLM takes a disproportionately large scheduler slice. Any SLO tighter than ~200 ms for vision tasks is violated when LLM is active. Applications with strict vision latency requirements should queue or fallback to external GPU/CPU when GenAI is running.

hailo-ollama Coexistence (HailoRT 5.3.0)

Cross-process VDevice sharing with hailo-ollama works cleanly on 5.3.0:

bash

HAILO_OLLAMA_VDEVICE_GROUP_ID=MY_APP_SHARED \
OLLAMA_HOST=127.0.0.1:18765 \
/usr/bin/hailo-ollama

Set the same group ID on the application side:

bash

HAILO_VDEVICE_GROUP_ID=MY_APP_SHARED python your_app.py

lsof /dev/h1x-0 confirms both processes hold the fd simultaneously, with the HailoRT scheduler time-slicing between them. Note: use the system package /usr/bin/hailo-ollama (HailoRT 5.3.0 linked), not a user-built binary from 5.2.0.

Proposed Router Capacity Model

Based on these measurements, a simple capacity model for routing inference requests:

predict_latency(new_request):
    if active_genai_count > 0:
        → reject or queue (SLO violation likely)
    elif active_vision_count >= 2:
        → offload to external CPU/GPU
    else:
        → estimated_latency = solo_latency × (active_vision_count + 1)

max_parallel_hailo_vision = 2 is our recommended parameter for production use.

Additional Notes

All v5.2.0 HEF files load correctly under HailoRT 5.3.0 runtime (7 models verified)
Device node renamed: /dev/hailort0 → /dev/h1x-0 (driver: hailo1x_pci). Python via VDevice() is unaffected; Docker device passthrough requires updating
FormatType.FLOAT32 limitation present in 5.2.0 is resolved in 5.3.0
Full migration notes: HAILORT_5_3_0_MIGRATION.md

Michael · April 9, 2026, 8:15am

Hi @user442 ,

Thanks for the very detailed posts.
We will review them carefully.

Thanks,

user442 · April 11, 2026, 6:15am

Hello Michael, thank you for your response. I have since investigated the scheduler priority API in detail and would like to share the results.

Following my earlier post on undocumented API behaviors and DFC conversion failures, I have been investigating multi-task scheduling on a single VDevice (Hailo-10H, HailoRT v5.3.0, Raspberry Pi 5).

This post shares benchmark results for ConfiguredInferModel.set_scheduler_priority(), set_scheduler_threshold(), and set_scheduler_timeout() under real concurrent workloads.

-–

### Environment

Hardware: Raspberry Pi 5 (aarch64), AI HAT 2 (Hailo-10H)
HailoRT: v5.3.0
Python: 3.11
Models: CLIP (vision) + Qwen2.5-1.5B (LLM via hailo-ollama or hailo_platform.genai)
Project: https://github.com/eauesque/yu_ai_manager

-–

### Findings: valid priority range

Through trial and error, I found that the valid range for set_scheduler_priority() is approximately **0–10**.

0, 1, 10: accepted
100, 255: rejected with invalid argument error

This range is not documented anywhere I could find.

-–

### Scenario 1: In-process Hailo GenAI LLM (hailo_platform.genai.LLM) + CLIP

Scenario	median	p95	mean	slowdown
solo (no LLM)	18.9 ms	19.1 ms	18.9 ms	×1.00
LLM overlap + prio=0	18.8 ms	21.4 ms	18.9 ms	×1.00
LLM overlap + prio=10	18.9 ms	21.4 ms	19.0 ms	×1.00
LLM + prio=10, threshold=2, timeout=1ms	20.0 ms	22.2 ms	20.1 ms	×1.06

**Observation:** In-process LLM contention causes almost no CLIP latency degradation. Priority settings had no measurable effect; threshold=2 + timeout=1ms slightly worsened latency.

Note: hailo_platform.genai.LLM exposes no scheduler/priority API, so only the vision-side ConfiguredInferModel could be tuned.

-–

### Scenario 2: hailo-ollama (separate process) + CLIP

Scenario	median	p95	mean	slowdown
solo (no ollama)	18.8 ms	18.9 ms	18.7 ms	×1.00
ollama active + prio=0	152.1 ms	155.1 ms	152.4 ms	×8.14
ollama active + prio=10	152.1 ms	153.9 ms	152.2 ms	×8.13
ollama + prio=10, threshold=2, timeout=1ms	151.9 ms	153.4 ms	106.1 ms	×5.66

**Observation:** hailo-ollama running in a separate process causes an **×8 CLIP latency degradation** that set_scheduler_priority() cannot mitigate. The difference between prio=0 and prio=10 is 0.1% — effectively zero. threshold + timeout reduced the mean slightly but left p50/p95 unchanged, suggesting it trimmed some outlier iterations without improving the structural contention.

-–

### Conclusion / architectural implication

Based on these results, the scheduler priority API is not an effective tool for protecting latency-sensitive vision inference when hailo-ollama is running in a separate process. The ×8 slowdown appears to stem from VDevice arbitration at a level below what ConfiguredInferModel scheduler settings can reach.

For my router implementation, I have adopted the following policy: **do not run latency-sensitive vision tasks concurrently with hailo-ollama; use queue or fallback instead.** This works for my use case, but it means vision and LLM inference are effectively serialized.

-–

### Questions

Is the ×8 slowdown from hailo-ollama contention a known limitation of HailoRT v5.3.0, and is it expected to improve in v5.4.0?
Is there a recommended way to share a VDevice between a ConfiguredInferModel (vision) and hailo-ollama (LLM) with latency isolation?
Is there any priority or QoS mechanism that operates at the hailo-ollama level, or between separate processes sharing a VDevice?

Any guidance would be appreciated.

-–

*Reproduction scripts:*

tests/hailo_router_baseline/router_verify_test4_priority.py
tests/hailo_router_baseline/router_verify_test4b_ollama_priority.py

user442 · April 15, 2026, 6:49am

HailoRT 5.3.0 — Undocumented behaviors and CMA leak bugs

Following up with findings from HailoRT 5.3.0 testing on Raspberry Pi 5 + AI HAT+. Several of these are bugs rather than configuration issues.

1. `temperature=0.0` rejected — undocumented breaking change in 5.3.0

LLM.generate() with temperature=0.0 raises HAILO_INVALID_ARGUMENT:

[HailoRT] [error] CHECK failed - Temperature should be higher than '0'. received: '0'

This triggers on the very first generate() call after loading, regardless of model state or concurrent usage. Any OpenAI-compatible client that sends temperature=0 by default will hit this immediately.

This was not the case in HailoRT 5.2.0. It is not documented in the 5.3.0 release notes.

Workaround:

python

temperature = max(temperature, 0.01)

2. GenAI × 2 concurrent loading confirmed (LLM + Speech2Text on same VDevice)

Tested on Pi 5 + AI HAT+, HailoRT 5.3.0, cma=512M:

Model	Load time	CMA used
qwen2.5-1.5b-chat	7.1s	~228 MB
Whisper-tiny (same VDevice)	1.1s	~18 MB additional
Whisper-base (same VDevice)	1.4s	~8 MB additional

Both LLM + Whisper-tiny (~246 MB total) and LLM + Whisper-base (~408 MB total) load and run successfully on the same VDevice. Speech2Text.transcribe() and LLM.generate() (temperature > 0) work correctly in sequence.

Note: YOLO InferModel uses 0 MB CMA (confirmed). YOLO is not a factor in CMA budget calculations.

3. CMA is not reclaimed until full system reboot — `hailo1x_pci` driver bug

CMA allocated by HailoRT is not returned under any of the following conditions:

Action	CMA returned?
`VDevice.release()`	No
Server process exit (fd closed)	No
`rmmod hailo1x_pci && modprobe hailo1x_pci`	No
Full system reboot (PCIe power-cycle)	Yes

This appears to be a bug in hailo1x_pci: dma_free_coherent() is not called on device fd close or module unload. Only a PCIe power-cycle restores the allocations.

Implication: VDevice must be treated as a process-lifetime singleton. Evict-and-reload cycles accumulate CMA consumption permanently within a boot session.

4. VDevice creation failure leaks CMA permanently

When Speech2Text.create() (or any GenAI model load) fails due to OOM, the CMA allocated during the partial VDevice.create() is not returned:

[HailoRT] mmap_buffer.cpp: Failed to mmap buffer with errno:12
HAILO_OUT_OF_HOST_MEMORY(3)
RuntimeError: unmatched '}' in format string   ← malformed OOM error message in 5.3.0

Observed leak per failed attempt: −20 to −32 MB of CMA, permanently.

Sequence:

VDevice.create() → succeeds, allocates ~32 MB CMA
Speech2Text.create() → fails (OOM)
VDevice.__exit__() → CMA not returned

This creates a feedback loop: each failed load attempt reduces available CMA, making the next attempt more likely to fail, until the system requires a full reboot.

Note: The malformed error message (unmatched '}' in format string) is a secondary bug in HailoRT 5.3.0’s OOM error formatter.

Environment: Raspberry Pi 5 8 GB (aarch64), AI HAT+, HailoRT 5.3.0, Raspberry Pi OS (Linux 6.12.62+rpt-rpi-2712), CMA=512M
Project: eauesque/yu_ai_manager

user442 · April 15, 2026, 6:49am

Raspberry Pi 5 CMA ceiling and config.txt optimization for Hailo AI HAT+

1. `cma=1G` and `cma=768M` silently fail — `numa=fake=8` node boundary

The default Raspberry Pi OS kernel applies numa=fake=8, splitting 8 GB of RAM into 8 virtual NUMA nodes of 1 GB each. Linux CMA must be allocated as contiguous memory within a single NUMA node boundary at boot time.

cma=1G and cma=768M silently fail — CmaTotal becomes 0 with no kernel panic, no error message. The system boots normally as if the setting were never applied.

Always verify after any cmdline.txt change:

bash

grep CmaTotal /proc/meminfo

Setting	Result
`cma=1G`	Silent failure → CmaTotal=0
`cma=768M`	Silent failure → CmaTotal=0 (verified 2026-04-15)
`cma=512M`	Confirmed stable ✓ (verified 2026-04-15) ← recommended
`cma=256M`	Stable, but tight for LLM + Whisper concurrent use

2. `dtoverlay=vc4-kms-v3d` + `max_framebuffers=2` consumes ~157 MB CMA at boot

The default config.txt includes dtoverlay=vc4-kms-v3d and max_framebuffers=2. The Pi firmware pre-allocates CMA framebuffers at boot time. This allocation persists even when the Linux DRM driver subsequently fails to initialize (e.g. Couldn't get core clock in dmesg).

config.txt state	CmaFree at boot
`vc4-kms-v3d` + `max_framebuffers=2` enabled (default)	~257 MB
Both commented out	~305 MB (+48 MB)

On a headless server, this wastes ~157 MB of CMA that Hailo models need.

Fix:

ini

#dtoverlay=vc4-kms-v3d
#max_framebuffers=2

3. `camera_auto_detect=1` loads `pisp_be` and consumes additional CMA

camera_auto_detect=1 (default) loads the Pi ISP backend (pisp_be) and videobuf2_dma_contig at boot, consuming additional CMA even with no camera attached.

config.txt state	CmaFree at boot
After disabling vc4-kms-v3d	~305 MB
+ `camera_auto_detect=0` + `display_auto_detect=0`	~426 MB

Fix:

ini

camera_auto_detect=0
display_auto_detect=0

Note: camera_auto_detect=0 affects CSI cameras only. USB cameras (UVC) are unaffected.

4. Headless CMA budget with `cma=512M` and optimized config

Headless baseline after all optimizations: ~98 MB consumed at boot, leaving ~426 MB free.

Configuration	CMA used
LLM (qwen2.5-1.5b) alone	~228 MB
LLM + Whisper-tiny	~246 MB
LLM + Whisper-base	~408 MB

LLM + Whisper-base simultaneous loading fits with ~80 MB headroom on cma=512M with headless optimization.

Recommended minimal `config.txt` for headless AI HAT+ use

ini

auto_initramfs=1
arm_64bit=1
arm_boost=1

[cm5]
dtoverlay=dwc2,dr_mode=host

[all]
dtparam=pciex1_gen=3

Environment: Raspberry Pi 5 8 GB (aarch64), AI HAT+, HailoRT 5.3.0, Raspberry Pi OS (Linux 6.12.62+rpt-rpi-2712)
Project: eauesque/yu_ai_manager

Michael · May 3, 2026, 12:27pm

Hi @user442 ,

Thanks - We are reviewing all the notes.

Hailo-10H on RPi5: Undocumented API findings + DFC conversion failures with Transformer-based models (SwinV2/ViT/ConvNeXt)

Hailo-10H on Raspberry Pi 5: Undocumented API findings + DFC conversion failures with Transformer-based models

What I have implemented (all using low-level hailo_platform API)

Undocumented behaviors I had to discover by trial and error

DFC conversion failures: Transformer-based models (March 2026, DFC v5.2.0)

Questions / feature requests

DFC v5.3.0 Re-test Results

What Changed in v5.3.0 — Encouraging Signs

What We Built in the Meantime

Request (unchanged from March)

Background

Benchmark Results (Pi5 + HailoRT 5.3.0)

Vision-only concurrency

GenAI + vision concurrency

hailo-ollama Coexistence (HailoRT 5.3.0)

Proposed Router Capacity Model

Additional Notes

1. temperature=0.0 rejected — undocumented breaking change in 5.3.0

2. GenAI × 2 concurrent loading confirmed (LLM + Speech2Text on same VDevice)

3. CMA is not reclaimed until full system reboot — hailo1x_pci driver bug

4. VDevice creation failure leaks CMA permanently

1. cma=1G and cma=768M silently fail — numa=fake=8 node boundary

2. dtoverlay=vc4-kms-v3d + max_framebuffers=2 consumes ~157 MB CMA at boot

3. camera_auto_detect=1 loads pisp_be and consumes additional CMA

4. Headless CMA budget with cma=512M and optimized config

Recommended minimal config.txt for headless AI HAT+ use

What I have implemented (all using low-level `hailo_platform` API)

1. `temperature=0.0` rejected — undocumented breaking change in 5.3.0

3. CMA is not reclaimed until full system reboot — `hailo1x_pci` driver bug

1. `cma=1G` and `cma=768M` silently fail — `numa=fake=8` node boundary

2. `dtoverlay=vc4-kms-v3d` + `max_framebuffers=2` consumes ~157 MB CMA at boot

3. `camera_auto_detect=1` loads `pisp_be` and consumes additional CMA

4. Headless CMA budget with `cma=512M` and optimized config

Recommended minimal `config.txt` for headless AI HAT+ use