Hailo-10H on RPi5: Undocumented API findings + DFC conversion failures with Transformer-based models (SwinV2/ViT/ConvNeXt)

I have been developing a local AI image management application (eauesque / YU AI Manager) integrating Hailo-10H (AI HAT 2 for Raspberry Pi 5) with HailoRT v5.2.0 and DFC v5.2.0. This post shares both undocumented findings from low-level API development and specific DFC conversion failures, in hopes that Hailo engineers can provide guidance.


What I have implemented (all using low-level hailo_platform API)

All working features use pre-compiled HEF files from the official Model Zoo. I intentionally avoided hailo-apps and hailo-ollama, instead building directly on hailo_platform wheel:

  • CLIP semantic search — VDevice.create_infer_model() + uint8 dequantization pipeline
  • YOLO object detection — same InferModel API
  • LLM / VLM chat — hailo_platform.genai.LLM / VLM
  • Whisper speech-to-text — hailo_platform.genai.Speech2Text
  • VDevice exclusive-access device manager — automatic switching between CLIP / YOLO / LLM / VLM / S2T on a single VDevice (hailo-apps has no equivalent)
  • Multi-backend fallback — Hailo → CoreML → ONNX Runtime, transparent auto-switching
  • LAN distributed inference — work-stealing parallel tagging across multiple machines

Undocumented behaviors I had to discover by trial and error

All of the following were resolved through error messages and source code inspection, as no documentation existed:

  1. InferModel API is the correct API — The legacy VStreams API (InferVStreams, ConfigureParams.create_from_hef) returns HAILO_NOT_IMPLEMENTED on Hailo-10H. This is not documented anywhere.
  2. Output buffers must be uint8 — Allocating float32 buffers causes a buffer size mismatch. You must allocate uint8 and dequantize afterward.
  3. input() / output() are properties, not methods — Inconsistent with other parts of the API.
  4. quant_info retrieval — infer_model.output().quant_info provides scale / zero_point for dequantization. No documentation exists for this.
  5. hailo-ollama exclusivity — VDevice usage requires stopping hailo-ollama first. The resulting error message does not indicate the cause clearly.
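To make findings 2 and 4 concrete, the dequantization step reduces to the usual affine formula in plain numpy. The attribute names qp_scale / qp_zp on quant_info are what I observed on my build; treat them as an assumption rather than a documented contract:

```python
import numpy as np

def dequantize_output(raw_uint8: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Affine dequantization of a uint8 device buffer back to float32.

    scale / zero_point come from infer_model.output().quant_info
    (qp_scale / qp_zp on my build -- attribute names are an observation,
    not documented behavior).
    """
    return (raw_uint8.astype(np.float32) - zero_point) * scale

# Synthetic example: scale=0.5, zero_point=128
raw = np.array([128, 130, 126], dtype=np.uint8)
print(dequantize_output(raw, 0.5, 128.0))  # values 0.0, 1.0, -1.0
```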

I'm sharing these in case they are useful to other developers or to Hailo for documentation improvements.


DFC conversion failures: Transformer-based models (March 2026, DFC v5.2.0)

I attempted to convert WD-Tagger models (Danbooru tag classification) from ONNX to HEF. All three failed at the parser stage, before reaching optimization:

| Model | Size | Error | Stage |
|---|---|---|---|
| wd-swinv2-tagger-v3 | 446 MB | IndexError in _convert_axes_to_nhwc | Pre-optimization |
| wd-vit-tagger-v3 | 362 MB | Same | Pre-optimization |
| wd-convnext-tagger-v3 | 377 MB | UnsupportedShuffleLayerError | Pre-optimization |

I had prepared 500 calibration images, but the conversion never reached the quantization stage.

Root cause (as I understand it): The DFC ONNX parser cannot handle LayerNormalization (multi-dimensional axis conversion) and certain Transpose patterns. These are fundamental building blocks of the SwinV2, ViT, and ConvNeXt architectures — and of the majority of vision models developed since 2022.

I note that CLIP ViT exists in the Model Zoo as a working HEF, which suggests Hailo may have applied internal graph transformations that are not available to end users through DFC.


Questions / feature requests

  1. Is there any plan to support LayerNormalization and general Transpose patterns in DFC? These are required for essentially all Transformer-based vision models.
  2. Is an ONNX Runtime Execution Provider for Hailo-10H under consideration? This would be the most developer-friendly solution — eliminating the conversion step entirely. For comparison, Ryzen AI (XDNA) requires only ort.InferenceSession("model.onnx", providers=["VitisAIExecutionProvider"]). The absence of an equivalent for Hailo-10H is a significant barrier.
  3. Is there any workaround or additional tooling for converting SwinV2 / ViT / ConvNeXt models that is not publicly documented?

Any guidance from Hailo engineers would be greatly appreciated.


Environment: Raspberry Pi 5 (aarch64), AI HAT 2, HailoRT v5.2.0, DFC v5.2.0 (x86_64 Linux), Python 3.11
Project: eauesque/yu_ai_manager — AI-generated image metadata manager: browse, search, tag and rate your Stable Diffusion / NovelAI / ComfyUI outputs. Quart + SQLite + TypeScript WebUI with Tauri desktop app.


Subject: Follow-up: WD-Tagger DFC Conversion — Results on DFC v5.3.0


Hi all,

In March 2026, I posted a report on DFC conversion failures for WD-Tagger models (SwinV2, ViT, ConvNeXt) under DFC v5.2.0. I have now retested all three models under DFC v5.3.0. This is a follow-up with results, observations, and some additional findings I hope will be useful to the community.


DFC v5.3.0 Re-test Results

| Model | Size | v5.2.0 | v5.3.0 | Change |
|---|---|---|---|---|
| wd-swinv2-tagger-v3 | 446 MB | IndexError in _convert_axes_to_nhwc | Same | None |
| wd-vit-tagger-v3 | 362 MB | Same | Same (after onnxsim retry) | Retry flow added |
| wd-convnext-tagger-v3 | 377 MB | UnsupportedShuffleLayerError | Same + additional UnsupportedModelError | Errors increased |

All three models still fail at the parser stage. The 500 calibration images prepared for quantization remain unused.


What Changed in v5.3.0 — Encouraging Signs

While the failures persist, v5.3.0 shows clear evidence of active work toward Transformer support:

  1. _create_layer_normalization_layer method added — This method did not exist in v5.2.0. DFC now explicitly attempts to handle LayerNormalization operators. The internal implementation is not yet complete — the call to _convert_axes_to_nhwc still raises IndexError: list index out of range — but the method's presence is a strong signal that this is actively being worked on.

  2. onnxsim simplification + retry flow added — DFC now automatically simplifies the ONNX model and retries parsing on failure. The simplified model is saved as model.sim.onnx. The retry fails at the same point, but the infrastructure for handling difficult models has improved.

  3. End node recommendations — ConvNeXt now produces specific end node suggestions and prompts an interactive retry. A meaningful step forward in error recovery UX.

I read these changes as Hailo engineering actively working toward the same goal. I hope this follow-up serves as useful signal for prioritization.


What We Built in the Meantime

Since we could not wait for DFC support, we implemented alternatives and documented findings we hope are useful to other Hailo-10H developers:

**1. ONNX Runtime multi-backend CLIP encoder**
With DFC conversion unavailable, we implemented a CPU/CUDA/ROCm/DirectML/OpenVINO/CoreML fallback chain using ONNX Runtime. One useful finding: the vector output is compatible with the Hailo HEF-based CLIP encoder — both use the same openai/clip-vit-base-patch16 base, producing 512-dimensional embeddings in the same space. Existing Hailo-built indexes and ONNX-built vectors can coexist.
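The compatibility claim is easy to sanity-check with plain cosine similarity. The vectors below are synthetic stand-ins for real encoder outputs; with both backends on openai/clip-vit-base-patch16, the real 512-d vectors live in the same space, so a cross-backend comparison like this is meaningful:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two CLIP embeddings (512-d in our case)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for a Hailo HEF embedding and an ONNX Runtime
# embedding of the same image (real outputs differ only by quantization noise).
rng = np.random.default_rng(0)
hef_vec = rng.standard_normal(512).astype(np.float32)
onnx_vec = hef_vec + 0.01 * rng.standard_normal(512).astype(np.float32)
print(cosine(hef_vec, onnx_vec) > 0.999)  # near-identical embeddings score ~1
```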

**2. Shared VDevice manager pattern**
Through trial and error, we documented the undocumented VDevice exclusivity constraint (HAILO_OUT_OF_PHYSICAL_DEVICES(74)) and built a shared singleton manager that allows multiple models (YOLO + CLIP + LLM + VLM + Speech2Text) to coexist on a single VDevice. Verified on HailoRT 5.2.0 and 5.3.0. We have written this up as a reusable pattern document.

**3. HailoRT v5.2.0 → v5.3.0 migration notes**
Key findings for anyone upgrading:

  • Device node renamed: /dev/hailort0 → /dev/h1x-0 (new driver: hailo1x_pci). Python code via VDevice() is unaffected; Docker device passthrough requires updating.

  • FormatType.FLOAT32 limitation present in v5.2.0 is resolved in v5.3.0.

  • All v5.2.0 HEF files load and run correctly under v5.3.0 runtime (7 models verified on Raspberry Pi 5 + AI HAT 2).

  • SegmentInfo attributes renamed to start_sec / end_sec / text.

  • numpy < 2 constraint removed in v5.3.0.
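For the SegmentInfo rename, a small accessor shim let our Speech2Text code run on both runtimes. The 5.3.0 names (start_sec / end_sec / text) are from the notes above; the older spelling in the fallback tuple is a hypothetical placeholder, since I did not record the exact pre-rename names:

```python
def segment_fields(seg):
    """Return (start, end, text) from a Speech2Text SegmentInfo across
    HailoRT versions. 5.3.0 uses start_sec / end_sec / text; the older
    attribute spelling below is a guess, so we probe with hasattr.
    """
    for start_attr, end_attr, text_attr in (
        ("start_sec", "end_sec", "text"),  # HailoRT 5.3.0 (documented above)
        ("start", "end", "text"),          # assumed pre-5.3.0 spelling (hypothetical)
    ):
        if hasattr(seg, start_attr) and hasattr(seg, end_attr):
            return (getattr(seg, start_attr),
                    getattr(seg, end_attr),
                    getattr(seg, text_attr))
    raise AttributeError("unrecognized SegmentInfo layout")
```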

These documents are available in our repository: eauesque/yu_ai_manager


Request (unchanged from March)

The core request stands:

  1. Fix _convert_axes_to_nhwc for multi-dimensional LayerNormalization — the method is now being called; the axis mapping just needs to handle non-NCHW inputs correctly.

  2. ONNX Runtime Execution Provider for Hailo-10H — this would make DFC optional and resolve the issue structurally for the entire post-2022 vision model ecosystem.

We will re-test again when the next DFC release is available and post another follow-up. We are rooting for this to land.


Environment: WSL2 Ubuntu, AMD Ryzen 5 5600X, DFC v5.3.0
Models: SmilingWolf/wd-{swinv2,vit,convnext}-tagger-v3 (HuggingFace)


Subject: Hailo-10H Multi-Model Coexistence: Benchmarks, VDevice Sharing Pattern, and hailo-ollama Integration (HailoRT 5.3.0)


Hi all,

I want to share practical findings from running multiple models concurrently on a single Hailo-10H (Raspberry Pi 5 + AI HAT 2), including measured latency/throughput numbers and a reusable VDevice sharing pattern. All results are from HailoRT 5.3.0 on actual hardware.


Background

When building an application that uses YOLO, CLIP, LLM, VLM, and Speech2Text on the same device, the first obstacle is HAILO_OUT_OF_PHYSICAL_DEVICES(74). The constraint is real — one physical device, one VDevice per process — but it is workable once you understand the rules.

The common failure modes we hit:

  • Background preloader threads racing to create separate VDevice() instances

  • is_available() checks that destructively create and abandon a VDevice (GC timing makes this unreliable)

  • Model switching via del self.vd without calling vd.release() explicitly

  • Independent modules each calling VDevice() without coordination

The solution is a shared singleton VDevice manager with owner-based access. We have written this up as a reusable pattern with full code: VDEVICE_SHARING_PATTERN.md

Key points:

  • One VDevice per process, shared by all models

  • Each consumer registers with an owner name ("yolo", "clip", "llm", etc.)

  • is_available() must never create a VDevice — check only that import hailo_platform succeeds

  • Call vd.release() explicitly on shutdown; del alone is not sufficient

  • Use VDevice.create_params().group_id to share across processes
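A minimal sketch of the pattern (the class and method names are ours, not a Hailo API; the full version lives in VDEVICE_SHARING_PATTERN.md):

```python
import threading

class HailoDeviceManager:
    """Process-wide singleton owning the one-and-only VDevice (sketch)."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self) -> None:
        self._vdevice = None
        self._owners: set[str] = set()

    @classmethod
    def get(cls) -> "HailoDeviceManager":
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def _create_vdevice(self):
        # Deferred import on purpose: is_available() style checks must never
        # construct a VDevice as a side effect.
        from hailo_platform import VDevice
        params = VDevice.create_params()
        params.group_id = "MY_APP_SHARED"  # same ID enables cross-process sharing
        return VDevice(params)

    def acquire(self, owner: str):
        """Register a consumer ("yolo", "clip", ...) and return the shared VDevice."""
        with self._lock:
            if self._vdevice is None:
                self._vdevice = self._create_vdevice()
            self._owners.add(owner)
            return self._vdevice

    def release(self, owner: str) -> None:
        """Unregister one owner; release the device explicitly when the last leaves."""
        with self._lock:
            self._owners.discard(owner)
            if not self._owners and self._vdevice is not None:
                self._vdevice.release()  # del alone is not sufficient
                self._vdevice = None
```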


Benchmark Results (Pi5 + HailoRT 5.3.0)

Vision-only concurrency

50 iterations each, models: CLIP image encoder + YOLOv8n + CLIP text encoder

| Scenario | Model | Median | p95 | Throughput | Slowdown |
|---|---|---|---|---|---|
| Solo | clip_image | 18.7 ms | 19.0 ms | 53.2/s | ×1.00 |
| Solo | yolo | 14.8 ms | 15.8 ms | 66.7/s | ×1.00 |
| 2-way parallel | clip_image | 23.9 ms | 24.7 ms | 41.9/s | ×1.27 |
| 2-way parallel | yolo | 23.8 ms | 24.9 ms | 41.7/s | ×1.60 |
| 3-way parallel | clip_image | 46.9 ms | 47.1 ms | 21.4/s | ×2.49 |
| 3-way parallel | clip_text | 46.9 ms | 47.0 ms | 21.6/s | — |
| 3-way parallel | yolo | 46.8 ms | 47.4 ms | 21.6/s | ×3.09 |

Combined throughput:

| Parallelism | Combined throughput | Efficiency |
|---|---|---|
| 1 | 60/s | 100% |
| 2 (CLIP + YOLO) | 82.8/s | 69% |
| 3 (CLIP + text + YOLO) | 64.0/s | 46% |

Observation: The HailoRT scheduler applies strict equal time-slicing. p95 ≈ median (within 1 ms), confirming deterministic round-robin. The formula latency(N) = N × solo_latency holds reliably for vision-only workloads, which makes router capacity planning straightforward arithmetic.

2-way parallel is practical (69% efficiency). 3-way and beyond shows diminishing returns — combined throughput actually drops below 2-way, so offloading to an external CPU/GPU becomes preferable at that point.

GenAI + vision concurrency

hailo-ollama (qwen2.5:1.5b) running LLM generation while yu_ai_manager runs CLIP image encoding concurrently:

| Metric | Value |
|---|---|
| CLIP solo median | 18.7 ms |
| CLIP under LLM load median | 152.0 ms |
| CLIP slowdown | ×8.08 |
| LLM throughput (under CLIP load) | ~5.6 tok/s |

The equal time-slicing model breaks down when GenAI is involved. LLM takes a disproportionately large scheduler slice. Any SLO tighter than ~200 ms for vision tasks is violated when LLM is active. Applications with strict vision latency requirements should queue or fallback to external GPU/CPU when GenAI is running.


hailo-ollama Coexistence (HailoRT 5.3.0)

Cross-process VDevice sharing with hailo-ollama works cleanly on 5.3.0:

```bash
HAILO_OLLAMA_VDEVICE_GROUP_ID=MY_APP_SHARED \
OLLAMA_HOST=127.0.0.1:18765 \
/usr/bin/hailo-ollama
```

Set the same group ID on the application side:

```bash
HAILO_VDEVICE_GROUP_ID=MY_APP_SHARED python your_app.py
```

lsof /dev/h1x-0 confirms both processes hold the fd simultaneously, with the HailoRT scheduler time-slicing between them. Note: use the system package /usr/bin/hailo-ollama (HailoRT 5.3.0 linked), not a user-built binary from 5.2.0.


Proposed Router Capacity Model

Based on these measurements, a simple capacity model for routing inference requests:

```text
predict_latency(new_request):
    if active_genai_count > 0:
        → reject or queue (SLO violation likely)
    elif active_vision_count >= 2:
        → offload to external CPU/GPU
    else:
        → estimated_latency = solo_latency × (active_vision_count + 1)
```

max_parallel_hailo_vision = 2 is our recommended parameter for production use.
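The policy is small enough to write out directly. This is a hedged Python rendering of the capacity model above; the solo latencies are our measured medians, used as illustrative constants only:

```python
# Measured solo medians from the tables above (Pi5 + HailoRT 5.3.0)
SOLO_LATENCY_MS = {"clip_image": 18.7, "yolo": 14.8}
MAX_PARALLEL_HAILO_VISION = 2  # beyond this, combined throughput drops

def route(model: str, active_vision: int, active_genai: int):
    """Return (decision, estimated_latency_ms) for a new vision request."""
    if active_genai > 0:
        # x8 slowdown under LLM load makes any tight vision SLO unachievable
        return ("queue", None)
    if active_vision >= MAX_PARALLEL_HAILO_VISION:
        return ("offload", None)  # send to external CPU/GPU instead
    # Equal time-slicing: latency(N) = N x solo_latency
    return ("hailo", SOLO_LATENCY_MS[model] * (active_vision + 1))
```

Under this model, route("clip_image", 1, 0) estimates 37.4 ms on the Hailo path, matching the 2-way numbers we measured.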


Additional Notes

  • All v5.2.0 HEF files load correctly under HailoRT 5.3.0 runtime (7 models verified)

  • Device node renamed: /dev/hailort0 → /dev/h1x-0 (driver: hailo1x_pci). Python via VDevice() is unaffected; Docker device passthrough requires updating

  • FormatType.FLOAT32 limitation present in 5.2.0 is resolved in 5.3.0

  • Full migration notes: HAILORT_5_3_0_MIGRATION.md


Hi @user876,

Thanks for the very detailed posts.
We will review them carefully.

Thanks,

Hello Michael, thank you for your response. I have since investigated the scheduler priority API in detail and would like to share the results.

Following my earlier post on undocumented API behaviors and DFC conversion failures, I have been investigating multi-task scheduling on a single VDevice (Hailo-10H, HailoRT v5.3.0, Raspberry Pi 5).

This post shares benchmark results for ConfiguredInferModel.set_scheduler_priority(), set_scheduler_threshold(), and set_scheduler_timeout() under real concurrent workloads.

---

### Environment

  • Hardware: Raspberry Pi 5 (aarch64), AI HAT 2 (Hailo-10H)

  • HailoRT: v5.3.0

  • Python: 3.11

  • Models: CLIP (vision) + Qwen2.5-1.5B (LLM via hailo-ollama or hailo_platform.genai)

  • Project: https://github.com/eauesque/yu_ai_manager

---

### Findings: valid priority range

Through trial and error, I found that the valid range for set_scheduler_priority() is approximately **0–10**.

  • 0, 1, 10: accepted

  • 100, 255: rejected with invalid argument error

This range is not documented anywhere I could find.
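A small helper makes the probing repeatable. Everything here is an observation on HailoRT v5.3.0 (the accepted range, the invalid-argument failure mode), not documented behavior:

```python
def try_set_priority(configured_model, prio: int) -> bool:
    """Attempt ConfiguredInferModel.set_scheduler_priority(prio).

    In my tests, HailoRT v5.3.0 rejected values outside roughly 0-10
    with an invalid-argument error (observed, not documented).
    """
    try:
        configured_model.set_scheduler_priority(prio)
        return True
    except Exception:
        return False

def probe_max_priority(configured_model, upper: int = 255) -> int:
    """Linear downward probe for the highest accepted priority value.

    This mirrors how I discovered the 0-10 range by trial and error.
    """
    for prio in range(upper, -1, -1):
        if try_set_priority(configured_model, prio):
            return prio
    return 0
```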

---

### Scenario 1: In-process Hailo GenAI LLM (hailo_platform.genai.LLM) + CLIP

| Scenario | Median | p95 | Mean | Slowdown |
|---|---|---|---|---|
| solo (no LLM) | 18.9 ms | 19.1 ms | 18.9 ms | ×1.00 |
| LLM overlap + prio=0 | 18.8 ms | 21.4 ms | 18.9 ms | ×1.00 |
| LLM overlap + prio=10 | 18.9 ms | 21.4 ms | 19.0 ms | ×1.00 |
| LLM + prio=10, threshold=2, timeout=1ms | 20.0 ms | 22.2 ms | 20.1 ms | ×1.06 |

**Observation:** In-process LLM contention causes almost no CLIP latency degradation. Priority settings had no measurable effect; threshold=2 + timeout=1ms slightly worsened latency.

Note: hailo_platform.genai.LLM exposes no scheduler/priority API, so only the vision-side ConfiguredInferModel could be tuned.

---

### Scenario 2: hailo-ollama (separate process) + CLIP

| Scenario | Median | p95 | Mean | Slowdown |
|---|---|---|---|---|
| solo (no ollama) | 18.8 ms | 18.9 ms | 18.7 ms | ×1.00 |
| ollama active + prio=0 | 152.1 ms | 155.1 ms | 152.4 ms | ×8.14 |
| ollama active + prio=10 | 152.1 ms | 153.9 ms | 152.2 ms | ×8.13 |
| ollama + prio=10, threshold=2, timeout=1ms | 151.9 ms | 153.4 ms | 106.1 ms | ×5.66 |

**Observation:** hailo-ollama running in a separate process causes an **×8 CLIP latency degradation** that set_scheduler_priority() cannot mitigate. The difference between prio=0 and prio=10 is 0.1% — effectively zero. threshold + timeout reduced the mean slightly but left p50/p95 unchanged, suggesting it trimmed some outlier iterations without improving the structural contention.

---

### Conclusion / architectural implication

Based on these results, the scheduler priority API is not an effective tool for protecting latency-sensitive vision inference when hailo-ollama is running in a separate process. The ×8 slowdown appears to stem from VDevice arbitration at a level below what ConfiguredInferModel scheduler settings can reach.

For my router implementation, I have adopted the following policy: **do not run latency-sensitive vision tasks concurrently with hailo-ollama; use queue or fallback instead.** This works for my use case, but it means vision and LLM inference are effectively serialized.

---

### Questions

  1. Is the ×8 slowdown from hailo-ollama contention a known limitation of HailoRT v5.3.0, and is it expected to improve in v5.4.0?

  2. Is there a recommended way to share a VDevice between a ConfiguredInferModel (vision) and hailo-ollama (LLM) with latency isolation?

  3. Is there any priority or QoS mechanism that operates at the hailo-ollama level, or between separate processes sharing a VDevice?

Any guidance would be appreciated.

---

*Reproduction scripts:*

  • tests/hailo_router_baseline/router_verify_test4_priority.py
  • tests/hailo_router_baseline/router_verify_test4b_ollama_priority.py