YOLO ONNX to Hailo HEF: End nodes, postprocessing and unsupported ops.

Hi, I’m working on converting a custom YOLO-based object detection model (YOLO26m) to a Hailo HEF and ran into some confusion regarding end nodes and postprocessing.

Environment

  • Hailo Dataflow Compiler: 3.33.0

  • CUDA: 12.5

  • cuDNN: 9.20 (for CUDA 12.x)

  • NVIDIA Driver: 595

  • Python: 3.10

  • Ubuntu 22.04

Installation was done according to the documentation. GPU is available and being used.

I was advised to set the end nodes at Conv layers (detection heads).
This works, but results in multiple output tensors (bbox + class heads across scales).

However, I know that some models in the Hailo Model Zoo / Model Explorer return a single output tensor and already have postprocessed / structured outputs.

By analyzing the ONNX graph, I noticed that:

  • Cutting at Conv layers removes the entire postprocessing pipeline

  • The model actually includes operations like Sigmoid, Concat, TopK, Gather, ReduceMax, etc.

So instead, I tried setting the end node at the very end of the graph (right before the final output), at /model.23/Concat_6. This causes DFC parsing to fail with errors like “GatherElements/TopK/Mod/ReduceMax operation is unsupported”.

  1. Are these postprocessing operations (TopK, Gather, ReduceMax, etc.) generally unsupported by DFC?
  2. If so, how are the models in Hailo Model Zoo/ Model Explorer able to produce a single, structured output tensor?
  3. How do you implement the postprocessing? Would you be able to describe what steps are applied or share any example pipelines / reference implementations?

Any clarification or best practices would be greatly appreciated.

Thanks!

Great questions. TL;DR: a working example for YOLO26 (object detection, pose) is slated for an upcoming release in the hailo-apps repo. The light postprocessing is performed with an ONNX Runtime engine on the host side, which is very fast even on a Raspberry Pi. The general technique is already demonstrated in the C++ examples: onnxrt_hailo_pipeline in hailo-apps.

Now for the mechanics. The Gather/ReduceMax operations (and certainly Sigmoid) could in principle be implemented on the Hailo chip, capitalizing on its flexibility, either on the neural core or on specialized units (depending on the exact flavor), but that entails some configuration-tailoring work which is not dev-time-efficient in this case. Similarly, hand-writing an equivalent postprocessing in Python/C++ can take some time to debug. The principled way to achieve guaranteed equivalence is to split the ONNX into two parts: the first corresponds to the HEF (you can verify the cut point by parsing it without any start-node/end-node specification at all), while the second corresponds to the postprocessing. At inference time, postprocessing is applied by feeding the HEF outputs through an ONNX engine running the second-part ONNX. One only needs to take care of matching the tensors and transposing axes according to the respective conventions (e.g. NHWC vs. NCHW), etc.