Error when converting .onnx to .hef

I'm using the DFC to convert a .onnx model trained on a custom dataset into a .hef to run on a Hailo-8 HAT device. My system information:

  • CPU: Xeon E3-1270V3
  • RAM: 16GB
  • GPU: NVIDIA GTX 1060 6GB
  • GPU Driver version: 575.64.03
  • CUDA version: 12.5.1
  • CUDNN version: 9.10
  • Hailo DFC version: 3.32.0
  • Hailo Model Zoo version: 2.16

The command that I use: hailomz compile yolov11s --ckpt=yolov11s-vehicles.onnx --hw-arch hailo8 --calib-path train/images --classes 4 --performance

The output of the command:
[info] No GPU chosen, Selected GPU 0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1755710630.788529 6570 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1755710630.792892 6570 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
In file included from /home/hyann/hailodfc/lib/python3.10/site-packages/numpy/core/include/numpy/ndarraytypes.h:1929,
from /home/hyann/hailodfc/lib/python3.10/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
from /home/hyann/hailodfc/lib/python3.10/site-packages/numpy/core/include/numpy/arrayobject.h:5,
from /home/hyann/.pyxbld/temp.linux-x86_64-3.10/home/hyann/vehicle-detection-model-training/hailo_model_zoo/hailo_model_zoo/core/postprocessing/cython_utils/cython_nms.c:1145:
/home/hyann/hailodfc/lib/python3.10/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
17 | #warning "Using deprecated NumPy API, disable it with "
| ^~~~~~~
Start run for network yolov11s …
Initializing the hailo8 runner…
[info] Translation started on ONNX model yolov11s
[info] Restored ONNX model yolov11s (completion time: 00:00:00.30)
[info] Extracted ONNXRuntime meta-data for Hailo model (completion time: 00:00:00.79)
[info] NMS structure of yolov8 (or equivalent architecture) was detected.
[info] In order to use HailoRT post-processing capabilities, these end node names should be used: /model.23/cv3.0/cv3.0.2/Conv /model.23/cv2.0/cv2.0.2/Conv /model.23/cv3.1/cv3.1.2/Conv /model.23/cv2.1/cv2.1.2/Conv /model.23/cv2.2/cv2.2.2/Conv /model.23/cv3.2/cv3.2.2/Conv.
[info] Start nodes mapped from original model: 'images': 'yolov11s/input_layer1'.
[info] End nodes mapped from original model: '/model.23/cv2.0/cv2.0.2/Conv', '/model.23/cv3.0/cv3.0.2/Conv', '/model.23/cv2.1/cv2.1.2/Conv', '/model.23/cv3.1/cv3.1.2/Conv', '/model.23/cv2.2/cv2.2.2/Conv', '/model.23/cv3.2/cv3.2.2/Conv'.
[info] Translation completed on ONNX model yolov11s (completion time: 00:00:01.70)
[info] Appending model script commands to yolov11s from string
[info] Added nms postprocess command to model script.
[info] Saved HAR to: /home/hyann/vehicle-detection-model-training/yolov11s.har
Using generic alls script found in /home/hyann/vehicle-detection-model-training/hailo_model_zoo/hailo_model_zoo/cfg/alls/generic/yolov11s.alls because there is no specific hardware alls
Preparing calibration data…
[info] Loading model script commands to yolov11s from /home/hyann/vehicle-detection-model-training/hailo_model_zoo/hailo_model_zoo/cfg/alls/generic/yolov11s.alls
[info] Loading model script commands to yolov11s from string
[info] Found model with 3 input channels, using real RGB images for calibration instead of sampling random data.
[info] Starting Model Optimization
I0000 00:00:1755710644.504764 6570 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5242 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1
I0000 00:00:1755710644.709806 6570 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5242 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1
[info] Using default optimization level of 2
[info] Model received quantization params from the hn
I0000 00:00:1755710646.735036 6570 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5242 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1
I0000 00:00:1755710646.952869 6570 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5242 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1
[info] MatmulDecompose skipped
[info] Starting Mixed Precision
[info] Model Optimization Algorithm Mixed Precision is done (completion time is 00:00:00.90)
[info] LayerNorm Decomposition skipped
[info] Starting Statistics Collector
[info] Using dataset with 64 entries for calibration
Calibration: 0%| | 0/64 [00:00<?, ?entries/s]I0000 00:00:1755710690.993486 6728 cuda_dnn.cc:529] Loaded cuDNN version 91200
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1755710691.291733 6728 conv_ops_gpu.cc:330] None of the algorithms provided by cuDNN frontend heuristics worked; trying fallback algorithms. Conv: batch: 8
in_depths: 3
out_depths: 32
in: 641
in: 641
data_format: 1
filter: 3
filter: 3
filter: 3
dilation: 1
dilation: 1
stride: 2
stride: 2
padding: 0
padding: 0
dtype: DT_FLOAT
group_count: 1
device_identifier: "sm_6.1 with 6357188608B RAM, 10 cores, 1784500KHz clock, 4004000KHz mem clock, 1572864B L2$"
version: 3

Traceback (most recent call last):
………..

2 root error(s) found.
(0) NOT_FOUND: No algorithm worked! Error messages:
Profiling failure on CUDNN engine eng1{}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc(6121): 'status'
Profiling failure on CUDNN engine eng28{}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc(6121): 'status'
Profiling failure on CUDNN engine eng0{}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc(6121): 'status'
[[{{node yolov11s_1/conv1_1/conv_op_1/Conv2D}}]]
[[yolov11s_1/ew_sub_softmax1_1/act_op_1/StatefulPartitionedCall/critical_section_execute/StatefulPartitionedCall/cond_2/pivot_f/_50/_35]]
(1) NOT_FOUND: No algorithm worked! Error messages:
Profiling failure on CUDNN engine eng1{}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc(6121): 'status'
Profiling failure on CUDNN engine eng28{}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc(6121): 'status'
Profiling failure on CUDNN engine eng0{}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc(6121): 'status'
[[{{node yolov11s_1/conv1_1/conv_op_1/Conv2D}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_one_step_on_data_distributed_162809]

Does anyone have any idea why this error is happening? Please help, many thanks.

Hey @Long_Nguyen,

Welcome to the Hailo Community!

I see you’re encountering a TensorFlow/cuDNN kernel selection failure during the calibration/optimization stage. Looking at your log, there are a couple of key issues:

  • cuDNN is testing multiple convolution algorithms but failing with “No algorithm worked! … CUDNN_STATUS_EXECUTION_FAILED” on a 3→32 convolution
  • Your convolution input appears to be 641×641 pixels. YOLOv8/11 models typically expect inputs that are multiples of 32 (like 640×640), and these odd dimensions can cause cuDNN algorithm selection problems
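A quick way to sanity-check a candidate resolution before re-exporting (plain shell arithmetic, nothing Hailo-specific; the 32-pixel figure is the total downsampling stride of the YOLO family):

```shell
# YOLOv8/11 downsample by a total stride of 32, so each input side
# should be a multiple of 32. 641 (seen in the log above) is not; 640 is.
for side in 641 640; do
  if [ $((side % 32)) -eq 0 ]; then
    echo "${side}: OK, multiple of 32"
  else
    echo "${side}: not a multiple of 32"
  fi
done
```

If the check fails for your model's input, re-exporting the ONNX at a stride-aligned size such as 640x640 is worth trying before anything else.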

Try running the stages separately for better debugging control:

# 1) Parse → HAR
hailomz parse yolov11s --ckpt yolov11s-vehicles.onnx --hw-arch hailo8 --output yolov11s_unopt.har

# 2) Optimize (calibrate/quantize) → optimized HAR
# Note: For Pascal GPU + cuDNN 9, consider CPU-only calibration to avoid cuDNN issues
CUDA_VISIBLE_DEVICES="" \
hailomz optimize yolov11s \
  --har yolov11s_unopt.har \
  --calib-path train/images \
  --classes 4 \
  --calib-batch-size 1 \
  --input-shape 1x640x640x3 \
  --output yolov11s_opt.har

# 3) Compile → HEF
hailomz compile yolov11s --har yolov11s_opt.har --hw-arch hailo8 --output yolov11s.hef

This follows the “parse first, compile last” workflow, using the HAR file as the intermediate format between stages. This approach gives you better control over each stage of the compilation.
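A note on the CUDA_VISIBLE_DEVICES="" trick in step 2: CUDA enumerates only the devices listed in that variable, so an empty value hides every GPU from that one process, forcing the CPU path; the rest of your shell session is unaffected. A minimal illustration (plain shell, no Hailo or CUDA tooling required):

```shell
# The empty variable applies only to the command it prefixes; the parent
# shell keeps its own environment, so later GPU runs are unaffected.
CUDA_VISIBLE_DEVICES="" sh -c 'echo "calibration process sees: [${CUDA_VISIBLE_DEVICES}]"'
echo "parent shell sees: [${CUDA_VISIBLE_DEVICES-unset}]"
```

The first line prints `calibration process sees: []`, i.e. the variable is set but empty, which is what makes CUDA report zero devices.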

Hope this helps!


Thanks for the reply; the conversion completed successfully after I followed your steps. Still, the compatibility between the Hailo DFC and the NVIDIA driver stack is not great. After I got the error above, I restarted the computer, and then hailomz no longer recognized my GPU and had to fall back to the CPU.
(hailodfc) hyann@hyann-MS-7817:~/vehicle-detection-model-training$ hailomz compile yolov11s --har yolov11s.har --hw-arch hailo8
[info] No GPU chosen and no suitable GPU found, falling back to CPU.
Is there any long-term fix for this problem?

Updated R&D with this; hopefully we will have a solution for it in the next release!

How are you using the GPU? Currently I am running the optimization of a YOLO model on the Linux OS that runs on the camera alongside the Hailo accelerator chip.

I do have a 4090 on my local desktop. Can I quantize a model from .onnx into .hef on my local system (which is not connected to Hailo hardware) and then send the .hef file over to my target system?