How to Parse YOLOv12 Detector Outputs on Hailo?

## Context

I’m working with a YOLOv12 detector model (`detector_v10_m5.hef`) compiled for Hailo. The model has 3 outputs:

- `detector_v10_m5/conv11`: shape `(1, 80, 80, 256)`, dtype `uint8`

- `detector_v10_m5/conv13`: shape `(1, 40, 40, 128)`, dtype `uint8`

- `detector_v10_m5/format_conversion2`: shape `(1, 1, 384, 384)`, dtype `uint8`

## Goal

Parse these outputs to extract bounding boxes `[x1, y1, x2, y2]` and confidence scores for object detection.

## What I’ve Tried

1. **Grid-based parsing** of `conv11`/`conv13`: I treated them as grid feature maps with 5 channels per anchor (x, y, w, h, obj), but the channel counts (256, 128) don't divide evenly by 5, and I'm unsure of the correct decoding formula for YOLOv12.

2. **format_conversion2 as decoded output**: I tried reshaping `(1, 1, 384, 384)` = 147,456 values into `(N, 5)` or `(N, 6)` format, but this produces 24,000+ "detections", which can't be right.

## Questions

1. **What is the correct output format for YOLOv12 on Hailo?** Are `conv11`/`conv13` raw grid outputs that need manual decoding, or is there a decoded output I should use?

2. **What does `format_conversion2` contain?** Is it decoded detections, intermediate features, or something else?

3. **What is the correct decoding formula?** For anchor-free YOLOv12, how should I convert the grid cell outputs to bounding box coordinates? Is there documentation or example code for YOLOv12 post-processing?

4. **Channel organization**: With 256 channels on an 80x80 grid and 128 channels on a 40x40 grid, how are these organized? Multiple anchors, or a different structure? I assume the model is anchor-free, but I'm not certain.

## Model Details

- Model: YOLOv12 (detector_v10_m5)

- Input: 640x640 RGB

- Outputs: See above

- Single-class detection (fish)

Any guidance, documentation links, or example code would be greatly appreciated!


I’m facing the same issue:

```
ValueError: Dimension 1 in both shapes must be equal, but are 40 and 80. Shapes are [8,40,40,64] and [8,80,80,128].
From merging shape 0 with other shapes. for '{{node Postprocessor/transpose/a}} = Pack[N=2, T=DT_FLOAT, axis=0](endnodes, endnodes_1)' with input shapes: [8,40,40,64], [8,80,80,128].
```

How can I update my network YAML to solve this?

Hi @Justin_Olsson ,

Your model was compiled without on-chip NMS, so the output tensors (256-channel, 128-channel, and 384×384) are intermediate feature maps, not detection head outputs. That's why they don't decode cleanly into bounding boxes.

Please try recompiling your HEF with on-chip NMS enabled; this is the standard approach for all YOLO models in the Hailo Model Zoo. YOLOv12 uses the same detection head as YOLOv8, so in your compilation config, set:

```yaml
postprocessing:
  meta_arch: nanodet_v8
  device_pre_post_layers:
    nms: true
    sigmoid: true
  anchors:
    regression_length: 15
    strides: [8, 16, 32]
```

With NMS on-chip, you'll get a single output tensor with ready-to-use `[y_min, x_min, y_max, x_max, score, class_id]` per detection, so no manual decoding is needed.
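For reference, here is a minimal sketch of turning that NMS output into pixel-space `[x1, y1, x2, y2]` boxes. It assumes the common HailoRT layout in which the result is a list indexed by class id, with each entry an `(N, 5)` array of normalized `[y_min, x_min, y_max, x_max, score]` rows; the function name and thresholds are illustrative, not part of any Hailo API:

```python
import numpy as np

def parse_hailo_nms(nms_result, score_thresh=0.3, img_size=640):
    """Convert on-chip NMS output to pixel-space detections.

    Assumes (not guaranteed): nms_result[class_id] is an (N, 5) array of
    [y_min, x_min, y_max, x_max, score] rows in normalized [0, 1] coords.
    """
    detections = []
    for class_id, class_dets in enumerate(nms_result):
        for y_min, x_min, y_max, x_max, score in np.asarray(class_dets).reshape(-1, 5):
            if score < score_thresh:
                continue
            detections.append({
                "bbox": [x_min * img_size, y_min * img_size,   # x1, y1
                         x_max * img_size, y_max * img_size],  # x2, y2
                "score": float(score),
                "class_id": class_id,
            })
    return detections
```

For your single-class fish detector, `nms_result` would have one entry (class 0).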

To answer your specific questions:

  1. `conv11`/`conv13` are backbone/neck feature maps, not detection outputs. Standard YOLOv12 detection head outputs have 64 + num_classes channels per stride (65 for single-class).
  2. `format_conversion2` is a compiler-inserted data layout reshaping layer; it contains no meaningful detection data.
  3. The correct post-processing is DFL (Distribution Focal Loss) anchor-free decoding, the same as YOLOv8, but you shouldn't need to implement it manually if you compile with on-chip NMS (if you're curious, there's a sketch of the decode after this list).
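Should you ever need to decode the raw heads yourself, below is a minimal sketch of the YOLOv8-style DFL decode for a single stride level. It assumes the head tensor has already been dequantized to float, is laid out as `(H, W, 4*reg_max + num_classes)`, and uses the usual `(l, t, r, b) × reg_max` channel grouping; `reg_max = 16` corresponds to `regression_length: 15` in the config above. All names are illustrative:

```python
import numpy as np

def dfl_decode(head_out, stride, reg_max=16, num_classes=1):
    """Decode one YOLOv8-style head of shape (H, W, 4*reg_max + num_classes)."""
    h, w, _ = head_out.shape
    box_logits = head_out[..., :4 * reg_max].reshape(h, w, 4, reg_max)
    cls_logits = head_out[..., 4 * reg_max:]

    # Softmax over the reg_max distance bins, then take the expected distance.
    e = np.exp(box_logits - box_logits.max(axis=-1, keepdims=True))
    prob = e / e.sum(axis=-1, keepdims=True)
    dist = (prob * np.arange(reg_max)).sum(axis=-1)  # (H, W, 4): l, t, r, b in cells

    # Anchor points at grid-cell centers; distances scale by the stride.
    xs, ys = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    boxes = np.stack([(xs - dist[..., 0]) * stride,   # x1
                      (ys - dist[..., 1]) * stride,   # y1
                      (xs + dist[..., 2]) * stride,   # x2
                      (ys + dist[..., 3]) * stride],  # y2
                     axis=-1).reshape(-1, 4)
    scores = (1.0 / (1.0 + np.exp(-cls_logits))).reshape(-1, num_classes)  # sigmoid
    return boxes, scores
```

You would run this once per stride (8, 16, 32), concatenate the results, filter by score, and apply standard NMS. Again, the on-chip NMS path makes all of this unnecessary.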

Thanks,


Hi @user821,

This error means your compilation config is trying to pack outputs from different detection scales (stride 8 → 80×80 and stride 16 → 40×40) into a single tensor, which fails because of the shape mismatch.

The fix is to make sure your network YAML uses `nanodet_v8` as the meta-architecture, which correctly handles multi-scale outputs. The `postprocessing` section should look like:

```yaml
postprocessing:
  meta_arch: nanodet_v8
  anchors:
    regression_length: 15
    strides: [8, 16, 32]
    scale_factors: [0.5, 0.5]
  device_pre_post_layers:
    nms: true
    sigmoid: true
  nms_iou_thresh: 0.7
  score_threshold: 0.001
  nms_max_output_per_class: 300
```

The `nanodet_v8` post-processor handles concatenation of the three stride levels internally, before NMS. Make sure each detection head output is listed as a separate end node, and if you have a custom `.alls` file, remove any reshape or concat directives that try to merge outputs from different strides.

Thanks,
