Unwanted auto_reshapes when not using Hailo for NMS

Neil_Murphy · September 5, 2025, 1:15pm

Hello Hailo Community!

I’m currently running a YOLOX model with NMS engine=cpu. It works very well.

I’m now considering moving the NMS into my own post-processing code. However, I find that if I remove the “nms_postprocess” command from my compile script then the FPS drops. This seems to be because the tools are now adding a bunch of format conversion layers that weren’t present before. I see these new lines appear in my “.auto.alls” file:

auto_reshape_from_input_layer1_to_space_to_depth1 = format_conversion(input_layer1, space_to_depth1, tf_rgb_to_hailo_rgb)
auto_reshape_from_conv54_to_output_layer2 = format_conversion(conv54, output_layer2, hailo_rgb_to_tf_rgb)
auto_reshape_from_conv56_to_output_layer1 = format_conversion(conv56, output_layer1, hailo_rgb_to_tf_rgb)
auto_reshape_from_conv68_to_output_layer5 = format_conversion(conv68, output_layer5, hailo_rgb_to_tf_rgb)
auto_reshape_from_conv55_to_output_layer3 = format_conversion(conv55, output_layer3, hailo_rgb_to_tf_rgb)
auto_reshape_from_conv70_to_output_layer4 = format_conversion(conv70, output_layer4, hailo_rgb_to_tf_rgb)

Is there any way to turn off NMS without getting this weird reshaping behaviour?

Many thanks!

Neil_Murphy · September 8, 2025, 1:40pm

To quantify the above a bit, I see 486 FPS when building with “NMS engine=cpu” but only 356 FPS with no NMS.

With the engine=cpu parse-hef shows:

Architecture HEF was compiled for: HAILO8
Network group name: best_ckpt, Single Context
    Network name: best_ckpt/best_ckpt
        Stream infos:
            Input  best_ckpt/input_layer1 UINT8, NHCW(1024x832x2)
            Output best_ckpt/conv74_107 UINT16, NHCW(32x28x13)
            Output best_ckpt/conv76_107 UINT16, NHCW(32x28x4)
            Output best_ckpt/conv47_107 UINT16, NHCW(128x104x13)
            Output best_ckpt/conv75_107 UINT16, NHCW(32x28x1)
            Output best_ckpt/conv49_107 UINT16, NHCW(128x104x4)
            Output best_ckpt/conv61_107 UINT16, NHCW(64x52x13)
            Output best_ckpt/conv62_107 UINT16, NHCW(64x52x1)
            Output best_ckpt/conv84_107 UINT16, NHCW(16x208x13)
            Output best_ckpt/conv48_107 UINT16, NHCW(128x104x1)
            Output best_ckpt/conv63_107 UINT16, NHCW(64x52x4)
            Output best_ckpt/conv86_107 UINT16, NHCW(16x208x4)
            Output best_ckpt/conv85_107 UINT16, NHCW(16x208x1)
        VStream infos:
            Input  best_ckpt/input_layer1 UINT8, NHWC(1024x832x2)
            Output best_ckpt/yolox_nms_postprocess FLOAT32, HAILO NMS(number of classes: 13, maximum bounding boxes per class: 120, maximum frame size: 31252)
            Operation:
                Op YOLOX
                Name: YOLOX-Post-Process
                Score threshold: 0.300
                IoU threshold: 0.65
                Classes: 13
                Cross classes: false
                Max bboxes per class: 120
                Image height: 832
                Image width: 1024

And with no NMS:

Architecture HEF was compiled for: HAILO8
Network group name: best_ckpt, Single Context
    Network name: best_ckpt/best_ckpt
        Stream infos:
            Input  best_ckpt/input_layer1 UINT8, NHWC(1024x832x2)
            Output best_ckpt/conv74 UINT16, NHCW(32x26x13)
            Output best_ckpt/conv76 UINT16, NHCW(32x26x4)
            Output best_ckpt/conv61 UINT16, FCR(64x52x16)
            Output best_ckpt/conv49 UINT16, NHWC(128x104x4)
            Output best_ckpt/conv84 UINT16, FCR(16x208x16)
            Output best_ckpt/conv47 UINT16, FCR(128x104x16)
            Output best_ckpt/conv63 UINT16, NHCW(64x52x4)
            Output best_ckpt/conv48 UINT16, NHCW(128x104x1)
            Output best_ckpt/conv75 UINT16, NHCW(32x26x1)
            Output best_ckpt/conv62 UINT16, NHCW(64x52x1)
            Output best_ckpt/conv85 UINT16, NHCW(16x208x1)
            Output best_ckpt/conv86 UINT16, NHCW(16x208x4)
        VStream infos:
            Input  best_ckpt/input_layer1 UINT8, NHWC(1024x832x2)
            Output best_ckpt/conv49 UINT16, NHWC(128x104x4)
            Output best_ckpt/conv48 UINT16, NHWC(128x104x1)
            Output best_ckpt/conv47 UINT16, FCR(128x104x13)
            Output best_ckpt/conv63 UINT16, NHWC(64x52x4)
            Output best_ckpt/conv62 UINT16, NHWC(64x52x1)
            Output best_ckpt/conv61 UINT16, FCR(64x52x13)
            Output best_ckpt/conv76 UINT16, NHWC(32x26x4)
            Output best_ckpt/conv75 UINT16, NHWC(32x26x1)
            Output best_ckpt/conv74 UINT16, NHWC(32x26x13)
            Output best_ckpt/conv86 UINT16, NHWC(16x208x4)
            Output best_ckpt/conv85 UINT16, NHWC(16x208x1)
            Output best_ckpt/conv84 UINT16, FCR(16x208x13)

I.e. very different stream formats.

A workaround seems to be to use the “Stream” rather than “VStream” interface. That way I can compile the hef with engine=cpu but bypass actually using the NMS!
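
For anyone trying the same workaround, one practical consequence is that raw Stream buffers arrive in the on-device layout and, presumably, still quantized, so the host has to reorder and dequantize them. Below is a minimal NumPy sketch under stated assumptions: the frame is unpadded NHCW (for each row, all channels, each channel holding one row of pixels), and dequantization is the usual affine form (q - zero_point) * scale. The function name and the scale and zero-point values are hypothetical and not part of HailoRT. The real quantization parameters, and any row padding, would have to come from the stream info in the HEF.

```python
import numpy as np


def nhcw_frame_to_hwc(raw, height, width, channels, scale, zero_point):
    """Dequantize one raw UINT16 NHCW frame and return it as float32 HWC.

    Assumes the buffer holds exactly height * channels * width values,
    i.e. no hardware padding (an assumption, check your stream shapes).
    """
    q = np.frombuffer(raw, dtype=np.uint16)
    if q.size != height * channels * width:
        raise ValueError("unexpected frame size; the stream may be padded")
    hcw = q.reshape(height, channels, width)
    hwc = hcw.transpose(0, 2, 1)  # (H, C, W) -> (H, W, C)
    # Affine dequantization; scale and zero_point are per-output values.
    return (hwc.astype(np.float32) - zero_point) * scale


# Usage with made-up sizes and quantization parameters.
dummy = np.zeros(26 * 13 * 32, dtype=np.uint16).tobytes()
scores = nhcw_frame_to_hwc(dummy, height=26, width=32, channels=13,
                           scale=0.01, zero_point=0.0)
print(scores.shape)
```

Doing the transpose on the host costs CPU time too, so it is worth measuring whether the custom post-processing can index the NHCW buffer directly instead.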

omria · September 10, 2025, 2:35pm

Hey @Neil_Murphy,

You’re correct in your observation. Removing nms_postprocess from the compile script causes the compiler to insert format conversion layers, resulting in the performance degradation you’re experiencing.

Root Cause:
Hailo’s NMS post-processing handles more than just non-maximum suppression - it also manages tensor format alignment efficiently. Without it, the compiler must export raw detection heads with mixed layouts (NHWC, NHCW, FCR, etc.), which triggers the insertion of format conversion layers you’re observing.

1. Compile with NMS, consume raw outputs at runtime
   - Compile using nms_postprocess(engine=cpu)
   - Access raw detection head outputs via the Stream interface rather than VStream
   - This prevents format conversion insertion while maintaining custom post-processing flexibility

2. Use bbox decoding without NMS
   - Compile with nms_postprocess(..., bbox_decoding_only=True)
   - Provides decoded proposals (anchors applied, scores filtered) without IoU suppression
   - Implement your own NMS logic on the host side

3. Stream API approach
   - Utilize Stream outputs to access raw tensor data without forced memory layout conversions
   - Avoids the reshaping overhead introduced by VStream
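
For the host-side NMS mentioned above, a plain greedy NMS in NumPy is usually enough to start with. The sketch below is not Hailo's implementation. It assumes boxes are given as (x1, y1, x2, y2) in pixels and have already been filtered by score. The default IoU threshold of 0.65 mirrors the value shown in the parse-hef output earlier in the thread. The function name is illustrative.

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.65):
    """Greedy NMS over (N, 4) boxes in x1, y1, x2, y2 form.

    Returns the indices of the kept boxes, highest score first.
    """
    boxes = np.asarray(boxes, dtype=np.float32).reshape(-1, 4)
    scores = np.asarray(scores, dtype=np.float32)
    order = np.argsort(-scores)  # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Intersection of the best box with every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        union = areas[best] + areas[rest] - inter
        iou = inter / np.maximum(union, 1e-9)
        # Drop everything that overlaps the best box too much.
        order = rest[iou <= iou_threshold]
    return keep


# The second box overlaps the first with IoU of about 0.68, so it is suppressed.
print(nms([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]],
          [0.9, 0.8, 0.7]))  # [0, 2]
```

To match the "Cross classes: false" setting from the compiled NMS, this would be called once per class rather than once over all boxes.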

Hope this helps!

Neil_Murphy · September 10, 2025, 4:18pm

Hi Omria,

I’ve already implemented the compile-with-NMS-but-use-Streams solution and it’s working nicely!

Incidentally, I don’t think your solution #2 would work for me. I’m using different strides in the X and Y directions for one of my heads and, from what I’ve seen, the Hailo bounding box code assumes the same stride in both dimensions.
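
To illustrate the stride point: if the parse-hef shapes above are read as width x height, the 16x208 head on a 1024x832 input would have a stride of 64 in X and 4 in Y. A custom decoder can simply take the two strides separately. This is a minimal sketch, assuming the standard YOLOX box parameterisation (tx, ty, tw, th) on a dequantized (H, W, 4) regression map. Scaling width by the X stride and height by the Y stride is an assumption about how such a head would be trained, not something confirmed in this thread.

```python
import numpy as np


def decode_yolox_head(reg, stride_x, stride_y):
    """Decode an (H, W, 4) YOLOX regression map into (H*W, 4) boxes.

    Channels are assumed to be (tx, ty, tw, th). Output boxes are
    x1, y1, x2, y2 in input-image pixels, in row-major cell order.
    """
    h, w, _ = reg.shape
    gy, gx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    cx = (reg[..., 0] + gx) * stride_x  # box centre, X uses its own stride
    cy = (reg[..., 1] + gy) * stride_y  # box centre, Y uses its own stride
    bw = np.exp(reg[..., 2]) * stride_x
    bh = np.exp(reg[..., 3]) * stride_y
    boxes = np.stack([cx - bw / 2, cy - bh / 2,
                      cx + bw / 2, cy + bh / 2], axis=-1)
    return boxes.reshape(-1, 4)


# Usage: an all-zero map for a 208-row, 16-column head.
boxes = decode_yolox_head(np.zeros((208, 16, 4), dtype=np.float32),
                          stride_x=64, stride_y=4)
print(boxes.shape)  # (3328, 4)
```

The output can then be fed, together with the class scores, into whatever host-side NMS is in use.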

Many thanks for your help.