To quantify the above a bit: I see 486 FPS when building with “NMS engine=cpu” but only 356 FPS with no NMS.
With engine=cpu, parse-hef shows:
Architecture HEF was compiled for: HAILO8
Network group name: best_ckpt, Single Context
Network name: best_ckpt/best_ckpt
Stream infos:
Input best_ckpt/input_layer1 UINT8, NHCW(1024x832x2)
Output best_ckpt/conv74_107 UINT16, NHCW(32x28x13)
Output best_ckpt/conv76_107 UINT16, NHCW(32x28x4)
Output best_ckpt/conv47_107 UINT16, NHCW(128x104x13)
Output best_ckpt/conv75_107 UINT16, NHCW(32x28x1)
Output best_ckpt/conv49_107 UINT16, NHCW(128x104x4)
Output best_ckpt/conv61_107 UINT16, NHCW(64x52x13)
Output best_ckpt/conv62_107 UINT16, NHCW(64x52x1)
Output best_ckpt/conv84_107 UINT16, NHCW(16x208x13)
Output best_ckpt/conv48_107 UINT16, NHCW(128x104x1)
Output best_ckpt/conv63_107 UINT16, NHCW(64x52x4)
Output best_ckpt/conv86_107 UINT16, NHCW(16x208x4)
Output best_ckpt/conv85_107 UINT16, NHCW(16x208x1)
VStream infos:
Input best_ckpt/input_layer1 UINT8, NHWC(1024x832x2)
Output best_ckpt/yolox_nms_postprocess FLOAT32, HAILO NMS(number of classes: 13, maximum bounding boxes per class: 120, maximum frame size: 31252)
Operation:
Op YOLOX
Name: YOLOX-Post-Process
Score threshold: 0.300
IoU threshold: 0.65
Classes: 13
Cross classes: false
Max bboxes per class: 120
Image height: 832
Image width: 1024
And with no NMS:
Architecture HEF was compiled for: HAILO8
Network group name: best_ckpt, Single Context
Network name: best_ckpt/best_ckpt
Stream infos:
Input best_ckpt/input_layer1 UINT8, NHWC(1024x832x2)
Output best_ckpt/conv74 UINT16, NHCW(32x26x13)
Output best_ckpt/conv76 UINT16, NHCW(32x26x4)
Output best_ckpt/conv61 UINT16, FCR(64x52x16)
Output best_ckpt/conv49 UINT16, NHWC(128x104x4)
Output best_ckpt/conv84 UINT16, FCR(16x208x16)
Output best_ckpt/conv47 UINT16, FCR(128x104x16)
Output best_ckpt/conv63 UINT16, NHCW(64x52x4)
Output best_ckpt/conv48 UINT16, NHCW(128x104x1)
Output best_ckpt/conv75 UINT16, NHCW(32x26x1)
Output best_ckpt/conv62 UINT16, NHCW(64x52x1)
Output best_ckpt/conv85 UINT16, NHCW(16x208x1)
Output best_ckpt/conv86 UINT16, NHCW(16x208x4)
VStream infos:
Input best_ckpt/input_layer1 UINT8, NHWC(1024x832x2)
Output best_ckpt/conv49 UINT16, NHWC(128x104x4)
Output best_ckpt/conv48 UINT16, NHWC(128x104x1)
Output best_ckpt/conv47 UINT16, FCR(128x104x13)
Output best_ckpt/conv63 UINT16, NHWC(64x52x4)
Output best_ckpt/conv62 UINT16, NHWC(64x52x1)
Output best_ckpt/conv61 UINT16, FCR(64x52x13)
Output best_ckpt/conv76 UINT16, NHWC(32x26x4)
Output best_ckpt/conv75 UINT16, NHWC(32x26x1)
Output best_ckpt/conv74 UINT16, NHWC(32x26x13)
Output best_ckpt/conv86 UINT16, NHWC(16x208x4)
Output best_ckpt/conv85 UINT16, NHWC(16x208x1)
Output best_ckpt/conv84 UINT16, FCR(16x208x13)
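I.e. the two builds use very different stream formats: with engine=cpu the 13-channel outputs (conv47, conv61, conv84) are NHCW with 13 channels, whereas with no NMS the same streams are FCR with 16 channels.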
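To make the comparison less eyeball-driven, here is a rough Python sketch that parses the stream-level "Output" lines from the two parse-hef listings above and reports which streams differ, plus a naive per-frame byte count. It rests on several assumptions of mine: that the three numbers in each shape multiply to the element count, that UINT16 means 2 bytes per element with no extra padding beyond what the shape shows, and that the `_107` suffix in the engine=cpu build can be stripped to pair streams by name. It is not meant to explain the FPS gap, only to put numbers on the format difference.

```python
import re
from math import prod

# Stream-level "Output" lines copied from the two parse-hef listings.
WITH_NMS = """
Output best_ckpt/conv74_107 UINT16, NHCW(32x28x13)
Output best_ckpt/conv76_107 UINT16, NHCW(32x28x4)
Output best_ckpt/conv47_107 UINT16, NHCW(128x104x13)
Output best_ckpt/conv75_107 UINT16, NHCW(32x28x1)
Output best_ckpt/conv49_107 UINT16, NHCW(128x104x4)
Output best_ckpt/conv61_107 UINT16, NHCW(64x52x13)
Output best_ckpt/conv62_107 UINT16, NHCW(64x52x1)
Output best_ckpt/conv84_107 UINT16, NHCW(16x208x13)
Output best_ckpt/conv48_107 UINT16, NHCW(128x104x1)
Output best_ckpt/conv63_107 UINT16, NHCW(64x52x4)
Output best_ckpt/conv86_107 UINT16, NHCW(16x208x4)
Output best_ckpt/conv85_107 UINT16, NHCW(16x208x1)
"""

NO_NMS = """
Output best_ckpt/conv74 UINT16, NHCW(32x26x13)
Output best_ckpt/conv76 UINT16, NHCW(32x26x4)
Output best_ckpt/conv61 UINT16, FCR(64x52x16)
Output best_ckpt/conv49 UINT16, NHWC(128x104x4)
Output best_ckpt/conv84 UINT16, FCR(16x208x16)
Output best_ckpt/conv47 UINT16, FCR(128x104x16)
Output best_ckpt/conv63 UINT16, NHCW(64x52x4)
Output best_ckpt/conv48 UINT16, NHCW(128x104x1)
Output best_ckpt/conv75 UINT16, NHCW(32x26x1)
Output best_ckpt/conv62 UINT16, NHCW(64x52x1)
Output best_ckpt/conv85 UINT16, NHCW(16x208x1)
Output best_ckpt/conv86 UINT16, NHCW(16x208x4)
"""

# Assumption: a trailing "_<digits>" (e.g. "_107") is only a build-specific
# suffix, so it is dropped to pair streams across the two builds.
LINE_RE = re.compile(r"Output\s+\S+/(\w+?)(?:_\d+)?\s+(\w+),\s+(\w+)\(([\dx]+)\)")

# Assumption: element sizes implied by the type names.
BYTES_PER_ELEMENT = {"UINT8": 1, "UINT16": 2}


def parse(listing):
    """Map stream name -> (layout, dims, naive byte count per frame)."""
    streams = {}
    for match in LINE_RE.finditer(listing):
        name, dtype, layout, shape = match.groups()
        dims = tuple(int(d) for d in shape.split("x"))
        streams[name] = (layout, dims, prod(dims) * BYTES_PER_ELEMENT[dtype])
    return streams


with_nms = parse(WITH_NMS)
no_nms = parse(NO_NMS)

# Streams whose layout or dimensions differ between the two builds.
differing = sorted(
    name for name in with_nms if with_nms[name][:2] != no_nms[name][:2]
)

total_with_nms = sum(size for _, _, size in with_nms.values())
total_no_nms = sum(size for _, _, size in no_nms.values())

for name in sorted(with_nms):
    marker = "  <-- differs" if name in differing else ""
    print(f"{name}: engine=cpu {with_nms[name][:2]} | no NMS {no_nms[name][:2]}{marker}")

print(f"naive output bytes per frame, engine=cpu: {total_with_nms}")
print(f"naive output bytes per frame, no NMS:     {total_no_nms}")
```

Swapping in your own parse-hef output should only require replacing the two listing strings, as long as the lines keep the same "Output name TYPE, LAYOUT(AxBxC)" shape.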
A workaround seems to be to use the “Stream” interface rather than the “VStream” interface. That way I can compile the HEF with engine=cpu but bypass actually using the NMS!