Correlation OPs/Throughput & Params/Throughput

Dear all,

I used the hailo_model_zoo object detection benchmarks to see what kind of architecture gives me the best accuracy/throughput tradeoff.

The numbers are somewhat unexpected. One would expect a correlation between OPs and throughput, which is not the case (except for a couple of outliers). Here throughput is defined as:
Throughput = w*h*3*FPS
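
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the throughput formula above and a simple Pearson correlation against OPs could be computed. The model names and numbers in `models` are hypothetical placeholders, not values from the hailo_model_zoo tables, and the helper names `pixel_throughput` and `pearson` are my own; substitute the real input sizes, GOPs and FPS from the benchmark page.

```python
from math import sqrt


def pixel_throughput(width, height, fps, channels=3):
    """Input values processed per second: w * h * channels * FPS."""
    return width * height * channels * fps


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder rows: (name, width, height, GOPs, FPS).
# Replace these with the figures from the model zoo benchmark tables.
models = [
    ("model_a", 320, 320, 2.0, 400.0),
    ("model_b", 640, 640, 10.0, 150.0),
    ("model_c", 448, 448, 6.0, 90.0),
]

gops = [row[3] for row in models]
throughput = [pixel_throughput(row[1], row[2], row[4]) for row in models]

r = pearson(gops, throughput)
print(f"Pearson r between GOPs and throughput: {r:.2f}")
```

A value of r close to 0 would match the observation that OPs alone say little about throughput on this hardware.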

Also, there are a lot of architectures that do not seem to perform well. For example, efficientdet_lite2 has only 5 million parameters but reaches only 91 FPS, whereas YoloV7-Tiny has 6 million parameters but reaches about 360 FPS (on an even larger input size).

Is it such that:

  1. The hardware has been optimized towards e.g. yolov7-tiny, and some operations in the EfficientDet architecture are so costly that it will never be as efficient as yolov7-tiny
  2. The optimization of yolov7-tiny has been thorough because you invested more into showcasing what the hardware is capable of, and with more effort efficientdet_lite2 would scale much better

It’s relevant for me to know which criterion to optimize for first. If the first is the case, I’d have to pick the architecture based on the hailo_model_zoo results. If the second is the case, I’d pick the architecture based on mAP, complexity, compatibility with TensorRT, etc. first, and then invest more time into optimizing it for the Hailo architecture.

The efficientdet_lite2 model does not fit into a single Hailo-8 and therefore is compiled into multiple contexts.
The yolov7_tiny model, on the other hand, does fit into a single Hailo-8, so it runs in a single context. This leads to a higher FPS.

I loaded both models into https://netron.app/. efficientdet_lite2 looks quite a bit longer than yolov7_tiny. This would explain why the former is compiled into multiple contexts.

If you want to understand the models in more detail, I would recommend that you compile them using the Hailo Dataflow Compiler and create a profiler report. This will give you many details about the resources used (compute, memory and control), the FPS for each layer, and a lot more.

Our hardware has not been optimized for specific networks.
The Hailo Dataflow Compiler (DFC) will optimize each network to make use of the hardware as much as possible. You can use the Performance Param to tell the DFC to try extra hard. Compilation will then take a lot longer than usual.

@dennis.huegle I think that the truth is in the middle between the two options you presented. There are architectures that have a better fit to Hailo than others, but I would not say that Hailo is optimized only towards yolov7-tiny or other specific models.
Rather than comparing OPs or params, I suggest using the Model Explorer (Model Zoo by Hailo | AI Model Explorer to Find The Best NN Model), which shows the trade-off between accuracy and throughput. You can use it to see which architectures have the best fit to Hailo. BTW, you can see that there are models that are higher on this curve than yolov7-tiny.
