These are the performance metrics I measured with a batch size of 1 and an input size of 320x320x1:
ONNX float32 on the Intel Iris Xe iGPU: ~50 FPS
uint8 HEF on the Hailo-8L: ~35 FPS
I also checked the PCIe throughput and measured ~13,000 Mbit/s.
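For context, the FPS numbers come from a simple timing loop along the lines of the sketch below (the model path, input layout, and the OpenVINO execution provider are placeholders, not my exact setup):

```python
# Minimal FPS measurement sketch for the float32 ONNX model on the iGPU.
# Assumptions: model file name, NCHW input layout, and the OpenVINO execution
# provider (falls back to CPU if onnxruntime-openvino is not installed).
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
)
input_name = sess.get_inputs()[0].name
frame = np.random.rand(1, 1, 320, 320).astype(np.float32)  # batch 1, 320x320x1

for _ in range(10):                    # warm-up
    sess.run(None, {input_name: frame})

n = 200
start = time.perf_counter()
for _ in range(n):
    sess.run(None, {input_name: frame})
print(f"~{n / (time.perf_counter() - start):.0f} FPS")
```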
It surprises me that the Hailo-8L is slower here than the iGPU: I would expect a dedicated AI accelerator running a uint8 quantized model to outperform an old iGPU running the full-precision float32 ONNX model.
The model most likely does not fit into a single context on the Hailo-8L and is therefore divided into multiple contexts. Run the following command to confirm:
hailortcli parse-hef model.hef
With multi-context models you can increase throughput at the cost of latency by running batches of images, which reduces the context-switching overhead. For example:
hailortcli run model.hef --batch-size 2
hailortcli run model.hef --batch-size 4
hailortcli run model.hef --batch-size 8
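If you prefer to drive this from Python rather than the CLI, a minimal sketch along the lines of the HailoRT Python API inference tutorial looks like this. The model path, batch size, and format types are assumptions, and the batch_size field on the configure params plays the same role as --batch-size above; please verify the exact API against the documentation for your installed HailoRT version:

```python
# Sketch of batched inference with the HailoRT Python API (hailo_platform).
# Paths, batch size, and format types are assumptions - verify against the
# HailoRT documentation for your installed version.
import numpy as np
from hailo_platform import (HEF, VDevice, HailoStreamInterface, InferVStreams,
                            ConfigureParams, InputVStreamParams,
                            OutputVStreamParams, FormatType)

hef = HEF("model.hef")
batch_size = 8

with VDevice() as target:
    configure_params = ConfigureParams.create_from_hef(hef, interface=HailoStreamInterface.PCIe)
    for params in configure_params.values():
        params.batch_size = batch_size  # same idea as --batch-size on the CLI

    network_group = target.configure(hef, configure_params)[0]
    network_group_params = network_group.create_params()

    input_params = InputVStreamParams.make(network_group, format_type=FormatType.UINT8)
    output_params = OutputVStreamParams.make(network_group, format_type=FormatType.FLOAT32)

    input_info = hef.get_input_vstream_infos()[0]
    frames = np.zeros((batch_size, *input_info.shape), dtype=np.uint8)  # dummy 320x320x1 frames

    with InferVStreams(network_group, input_params, output_params) as pipeline:
        with network_group.activate(network_group_params):
            results = pipeline.infer({input_info.name: frames})
```

Timing the pipeline.infer() call for different batch sizes will show the throughput/latency trade-off directly.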
The number of contexts is determined by the three resources available on the Hailo device: compute, memory, and control. You can find this information at the end of the compiler output and in the profiler report.
The compiler already tries to minimize the number of contexts by default. With Performance Mode, however, it will try harder, which takes significantly longer to complete. Please check the performance_param model script command in the Hailo Dataflow Compiler User Guide.
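For reference, this is roughly how Performance Mode can be enabled through a model script when compiling with the DFC Python API. The ClientRunner calls and arguments below are a sketch based on the DFC tutorials, not a drop-in recipe, so please cross-check them against the User Guide for your DFC version:

```python
# Sketch: enabling Performance Mode via performance_param in a model script,
# using the Hailo Dataflow Compiler Python API. Method and argument names are
# assumptions based on the DFC tutorials - verify against your DFC version.
import numpy as np
from hailo_sdk_client import ClientRunner

runner = ClientRunner(hw_arch="hailo8l")
runner.translate_onnx_model("model.onnx", "model")  # assumes a simple single-input ONNX graph

# Ask the compiler to search harder for a better allocation (possibly fewer
# contexts), at the cost of a much longer compilation time.
runner.load_model_script("performance_param(compiler_optimization_level=max)\n")

calib_dataset = np.zeros((64, 320, 320, 1), dtype=np.float32)  # use real calibration images here
runner.optimize(calib_dataset)

hef_binary = runner.compile()
with open("model.hef", "wb") as f:
    f.write(hef_binary)
```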