There are multiple factors that affect the inference speed of your model.
Application
The Hailo architecture can begin processing a new image as soon as the previous image has finished processing in the first layer. If your application waits for the result before sending the next image, you will not get the highest throughput. This is also true for some frameworks, such as ONNX Runtime.
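A minimal sketch of this pattern in Python, assuming hypothetical send_frame()/receive_result() calls that stand in for your runtime's streaming API (with HailoRT these would be input/output vstream operations):

import threading

NUM_FRAMES = 1000

# Hypothetical stand-ins for your runtime's streaming API -- replace them
# with the real calls (e.g. HailoRT input/output vstreams).
def capture_frame():
    return b"raw frame"

def send_frame(frame):
    pass

def receive_result():
    return b"inference result"

def handle_result(result):
    pass

def sender():
    # Keep feeding frames without waiting for any result.
    for _ in range(NUM_FRAMES):
        send_frame(capture_frame())

def receiver():
    # Collect results concurrently, so the device pipeline stays full.
    for _ in range(NUM_FRAMES):
        handle_result(receive_result())

tx = threading.Thread(target=sender)
rx = threading.Thread(target=receiver)
tx.start(); rx.start()
tx.join(); rx.join()

The key point is that sending and receiving run concurrently, so the device never sits idle while the host collects a result.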
The hailortcli command allows you to run a network without your application and measure the FPS of the model:
hailortcli run model.hef
Have a look at the application examples available in the Developer Zone and the Hailo GitHub repository, which show how to run streaming inference:
https://github.com/hailo-ai/Hailo-Application-Code-Examples
Model compiled into multiple contexts
When a model exceeds the resources available on the Hailo hardware, the Hailo Dataflow Compiler splits it into multiple contexts. The model is then executed by loading one context at a time and switching between them; each switch adds overhead, so the model runs slower.
You can confirm this by running the following command:
hailortcli parse-hef model.hef
Here you can see part of the output for the yolov7_tiny model compiled for Hailo-8 (single-context) and Hailo-8L (multi-context).
Architecture HEF was compiled for: HAILO8
Network group name: yolov7_tiny, Single Context
Network name: yolov7_tiny/yolov7_tiny

Architecture HEF was compiled for: HAILO8L
Network group name: yolov7_tiny, Multi Context - Number of contexts: 3
Network name: yolov7_tiny/yolov7_tiny
When your model is compiled into multiple contexts, you can increase throughput by sending multiple images at once using the --batch-size parameter. This increases latency but gives you a higher overall throughput.
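For example, the throughput numbers below were measured on Hailo-8L with commands of the form:
hailortcli run HEF_8L/yolov7_tiny.hef --batch-size 8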
Running streaming inference (HEF_8L/yolov7_tiny.hef):
Transform data: true
Type: auto
Quantized: true
Network yolov7_tiny/yolov7_tiny: 100% | 476 | FPS: 95.18 | --batch-size 1
Network yolov7_tiny/yolov7_tiny: 100% | 644 | FPS: 128.63 | --batch-size 2
Network yolov7_tiny/yolov7_tiny: 100% | 773 | FPS: 154.38 | --batch-size 4
Network yolov7_tiny/yolov7_tiny: 100% | 860 | FPS: 171.75 | --batch-size 8
Some models can be squeezed into a single context by quantizing some layers to 4-bit weights, which is done by adding a command to the ALLS model script. See the Hailo Dataflow Compiler documentation for more details.
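For illustration, such a model-script line could look roughly like the following (the layer name conv5 is hypothetical, and the exact syntax may differ between Dataflow Compiler versions):
quantization_param(conv5, precision_mode=a8_w4)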
PCIe bandwidth
Depending on your model, the PCIe interface (PCIe generation and number of lanes) will play a more or less significant role in the throughput you can achieve. Models compiled into multiple contexts benefit especially from high PCIe bandwidth, because every context switch transfers data over the PCIe interface.
Use the sudo lspci -vvv command to verify the link speed and the number of lanes available:
04:00.0 Co-processor: Device 1e60:2864 (rev 01)
Subsystem: Device 1e60:2864
..
LnkSta: Speed 8GT/s (ok), Width x4 (ok)
..
Kernel driver in use: hailo
Kernel modules: hailo_pci
You can also use the hailo-integration-tool to measure the PCIe throughput and confirm the PCIe configuration. The tool is available in the Hailo Developer Zone.
Host CPU
The host CPU executes part of the pipeline: pre-processing of the images and post-processing of the results. This can require significant resources, especially on weaker host CPUs.
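As a rough illustration, typical per-frame pre-processing on the host might look like this (a sketch using OpenCV and NumPy; the 640x640 input size is an assumption):

import cv2
import numpy as np

def preprocess(frame_bgr):
    # Resize, color-convert and repack on the host CPU; this work scales
    # with resolution and frame rate and can dominate on weak CPUs.
    img = cv2.resize(frame_bgr, (640, 640), interpolation=cv2.INTER_LINEAR)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return np.ascontiguousarray(img, dtype=np.uint8)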
If available, use the post-processing provided by Hailo for popular models like the YOLO family. It can be added during model conversion using the nms_postprocess ALLS command, as sketched below.
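As a rough sketch, such a model-script line could look like the following (the meta_arch value depends on your model, and the exact arguments may differ between Dataflow Compiler versions; see the documentation):
nms_postprocess(meta_arch=yolov5, engine=auto)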
If available, make use of SoC peripherals, e.g. hardware video decoders, to save host CPU cycles.