My model runs slower than expected

There are several factors that can affect the speed of your model.

Application

The Hailo architecture allows processing of a new image to start as soon as the previous image has finished processing in the first layer. If your application waits for the result before sending the next image, you will not reach the highest possible throughput. The same applies to some frameworks like ONNX Runtime.
The hailortcli command lets you run a network independently of your application and measure the FPS of the model:

hailortcli run model.hef

Have a look at the application examples in the Developer Zone and the Hailo GitHub repository, which show how to run streaming inference:

https://github.com/hailo-ai/Hailo-Application-Code-Examples
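
To illustrate the difference, here is a minimal sketch of the pipelining idea in Python. The send_frame() and receive_result() callables are hypothetical stand-ins for the asynchronous send/receive calls of your inference API, not actual HailoRT functions; see the linked examples for real HailoRT code.

import threading

def run_pipelined(images, send_frame, receive_result):
    # images: a list of frames; send_frame() and receive_result() are
    # hypothetical stand-ins for your inference API's async calls.
    results = []

    def reader():
        # Receive results on a separate thread so that sending the
        # next frame never has to wait for the previous result.
        for _ in range(len(images)):
            results.append(receive_result())  # blocks until a result is ready

    reader_thread = threading.Thread(target=reader)
    reader_thread.start()
    for image in images:
        send_frame(image)  # returns as soon as the frame has been queued
    reader_thread.join()
    return results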

Model was compiled to multiple contexts

When a model exceeds the resources available on the Hailo device, the Hailo Dataflow Compiler splits the model into multiple contexts. The model is then executed step by step by switching between these contexts. The context switches take additional time, so the model runs slower.
You can confirm this by running the following command:

hailortcli parse-hef model.hef

Here is part of the output for the yolov7_tiny model compiled for Hailo-8 (single context) and for Hailo-8L (multiple contexts).

Architecture HEF was compiled for: HAILO8
Network group name: yolov7_tiny, Single Context
    Network name: yolov7_tiny/yolov7_tiny
Architecture HEF was compiled for: HAILO8L
Network group name: yolov7_tiny, Multi Context - Number of contexts: 3
    Network name: yolov7_tiny/yolov7_tiny

When your model is compiled into multiple contexts, you can improve throughput by sending multiple images at once using the batch-size parameter. This increases latency but yields a higher overall throughput.
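
For example, to run with a batch size of 8:

hailortcli run HEF_8L/yolov7_tiny.hef --batch-size 8

The measurements below compare batch sizes 1 to 8 for the Hailo-8L HEF: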

Running streaming inference (HEF_8L/yolov7_tiny.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov7_tiny/yolov7_tiny: 100% | 476 | FPS: 95.18 | --batch-size 1
Network yolov7_tiny/yolov7_tiny: 100% | 644 | FPS: 128.63 | --batch-size 2
Network yolov7_tiny/yolov7_tiny: 100% | 773 | FPS: 154.38 | --batch-size 4
Network yolov7_tiny/yolov7_tiny: 100% | 860 | FPS: 171.75 | --batch-size 8

For some models you may be able to squeeze the network into a single context by quantizing some layers to 4-bit weights, which is done by adding a command to the ALLS model script. See the Hailo Dataflow Compiler documentation for more details.
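
For illustration, a quantization_param line of the following form sets a layer to 4-bit weights. The layer name here is a placeholder and the exact syntax may vary between Dataflow Compiler versions, so check the documentation:

quantization_param(conv19, precision_mode=a8_w4)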

PCIe bandwidth

Depending on your model, the PCIe interface (PCIe generation and number of lanes) plays a more or less significant role in the throughput you can achieve. Models compiled into multiple contexts benefit especially from high PCIe bandwidth, since data is transferred over PCIe on every context switch.
Use the sudo lspci -vvv command to verify the link speed and the number of lanes in use:

04:00.0 Co-processor: Device 1e60:2864 (rev 01)
	Subsystem: Device 1e60:2864
..
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
..
	Kernel driver in use: hailo
	Kernel modules: hailo_pci

You can also use the hailo-integration-tool to measure the PCIe throughput and confirm the PCIe configuration. The tool is available in the Hailo Developer Zone.

Host CPU

The host CPU executes parts of the pipeline, such as pre-processing the images and post-processing the results. This can require significant resources, especially on weaker host CPUs.
If available, use the post-processing provided by Hailo for popular models like the YOLO family. It can be added during model conversion using the nms_postprocess ALLS command (see the example below).
Where possible, make use of SoC peripherals, e.g. hardware video decoders, to save host CPU cycles.
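
As a sketch, a typical nms_postprocess line in the ALLS model script looks like the following. The config file name and meta_arch value are placeholders for this example; check the Dataflow Compiler documentation and the Model Zoo ALLS scripts for the exact arguments for your model:

nms_postprocess("yolov7_nms_config.json", meta_arch=yolov5, engine=cpu)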
