I am running YOLOv8m (input size 640×640) on a Raspberry Pi 5 with Hailo accelerator. According to the Hailo documentation, the expected performance is around 130–150 FPS, but I am only getting about 16 FPS in my setup.
I would like to understand how I can optimize my pipeline to improve performance. Specifically:
Should I use INT8 quantization with a proper calibration dataset instead of FP16 to achieve higher FPS?
Does reducing the input resolution (e.g., 512×512 or 416×416) significantly boost FPS without much accuracy loss?
How can I tune batch size to increase throughput while keeping latency acceptable?
Could NMS (non-max suppression) post-processing on the CPU be causing the bottleneck, and how can I move it to the accelerator?
Are there any reference scripts or best practices for running YOLOv8m efficiently on Raspberry Pi 5 with Hailo?
My goal is to achieve >40 FPS with YOLOv8m while maintaining good accuracy. Any guidance, tips, or examples would be highly appreciated.
That is on an x86 machine with 4 PCIe lanes and a batch size of 8.
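For intuition on why benchmark numbers are often quoted at batch size 8: batching amortizes fixed per-call overhead (transfers, setup) over several frames, raising throughput at the cost of per-frame latency. A toy model with made-up numbers (the 5 ms overhead and 5 ms per-frame cost are assumptions for illustration, not Hailo measurements):

```python
def batch_stats(fixed_overhead_ms, per_frame_ms, batch):
    # Toy model: each inference call pays a fixed overhead (transfer/setup)
    # plus a per-frame compute cost.
    batch_time = fixed_overhead_ms + per_frame_ms * batch
    fps = 1000.0 * batch / batch_time
    latency_ms = batch_time  # the last frame waits for the whole batch
    return fps, latency_ms

for b in (1, 4, 8):
    fps, lat = batch_stats(fixed_overhead_ms=5.0, per_frame_ms=5.0, batch=b)
    print(f"batch={b}: {fps:.0f} FPS, {lat:.0f} ms worst-case latency")
```

The pattern is the same on real hardware: larger batches push FPS up but make every frame wait longer, which is why batching helps offline benchmarks more than real-time single-camera pipelines.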
The Hailo device always uses integer operations. That is one reason why our architecture is more power-efficient.
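To make the INT8/calibration question concrete: during model conversion, quantization scales are derived from a calibration dataset of representative inputs. Below is a minimal pure-Python sketch of the underlying idea (symmetric per-tensor quantization); this is illustrative only, not the actual Hailo Dataflow Compiler API, and the function names are made up:

```python
import random

def calibrate_scale(calib_batches):
    # Symmetric scale chosen from the largest absolute activation
    # observed across the calibration set.
    max_abs = max(abs(v) for batch in calib_batches for v in batch)
    return max_abs / 127.0

def quantize(x, scale):
    # Map a float to the signed 8-bit range [-128, 127].
    return max(-128, min(127, round(x / scale)))

def dequantize(q, scale):
    return q * scale

# Random values stand in for real preprocessed calibration frames.
random.seed(0)
calib = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
scale = calibrate_scale(calib)

# Round-trip error is bounded by half a quantization step for in-range values.
worst = max(abs(dequantize(quantize(v, scale), scale) - v) for v in calib[0])
assert worst <= scale / 2 + 1e-9
```

A calibration set that matches your deployment data keeps this scale (and hence the quantization error) well matched to the activations the model actually sees.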
Every bit helps. If you need a significant boost, have a look at the yolov8n model. You can use the Model Explorer in the Developer Zone to compare models on accuracy vs speed.
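On the resolution question from the original post: convolutional compute scales roughly with the number of input pixels, so shrinking the input gives a predictable ceiling on the speedup (actual FPS gains depend on how the compiled model maps to the device). Quick arithmetic sketch:

```python
def relative_compute(side, base=640):
    # Conv FLOPs scale roughly linearly with the number of input pixels.
    return (side * side) / (base * base)

for side in (640, 512, 416):
    print(f"{side}x{side}: ~{relative_compute(side):.0%} of 640x640 compute")
# 512x512 is ~64% and 416x416 is ~42% of the 640x640 compute.
```

Whether the accuracy loss is acceptable depends on your object sizes; small objects suffer first when the input shrinks.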
Yes, on weak CPUs pre- and post-processing can be the bottleneck. The Hailo accelerators have been designed to do the heavy compute of inference at very high efficiency. But they cannot do everything.
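To see what the CPU is doing in NMS post-processing, here is a minimal greedy NMS in pure Python. This is illustrative only; real pipelines use vectorized or on-device implementations, and the O(n²) IoU loop below is exactly the kind of work that can dominate on a weak CPU with many candidate boxes:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, drop boxes overlapping it.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the two overlapping boxes collapse to one
```

Profiling your pipeline to separate inference time from pre/post-processing time is the first step before deciding whether moving NMS off the CPU will help.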
We do have application code examples and Raspberry Pi examples as a starting point. Have a look at our GitHub repositories.
Hey, I am facing the same issue. I am working on an iMX-8 Plus with a Hailo-8 and I am getting 22 FPS at an image size of 640×512, batch size = 1. It is a real-time inference device.
My goal is to reach 50 FPS too. How can I optimize it?
Is the batch size you mentioned related to a tiling technique, or does it simply mean multiple frames are processed together in one inference call?
In addition to what shashi mentioned, if you compiled without the highest compiler optimization level, this could be a limiting factor for the FPS you are seeing.
Hi @shashi @lawrence,
As my use case is real-time detection with a single camera only, I think from the above explanations that batch size is useless for me. @lawrence, I actually kept the performance flag disabled during compilation because it was causing a large accuracy drop.
Is there any way I can increase the FPS, even slightly, from my current situation?
The performance flag should not change accuracy at all. Make sure you are looking at the compiler optimization level and not the optimization level. The optimization level changes quantization settings; the compiler optimization level simply optimizes how the layers are split into contexts.
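For reference, in Hailo model scripts the compiler optimization level is typically raised with a line along these lines (treat this as a sketch; the exact syntax may differ by Dataflow Compiler version, so check the model script documentation for your release):

```
performance_param(compiler_optimization_level=max)
```

Because this only affects how layers are partitioned into contexts, it should change FPS without touching the quantization settings that hurt your accuracy.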