YOLOv11m FPS is too low?

Hi,

I am running YOLOv11m on a Hailo-8 in a real-time camera pipeline (25 FPS input),
using async inference with batch size = 1.

When measuring single-frame inference latency, I consistently see:

  • wait_for_async_ready(): ~80–88 ms
  • Total per-frame inference latency: ~85 ms
  • Effective FPS when waiting per frame: ~11–12 FPS

Other stages (buffer setup, callbacks, postprocess) are negligible (<1 ms).
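
For reference, this is essentially how I time it (a minimal sketch; infer_one_frame
is a placeholder for my real submit-plus-wait_for_async_ready() call). The numbers
are self-consistent: 1 / 0.085 s ≈ 11.8 FPS.

  import time

  def measure_fps(infer_one_frame, frames):
      # Strictly sequential submit-and-wait loop, batch size = 1:
      # each iteration blocks until the frame's result is ready.
      t0 = time.perf_counter()
      per_frame = []
      for frame in frames:
          t = time.perf_counter()
          infer_one_frame(frame)
          per_frame.append(time.perf_counter() - t)
      fps = len(frames) / (time.perf_counter() - t0)
      return fps, sum(per_frame) / len(per_frame)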

I don’t see this latency when I use smaller models such as YOLOv11s or YOLOv10s.

I understand that Hailo’s reported ~40–50 FPS for YOLOv11m likely refers to
pipelined throughput with multiple in-flight frames.

1- Is ~85 ms the expected inference latency for YOLOv11m?
2- Do the benchmark FPS numbers assume pipelined execution without waiting between frames?

Hi @Omer_Edemen ,

We built hailo_apps/python/pipeline_apps/detection_simple/detection_simple_pipeline.py in the hailo-ai/hailo-apps GitHub repository as a reference detection pipeline stripped down to the bare minimum, which allows performance benchmarking.
You can compare it to the full pipeline at hailo_apps/python/pipeline_apps/detection in the same repository,
and you can use the --dump-dot flag and then Graphviz to visualize the pipeline, as shown below.
Note that both are GStreamer pipelines, but that kind of analysis can be beneficial for gaining visibility into some deeper details.
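
For example (assuming the script is launched directly and that --dump-dot writes
pipeline.dot to the working directory; adjust paths for your setup):

  python detection_simple_pipeline.py --dump-dot
  dot -Tpng -o output.png pipeline.dot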

Hi @Michael ,

I used your object detection code written in C++. I get 11-12 FPS and 80-85 ms latency with YOLOv11m as well.

Please have a look at the following post I wrote a while back for some basic insights.

Hailo Community - My model runs slower than expected

You can measure the latency using the HailoRT CLI run command.

hailortcli run model.hef --measure-latency
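
Depending on your HailoRT version, hailortcli run may also support --measure-overall-latency, which includes host-side transfer time rather than only the on-chip latency; check hailortcli run --help for the flags available in your installation.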

If you are using the YOLOv11m HEF from the Model Zoo, the expected throughput is around 50 FPS on a Hailo-8.

Mostly yes.

Benchmark FPS numbers assume that the application feeds frames asynchronously and does not block on the result of each frame. When a network is compiled into a single context, all layers are executed in parallel, and achieving maximum FPS requires the application to keep the pipeline full.
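
To make that concrete, here is a toy simulation (plain Python, not HailoRT code) that models the accelerator as a unit that can hold several frames in flight. By Little's law, ~50 FPS at ~85 ms latency implies roughly 50 × 0.085 ≈ 4 frames on the device at once, so the simulation assumes a depth of 4:

  import time
  from concurrent.futures import ThreadPoolExecutor

  LATENCY_S = 0.085  # ~85 ms single-frame latency, as measured above
  IN_FLIGHT = 4      # assumed pipeline depth: 50 FPS * 0.085 s ~= 4 frames
  N_FRAMES = 100

  def infer(frame):
      time.sleep(LATENCY_S)  # stand-in for on-device inference
      return frame

  with ThreadPoolExecutor(max_workers=IN_FLIGHT) as pool:
      # Blocking: wait for each result before submitting the next frame.
      t0 = time.perf_counter()
      for i in range(N_FRAMES):
          pool.submit(infer, i).result()
      blocking_fps = N_FRAMES / (time.perf_counter() - t0)

      # Pipelined: queue everything; the pool keeps IN_FLIGHT frames running.
      t0 = time.perf_counter()
      for f in [pool.submit(infer, i) for i in range(N_FRAMES)]:
          f.result()
      pipelined_fps = N_FRAMES / (time.perf_counter() - t0)

  print(f"blocking:  {blocking_fps:.1f} FPS")   # ~11-12 FPS
  print(f"pipelined: {pipelined_fps:.1f} FPS")  # ~47 FPS

The per-frame latency is identical in both runs; only the submission pattern changes, and that alone accounts for the gap between ~12 FPS and ~50 FPS.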

For networks compiled into multiple contexts, this matters less, as context switching introduces a significant overhead either way, which limits achievable throughput regardless of application behavior.

If you’re measuring end-to-end latency (including preprocessing, postprocessing, or CPU waits), values can be significantly higher than the accelerator-only latency.