Bottlenecks using multiple Hailo-8 modules

I’m integrating multiple m.2 modules in a system and trying to understand the bottlenecks.

A PCIe Gen3 switch is used to attach four m.2 slots (4 lanes each) to the CPU via 8 upstream lanes, like so:

CPU => PCIe3 8x => PCIe3 Switch |=> PCIe3 4x m.2
                                |=> PCIe3 4x m.2
                                |=> PCIe3 4x m.2
                                |=> PCIe3 4x m.2
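
As a sanity check, the negotiated speed and width of each hop can be inspected with standard Linux tools (the bus address below is a placeholder; pick the Hailo devices and the switch uplink from the tree output):

lspci -tv                                            # locate the switch and the four Hailo devices
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'  # compare advertised vs. negotiated speed/width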

1. Test with yolov5m_wo_spp_60p

Running hailo run yolov5m_wo_spp_60p.hef on a single Hailo-8 module gives me the same performance numbers as advertised:

ubuntu@dev:~/HAILO$ hailo run yolov5m_wo_spp_60p.hef
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_wo_spp_60p.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov5m_wo_spp_60p/yolov5m_wo_spp_60p: 100% | 1088 | FPS: 217.33 | ETA: 00:00:00
> Inference result:
 Network group: yolov5m_wo_spp_60p
    Frames count: 1088
    FPS: 217.34
    Send Rate: 2136.58 Mbit/s
    Recv Rate: 3759.88 Mbit/s

However, when using multiple devices, performance stops scaling at the third device:

--device-count 1: 217 fps @5.5W :white_check_mark:
--device-count 2: 434 fps @5.5W :white_check_mark:
--device-count 3: 540 fps @4.7W :x:
--device-count 4: 547 fps @3.8W :x:

ubuntu@dev:~$ hailo run yolov5m_wo_spp_60p.hef --device-count 2
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_wo_spp_60p.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov5m_wo_spp_60p/yolov5m_wo_spp_60p: 100% | 2176 | FPS: 434.20 | ETA: 00:00:00
> Inference result:
 Network group: yolov5m_wo_spp_60p
    Frames count: 2176
    FPS: 434.22
    Send Rate: 4268.51 Mbit/s
    Recv Rate: 7511.58 Mbit/s

ubuntu@dev:~$ hailo run yolov5m_wo_spp_60p.hef --device-count 3
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_wo_spp_60p.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov5m_wo_spp_60p/yolov5m_wo_spp_60p: 100% | 2708 | FPS: 540.27 | ETA: 00:00:00
> Inference result:
 Network group: yolov5m_wo_spp_60p
    Frames count: 2708
    FPS: 540.30
    Send Rate: 5311.38 Mbit/s
    Recv Rate: 9346.78 Mbit/s

ubuntu@dev:~$ hailo run yolov5m_wo_spp_60p.hef --device-count 4
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_wo_spp_60p.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov5m_wo_spp_60p/yolov5m_wo_spp_60p: 100% | 2745 | FPS: 547.60 | ETA: 00:00:00
> Inference result:
 Network group: yolov5m_wo_spp_60p
    Frames count: 2745
    FPS: 547.62
    Send Rate: 5383.35 Mbit/s
    Recv Rate: 9473.44 Mbit/s

I first suspected PCIe bandwidth, but with 8 upstream lanes that's roughly 64 Gbit/s, and we're reaching at most 14.8 Gbit/s here (send + receive combined)…
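
For reference, the rough numbers behind that estimate (textbook PCIe Gen3 figures, not measured on this board):

PCIe Gen3: 8 GT/s per lane, 128b/130b encoding
=> 8 GT/s * 128/130 ≈ 7.88 Gbit/s usable per lane, per direction
=> x8 uplink ≈ 63 Gbit/s per direction
Worst case above (4 devices): 5383.35 + 9473.44 Mbit/s ≈ 14.86 Gbit/s combined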

2. Test with yolov5m_vehicles

In comparison, yolov5m_vehicles does scale as expected:
=> 80 fps * 4 = 320 fps

The obvious difference is the very asymmetric bus usage:

ubuntu@dev:~$ hailo run yolov5m_vehicles.hef --device-count 1
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_vehicles.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov5m_vehicles/yolov5m_vehicles: 100% | 401 | FPS: 80.10 | ETA: 00:00:00
> Inference result:
 Network group: yolov5m_vehicles
    Frames count: 401
    FPS: 80.10
    Send Rate: 3986.53 Mbit/s
    Recv Rate: 97.82 Mbit/s

ubuntu@dev:~$ hailo run yolov5m_vehicles.hef --device-count 4
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_vehicles.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov5m_vehicles/yolov5m_vehicles: 100% | 1604 | FPS: 319.32 | ETA: 00:00:00
> Inference result:
 Network group: yolov5m_vehicles
    Frames count: 1604
    FPS: 319.34
    Send Rate: 15892.29 Mbit/s
    Recv Rate: 389.95 Mbit/s

Any clue what could cause this and where to dig further?

Solved: It seems to have been a CPU-related PCIe bottleneck.
I initially used a rather old 5th-gen Intel Xeon. After switching to an 11th-gen CPU, I see the expected linear FPS scaling for all models.
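
For anyone debugging something similar: the kernel logs a warning at boot when a device's available PCIe bandwidth is capped by a slower upstream link (exact wording varies by kernel version), so this is a quick first check:

dmesg | grep -i 'limited by'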

Great job! You're absolutely right: with Hailo devices, the bottleneck is often not the device itself but the PCIe link speed and bandwidth.

To improve performance, you can also try running the models with the --batch-size option, which may help you achieve higher FPS.
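
For example, something like this (batch size 8 is just an arbitrary starting point, not a tuned value):

hailo run yolov5m_wo_spp_60p.hef --device-count 4 --batch-size 8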