Bottlenecks using multiple Hailo-8 modules

I’m integrating multiple m.2 modules in a system and trying to understand the bottlenecks.

A PCIe Gen3 switch is used to attach four m.2 slots (4 lanes each) to the CPU via 8 upstream lanes, like so:

CPU => PCIe3 8x => PCIe3 Switch |=> PCIe3 4x m.2
                                |=> PCIe3 4x m.2
                                |=> PCIe3 4x m.2
                                |=> PCIe3 4x m.2
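
As a sanity check, the negotiated speed and width of each hop can be inspected with standard Linux tools (the bus address below is a placeholder; pick the Hailo devices and the switch uplink from the tree output):

lspci -tv                                            # locate the switch and the four Hailo devices
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'  # compare advertised vs. negotiated speed/width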

1. Test with yolov5m_wo_spp_60p

Running hailo run yolov5m_wo_spp_60p.hef on a single Hailo-8 module gives me the same performance numbers as advertised:

ubuntu@dev:~/HAILO$ hailo run yolov5m_wo_spp_60p.hef
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_wo_spp_60p.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov5m_wo_spp_60p/yolov5m_wo_spp_60p: 100% | 1088 | FPS: 217.33 | ETA: 00:00:00
> Inference result:
 Network group: yolov5m_wo_spp_60p
    Frames count: 1088
    FPS: 217.34
    Send Rate: 2136.58 Mbit/s
    Recv Rate: 3759.88 Mbit/s

However, when using multiple devices, performance stops scaling at the third device:

--device-count 1: 217 fps @5.5W :white_check_mark:
--device-count 2: 434 fps @5.5W :white_check_mark:
--device-count 3: 540 fps @4.7W :x:
--device-count 4: 547 fps @3.8W :x:

ubuntu@dev:~$ hailo run yolov5m_wo_spp_60p.hef --device-count 2
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_wo_spp_60p.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov5m_wo_spp_60p/yolov5m_wo_spp_60p: 100% | 2176 | FPS: 434.20 | ETA: 00:00:00
> Inference result:
 Network group: yolov5m_wo_spp_60p
    Frames count: 2176
    FPS: 434.22
    Send Rate: 4268.51 Mbit/s
    Recv Rate: 7511.58 Mbit/s

ubuntu@dev:~$ hailo run yolov5m_wo_spp_60p.hef --device-count 3
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_wo_spp_60p.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov5m_wo_spp_60p/yolov5m_wo_spp_60p: 100% | 2708 | FPS: 540.27 | ETA: 00:00:00
> Inference result:
 Network group: yolov5m_wo_spp_60p
    Frames count: 2708
    FPS: 540.30
    Send Rate: 5311.38 Mbit/s
    Recv Rate: 9346.78 Mbit/s

ubuntu@dev:~$ hailo run yolov5m_wo_spp_60p.hef --device-count 4
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_wo_spp_60p.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov5m_wo_spp_60p/yolov5m_wo_spp_60p: 100% | 2745 | FPS: 547.60 | ETA: 00:00:00
> Inference result:
 Network group: yolov5m_wo_spp_60p
    Frames count: 2745
    FPS: 547.62
    Send Rate: 5383.35 Mbit/s
    Recv Rate: 9473.44 Mbit/s

I first suspected PCIe bandwidth, but with 8 upstream lanes that's roughly 64 Gbit/s, and we're reaching at most 14.8 Gbit/s here (send + receive combined)…
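
For reference, the rough numbers behind that estimate (textbook PCIe Gen3 figures, not measured on this board):

PCIe Gen3: 8 GT/s per lane, 128b/130b encoding
=> 8 GT/s * 128/130 ≈ 7.88 Gbit/s usable per lane, per direction
=> x8 uplink ≈ 63 Gbit/s per direction
Worst case above (4 devices): 5383.35 + 9473.44 Mbit/s ≈ 14.86 Gbit/s combined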

2. Test with yolov5m_vehicles

In comparison, yolov5m_vehicles does scale as expected:
=> 80 fps * 4 = 320 fps

The obvious difference is the very asymmetric bus usage:

ubuntu@dev:~$ hailo run yolov5m_vehicles.hef --device-count 1
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_vehicles.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov5m_vehicles/yolov5m_vehicles: 100% | 401 | FPS: 80.10 | ETA: 00:00:00
> Inference result:
 Network group: yolov5m_vehicles
    Frames count: 401
    FPS: 80.10
    Send Rate: 3986.53 Mbit/s
    Recv Rate: 97.82 Mbit/s

ubuntu@dev:~$ hailo run yolov5m_vehicles.hef --device-count 4
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_vehicles.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov5m_vehicles/yolov5m_vehicles: 100% | 1604 | FPS: 319.32 | ETA: 00:00:00
> Inference result:
 Network group: yolov5m_vehicles
    Frames count: 1604
    FPS: 319.34
    Send Rate: 15892.29 Mbit/s
    Recv Rate: 389.95 Mbit/s

Any clue what could cause this and where to dig further?

Solved: It seems to have been a CPU-related PCIe bottleneck.
I initially used a rather old 5th-gen Intel Xeon. After switching to an 11th-gen CPU, I see the expected linear FPS scaling for all models.
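
For anyone debugging something similar: the kernel logs a warning at boot when a device's available PCIe bandwidth is capped by a slower upstream link (exact wording varies by kernel version), so this is a quick first check:

dmesg | grep -i 'limited by'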

Great job! You're absolutely right: with Hailo devices, the bottleneck is often not the device itself but the PCIe link speed and bandwidth.

To improve performance, you can also try running the models with the --batch-size option, which may help you achieve higher FPS.
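
For example, something like this (batch size 8 is just an arbitrary starting point, not a tuned value):

hailo run yolov5m_wo_spp_60p.hef --device-count 4 --batch-size 8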