I’m integrating multiple M.2 modules in a system and trying to understand the bottlenecks.
A PCIe Gen3 switch attaches four M.2 slots (4 lanes each) to the CPU via 8 upstream lanes, like so:
CPU => PCIe3 8x => PCIe3 Switch |=> PCIe3 4x M.2
                                |=> PCIe3 4x M.2
                                |=> PCIe3 4x M.2
                                |=> PCIe3 4x M.2
1. Test with yolov5m_wo_spp_60p
Running hailo run yolov5m_wo_spp_60p.hef on a single Hailo-8 module gives me the same performance numbers as advertised:
ubuntu@dev:~/HAILO$ hailo run yolov5m_wo_spp_60p.hef
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_wo_spp_60p.hef):
Transform data: true
Type: auto
Quantized: true
Network yolov5m_wo_spp_60p/yolov5m_wo_spp_60p: 100% | 1088 | FPS: 217.33 | ETA: 00:00:00
> Inference result:
Network group: yolov5m_wo_spp_60p
Frames count: 1088
FPS: 217.34
Send Rate: 2136.58 Mbit/s
Recv Rate: 3759.88 Mbit/s
However, when using multiple devices the performance stops scaling with the third device:
--device-count 1: 217 fps @5.5W
--device-count 2: 434 fps @5.5W
--device-count 3: 540 fps @4.7W
--device-count 4: 547 fps @3.8W
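To quantify the drop-off, here is the scaling efficiency relative to perfect linear scaling, using the FPS numbers from my runs:

```python
# Scaling efficiency per device count (FPS values taken from hailortcli output).
single = 217.34  # FPS of one device
results = {1: 217.34, 2: 434.22, 3: 540.30, 4: 547.62}

for n, fps in results.items():
    # Efficiency = measured FPS / (n * single-device FPS)
    print(f"{n} device(s): {fps:7.2f} fps, efficiency {fps / (n * single):.0%}")
```

Two devices scale essentially perfectly; the third already drops to ~83 % and the fourth to ~63 %.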
ubuntu@dev:~$ hailo run yolov5m_wo_spp_60p.hef --device-count 2
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_wo_spp_60p.hef):
Transform data: true
Type: auto
Quantized: true
Network yolov5m_wo_spp_60p/yolov5m_wo_spp_60p: 100% | 2176 | FPS: 434.20 | ETA: 00:00:00
> Inference result:
Network group: yolov5m_wo_spp_60p
Frames count: 2176
FPS: 434.22
Send Rate: 4268.51 Mbit/s
Recv Rate: 7511.58 Mbit/s
ubuntu@dev:~$ hailo run yolov5m_wo_spp_60p.hef --device-count 3
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_wo_spp_60p.hef):
Transform data: true
Type: auto
Quantized: true
Network yolov5m_wo_spp_60p/yolov5m_wo_spp_60p: 100% | 2708 | FPS: 540.27 | ETA: 00:00:00
> Inference result:
Network group: yolov5m_wo_spp_60p
Frames count: 2708
FPS: 540.30
Send Rate: 5311.38 Mbit/s
Recv Rate: 9346.78 Mbit/s
ubuntu@dev:~$ hailo run yolov5m_wo_spp_60p.hef --device-count 4
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_wo_spp_60p.hef):
Transform data: true
Type: auto
Quantized: true
Network yolov5m_wo_spp_60p/yolov5m_wo_spp_60p: 100% | 2745 | FPS: 547.60 | ETA: 00:00:00
> Inference result:
Network group: yolov5m_wo_spp_60p
Frames count: 2745
FPS: 547.62
Send Rate: 5383.35 Mbit/s
Recv Rate: 9473.44 Mbit/s
I first suspected the PCIe bandwidth, but with 8 lanes upstream that’s roughly 64 Gbit/s raw (~63 Gbit/s usable after 128b/130b encoding), and we’re reaching at most ~14.9 Gbit/s here (send + receive combined)…
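For reference, my back-of-the-envelope bandwidth check (assuming standard PCIe Gen3 line rate and 128b/130b encoding; the observed numbers are the 4-device send/recv rates from above):

```python
# Rough PCIe Gen3 upstream-bandwidth sanity check.
GT_PER_LANE = 8.0          # PCIe Gen3: 8 GT/s per lane
ENCODING = 128 / 130       # 128b/130b line-encoding overhead
LANES_UPSTREAM = 8

usable_gbit = GT_PER_LANE * ENCODING * LANES_UPSTREAM
print(f"usable upstream: {usable_gbit:.1f} Gbit/s")    # ~63.0 Gbit/s

# Observed peak with 4 devices (send + recv, from the hailortcli output):
observed_gbit = (5383.35 + 9473.44) / 1000
print(f"observed peak:   {observed_gbit:.1f} Gbit/s")  # ~14.9 Gbit/s
print(f"utilisation:     {observed_gbit / usable_gbit:.0%}")
```

So the link is only ~24 % utilised, which is why raw upstream bandwidth alone doesn’t explain the plateau.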
2. Test with yolov5m_vehicles
In comparison, yolov5m_vehicles does scale as expected:
=> 80 fps * 4 = 320 fps
The obvious difference is the very asymmetric bus usage:
ubuntu@dev:~$ hailo run yolov5m_vehicles.hef --device-count 1
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_vehicles.hef):
Transform data: true
Type: auto
Quantized: true
Network yolov5m_vehicles/yolov5m_vehicles: 100% | 401 | FPS: 80.10 | ETA: 00:00:00
> Inference result:
Network group: yolov5m_vehicles
Frames count: 401
FPS: 80.10
Send Rate: 3986.53 Mbit/s
Recv Rate: 97.82 Mbit/s
ubuntu@dev:~$ hailo run yolov5m_vehicles.hef --device-count 4
(hailo) Running command 'run' with 'hailortcli'
Running streaming inference (yolov5m_vehicles.hef):
Transform data: true
Type: auto
Quantized: true
Network yolov5m_vehicles/yolov5m_vehicles: 100% | 1604 | FPS: 319.32 | ETA: 00:00:00
> Inference result:
Network group: yolov5m_vehicles
Frames count: 1604
FPS: 319.34
Send Rate: 15892.29 Mbit/s
Recv Rate: 389.95 Mbit/s
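To make the asymmetry concrete, here is the per-device bus usage of the two networks side by side (single-device send/recv rates taken from the runs above):

```python
# Per-device bus usage of both networks (Mbit/s, single-device runs).
nets = {
    "yolov5m_wo_spp_60p": (2136.58, 3759.88),  # (send, recv)
    "yolov5m_vehicles":   (3986.53, 97.82),
}

for name, (send, recv) in nets.items():
    total = send + recv
    print(f"{name}: {total / 1000:.2f} Gbit/s total per device, "
          f"recv/send ratio {recv / send:.2f}")
```

yolov5m_wo_spp_60p receives ~1.76x as much data as it sends (large output tensors flowing back upstream), while yolov5m_vehicles receives almost nothing (ratio ~0.02). Only the recv-heavy network stops scaling, which makes me suspect upstream (device-to-host) traffic through the switch rather than total bandwidth.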
Any clue what could cause this and where to dig further?