Hailo-8 on Pi 5 with PCIe Gen2 and Gen3 HATs: no inference speed difference

Hi all,

Looking for a sense check as I believe I am missing something obvious.

Two Pi5s: one with the Raspberry Pi M.2 PCIe2 HAT, one with the Seeed dual M.2 PCIe3 HAT.

I ran a test model with the following lines enabled and then disabled under [all] in the boot config.txt:

dtparam=pciex
dtparam=pciex1_gen=3

I saw the following benchmarks.

Note that in our case we cannot batch because we run on real-time video, so there is a known bottleneck there. To me it looks like we are not saturating the link even on Gen2 because of this bottleneck, which would explain why switching between Gen2 and Gen3, and between the two M.2 HATs, shows no FPS difference.

Pi5 running Raspberry Pi m.2 HAT (not PCIe 3 enabled)

PCIe2

robdupre@PI5-16GB:~$ hailortcli run2 -t 20 set-net model.hef
[HailoRT CLI] [warning] "hailortcli run2" is not optimized for single model usage. It is recommended to use "hailortcli run" command for a single model
[===================>] 100% 00:00:00
joined_preprocess_nn: fps: 352.69

PCIe3 (expected no change)

robdupre@PI5-16GB:~$ hailortcli run2 -t 20 set-net model.hef
[HailoRT CLI] [warning] "hailortcli run2" is not optimized for single model usage. It is recommended to use "hailortcli run" command for a single model
[===================>] 100% 00:00:00
joined_preprocess_nn: fps: 351.54

Pi5 running Seeed Dual PCIe3 m.2 HAT

PCIe2

vcacore@PI5:~$ hailortcli run2 -t 20 set-net model.hef
[HailoRT CLI] [warning] "hailortcli run2" is not optimized for single model usage. It is recommended to use "hailortcli run" command for a single model
[===================>] 100% 00:00:00
joined_preprocess_nn: fps: 352.69

PCIe3

vcacore@PI5:~$ hailortcli run2 -t 20 set-net model.hef
[HailoRT CLI] [warning] "hailortcli run2" is not optimized for single model usage. It is recommended to use "hailortcli run" command for a single model
[===================>] 100% 00:00:00
joined_preprocess_nn: fps: 350.79

Hi @rob.dupre,

Welcome to the Hailo Community!

First of all, please run the sudo lspci -vvv command.
You should see the Hailo device listed as Co-processor: Hailo Technologies Ltd.
Check the LnkSta field of that device.
With the PCIe 2.0 configuration, you should see:

LnkSta:	Speed 5GT/s (downgraded), Width x1 (downgraded)

With PCIe 3.0 configuration, you should see instead:

LnkSta:	Speed 8GT/s, Width x1 (downgraded)
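If you want to automate this check, here is a minimal sketch that pulls the negotiated speed and width out of a LnkSta line. The function name and regular expression are my own, written against the two sample lines above; it only looks at the first LnkSta line in the text it is given, so pass it the Hailo device's section of the lspci output rather than the whole dump.

```python
import re

# Matches lines such as "LnkSta: Speed 8GT/s, Width x1 (downgraded)".
# The optional "(downgraded)" note after the speed is skipped.
LNKSTA_RE = re.compile(
    r"LnkSta:\s*Speed\s+([\d.]+)GT/s(?:\s*\([^)]*\))?,\s*Width\s+x(\d+)"
)

# Per-lane transfer rate (GT/s) mapped to the PCIe generation.
GEN_BY_SPEED = {2.5: 1, 5.0: 2, 8.0: 3}


def parse_lnksta(text):
    """Return (speed_gt_s, generation, width) for the first LnkSta line, or None."""
    match = LNKSTA_RE.search(text)
    if match is None:
        return None
    speed = float(match.group(1))
    width = int(match.group(2))
    # generation is None if the speed is not one of the known rates
    return speed, GEN_BY_SPEED.get(speed), width


if __name__ == "__main__":
    gen2_line = "LnkSta:\tSpeed 5GT/s (downgraded), Width x1 (downgraded)"
    gen3_line = "LnkSta:\tSpeed 8GT/s, Width x1 (downgraded)"
    print(parse_lnksta(gen2_line))
    print(parse_lnksta(gen3_line))
```
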

After checking the negotiated link speed and making sure that it is correct, let’s analyze the model.
PCIe bandwidth does not affect the performance of every model. As you may know, Hailo-8 supports up to PCIe Gen 3.0 x4, but only one lane is used on the Raspberry Pi.
Having higher PCIe bandwidth is beneficial in some cases, for example:

  • The model has large input/output feature maps
  • The model has been allocated into multiple contexts, which require dynamic reconfiguration at runtime
  • You are switching between multiple models
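To put rough numbers on "higher PCIe bandwidth", here is a small sketch of the theoretical link throughput per generation and lane count. It only accounts for the line encoding (8b/10b for Gen 1 and Gen 2, 128b/130b for Gen 3); packet and protocol overhead are ignored, so the throughput you actually get will be lower. The function name is just for illustration.

```python
# generation -> (transfer rate in GT/s per lane, line-encoding efficiency)
ENCODING = {
    1: (2.5, 8 / 10),     # 8b/10b
    2: (5.0, 8 / 10),     # 8b/10b
    3: (8.0, 128 / 130),  # 128b/130b
}


def link_throughput_mb_s(gen, lanes=1):
    """Theoretical one-direction throughput in MB/s, before protocol overhead."""
    gt_s, efficiency = ENCODING[gen]
    # GT/s * efficiency = usable Gbit/s per lane; /8 -> GB/s; *1000 -> MB/s
    return gt_s * efficiency / 8 * 1000 * lanes


if __name__ == "__main__":
    for gen, lanes in [(2, 1), (3, 1), (3, 4)]:
        rate = link_throughput_mb_s(gen, lanes)
        print(f"Gen {gen} x{lanes}: {rate:.0f} MB/s")
```

On a single lane this gives roughly 500 MB/s for Gen 2 and 985 MB/s for Gen 3, which is the gap the cases listed above can benefit from.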

Given that your model runs at 351 FPS, it is likely that it was allocated in a single context and sits entirely on the device at runtime (no need for device reconfiguration). You can verify this using hailortcli parse-hef <HEF-PATH>.
I would suggest repeating the test using a multi-context model from the Hailo Model Zoo, e.g. YOLOv8m.
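If you need to script the single-context check, a minimal sketch could look like this. It only relies on the "Network group name: ..., Single Context" line as it appears in the parse-hef output posted in this thread; the exact wording for multi-context models is not shown here, so anything that is not marked "Single Context" is simply reported as not single. The function name is hypothetical.

```python
def is_single_context(parse_hef_output):
    """Return True/False from the first 'Network group name' line, or None if absent."""
    for line in parse_hef_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Network group name:"):
            # Assumption: single-context models carry the literal text "Single Context"
            return "Single Context" in stripped
    return None


if __name__ == "__main__":
    sample = (
        "Architecture HEF was compiled for: HAILO8\n"
        "Network group name: joined_preprocess_nn, Single Context\n"
        "    Network name: joined_preprocess_nn/nn_preprocess\n"
    )
    print(is_single_context(sample))
```
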

Hi, thanks for the reply!

I will add the additional info as asked:

vcacore@ABP100:~$  hailortcli parse-hef model.hef
Architecture HEF was compiled for: HAILO8
Network group name: joined_preprocess_nn, Single Context
    Network name: joined_preprocess_nn/nn_preprocess

Input is 672x627. Model info is cropped as it is a commercial model
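For a rough idea of how much input data this pushes over the link, here is a back-of-the-envelope sketch. The channel count and bytes per value are assumptions, since the model details are cropped; output traffic is not counted at all. The result can be compared against the theoretical single-lane rates of roughly 500 MB/s (Gen 2) and 985 MB/s (Gen 3), both before protocol overhead.

```python
def input_rate_mb_s(width, height, channels, fps, bytes_per_value=1):
    """Input data rate in MB/s for frames sent one at a time (no batching)."""
    bytes_per_frame = width * height * channels * bytes_per_value
    return bytes_per_frame * fps / 1e6


if __name__ == "__main__":
    # 672x627 and ~352.7 FPS come from this thread.
    # channels and bytes_per_value are ASSUMED, not known for this model.
    for channels in (1, 3):
        rate = input_rate_mb_s(672, 627, channels, 352.7)
        print(f"{channels} channel(s), 1 byte per value: {rate:.1f} MB/s")
```
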

The model is static and cannot be changed, and we are broadly happy with the performance, but we wanted to fully understand how the PCIe link speed relates to the module's throughput.

I have rerun the tests with the Seeed dual PCIe 3.0 M.2 HAT (only a single slot used), adding the lspci -vvv output for clarity, with different /boot/firmware/config.txt configs.

Pi5 running Seeed Dual PCIe 3.0 m.2 HAT

PCIe2 (default config, no dtparam lines)

vcacore@ABP100:~$ sudo lspci -vvv
		LnkSta:	Speed 5GT/s, Width x4
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

/boot/firmware/config.txt

[all]

vcacore@ABP100:~$ hailortcli run2 -t 20 set-net model.hef
[HailoRT CLI] [warning] "hailortcli run2" is not optimized for single model usage. It is recommended to use "hailortcli run" command for a single model
[===================>] 100% 00:00:00
joined_preprocess_nn: fps: 352.70

PCIe2

vcacore@ABP100:~$ sudo lspci -vvv
		LnkSta:	Speed 8GT/s, Width x1 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

/boot/firmware/config.txt

[all]
dtparam=pciex
dtparam=pciex1_gen=2

vcacore@ABP100:~$ hailortcli run2 -t 20 set-net model.hef
[HailoRT CLI] [warning] "hailortcli run2" is not optimized for single model usage. It is recommended to use "hailortcli run" command for a single model
[===================>] 100% 00:00:00
joined_preprocess_nn: fps: 352.69

PCIe3

vcacore@ABP100:~$ sudo lspci -vvv
		LnkSta:	Speed 8GT/s, Width x1 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

/boot/firmware/config.txt

[all]
dtparam=pciex
dtparam=pciex1_gen=3

vcacore@ABP100:~$ hailortcli run2 -t 20 set-net model.hef
[HailoRT CLI] [warning] "hailortcli run2" is not optimized for single model usage. It is recommended to use "hailortcli run" command for a single model
[===================>] 100% 00:00:00
joined_preprocess_nn: fps: 352.79
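To quantify how close these three runs are, here is a quick sketch that computes the relative spread of the FPS figures above. The dictionary keys are only labels for the three config.txt variants, and the helper name is made up for this post.

```python
from statistics import mean

# FPS values copied from the three runs above, keyed by config.txt variant
fps_by_config = {
    "default": 352.70,
    "pciex1_gen=2": 352.69,
    "pciex1_gen=3": 352.79,
}


def relative_spread_percent(values):
    """(max - min) as a percentage of the mean."""
    values = list(values)
    return (max(values) - min(values)) / mean(values) * 100


if __name__ == "__main__":
    spread = relative_spread_percent(fps_by_config.values())
    print(f"Spread across configs: {spread:.3f}% of mean FPS")
```

The spread comes out well under 0.1% of the mean.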

@rob.dupre The configuration seems correct.
The model is single context, therefore it is entirely loaded onto the device at runtime (no dynamic reconfiguration involved). PCIe is likely not the bottleneck in this case, which is why you see no difference between PCIe 2.0 and PCIe 3.0. For testing purposes, if you repeat the measurements with a multi-context model from the Model Zoo, you may notice a difference between PCIe 2.0 and PCIe 3.0.

It seems that this model can run at most at 352 FPS due to the way it was compiled.

If you want to increase the FPS (this is unrelated to the PCIe 2.0/3.0 configuration), you can try compiling the model in performance mode by adding the following command to the model script:

performance_param(compiler_optimization_level=max)

The compiler will run an exhaustive search to try to increase the FPS of the model, but compilation time will increase as well.

Thanks!

We will do some further digging!

Hi @rob.dupre
When the RPi detects the Hailo AI HAT+, it automatically sets its PCIe connection to Gen3. (The GPIO header must be connected, as the Pi reads the HAT’s EEPROM using the I2C interface).

Note that this is not the case for the AI Kit. In this scenario, the RPi does not recognize that it is communicating with a Hailo chip, so you must manually set the PCIe link to Gen3.