Poor performance of Hailo-8L and RPi5

Hello guys,

Could you please advise me on my issue? First of all, I’ve read:

and

Here are the outputs:

hailortcli run yolov6n.hef --batch-size 1
Running streaming inference (yolov6n.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov6n/yolov6n: 100% | 1408 | FPS: 281.26 | ETA: 00:00:00
> Inference result:
 Network group: yolov6n
    Frames count: 1408
    FPS: 281.26
    Send Rate: 2764.91 Mbit/s
    Recv Rate: 1621.87 Mbit/s

Expected 355 according to model zoo.

hailortcli run yolov6n.hef --batch-size 8
Running streaming inference (yolov6n.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov6n/yolov6n: 100% | 1408 | FPS: 281.24 | ETA: 00:00:00
> Inference result:
 Network group: yolov6n
    Frames count: 1408
    FPS: 281.24
    Send Rate: 2764.74 Mbit/s
    Recv Rate: 1621.77 Mbit/s

Expected 355 according to model zoo.

hailortcli run yolov7.hef --batch-size 1
Running streaming inference (yolov7.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov7/yolov7: 100% | 45 | FPS: 9.00 | ETA: 00:00:00
> Inference result:
 Network group: yolov7
    Frames count: 45
    FPS: 9.00
    Send Rate: 88.46 Mbit/s
    Recv Rate: 155.67 Mbit/s

Expected 25 according to model zoo.

hailortcli run yolov7.hef --batch-size 8
Running streaming inference (yolov7.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov7/yolov7: 100% | 78 | FPS: 15.58 | ETA: 00:00:00
> Inference result:
 Network group: yolov7
    Frames count: 78
    FPS: 15.58
    Send Rate: 153.12 Mbit/s
    Recv Rate: 269.45 Mbit/s

Expected 35 according to model zoo.

sudo lspci -vv

0000:01:00.0 Co-processor: Hailo Technologies Ltd. Hailo-8 AI Processor (rev 01)
	Subsystem: Hailo Technologies Ltd. Hailo-8 AI Processor
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 185
	Region 0: Memory at 1800000000 (64-bit, prefetchable) [size=16K]
	Region 2: Memory at 1800008000 (64-bit, prefetchable) [size=4K]
	Region 4: Memory at 1800004000 (64-bit, prefetchable) [size=16K]
	Capabilities: [80] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <2us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s (downgraded), Width x1 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [e0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 000000ffffffe000  Data: 0008
	Capabilities: [f8] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
	Capabilities: [108 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [110 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=26016ns
		L1SubCtl2: T_PwrOn=10us
	Capabilities: [128 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [200 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [300 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Kernel driver in use: hailo
	Kernel modules: hailo_pci

What could be the cause of my slowdowns? Thanks for any help.

There are two points.

Your Raspberry Pi is set to PCIe Gen 2. Please follow this guide to upgrade it to PCIe Gen 3.

How to upgrade to PCIe gen-3 in RPi5 with Hailo-8L M.2
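
(For reference: on recent Raspberry Pi OS images this typically amounts to adding one line to /boot/firmware/config.txt and rebooting, as sketched below; the guide above has the exact steps for your setup.)

    # /boot/firmware/config.txt - request a Gen 3 link on the RPi5 PCIe connector
    dtparam=pciex1_gen=3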

The second point: the numbers in the Model Zoo are for a Hailo device connected to a system with 4 PCIe lanes and an x86 CPU. Your Raspberry Pi only has a single lane.

Especially for multi-context networks, PCIe bandwidth will influence the maximum FPS you can reach.
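
After applying the change and rebooting, you can verify the negotiated link directly from Linux. Below is a minimal sketch that reads the standard PCIe sysfs attributes; the device address 0000:01:00.0 is taken from your lspci output, so adjust it if yours differs. A Gen 3 link reports 8.0 GT/s, while 5.0 GT/s means the link is still at Gen 2.

    # Minimal sketch: print the negotiated PCIe link speed and width from sysfs.
    # Assumes the Hailo device sits at 0000:01:00.0, as in the lspci output above.
    from pathlib import Path

    dev = Path("/sys/bus/pci/devices/0000:01:00.0")
    for attr in ("current_link_speed", "current_link_width",
                 "max_link_speed", "max_link_width"):
        print(attr, "=", (dev / attr).read_text().strip())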

@klausk many thanks for your time. I’ve read somewhere on the forum that anything below 5 GT/s means the PCIe link is set to Gen 2; since mine was at 5 GT/s, I thought I was already on PCIe Gen 3.

After changing it to Gen 3 it works way faster, thank you.

May I ask one more thing: how do you measure the performance (FPS) in hailortcli?

I am measuring it like this (Python):

    inference_start = time.time()

    bindings.input().set_buffer(np.array(preprocessed_image))
    configured_infer_model.run([bindings], 1000)
    buffer = bindings.output().get_buffer()

    inference_end = time.time()
    inference_time = inference_end - inference_start

I am getting 0.011 s, which translates to about 90 FPS (for yolov6n.hef), which is nowhere near the 281 FPS from hailortcli run yolov6n.hef --batch-size 1 on my machine.

Even if I measure only the inference time:

    inference_start = time.time()

    configured_infer_model.run([bindings], 1000)

    inference_end = time.time()
    inference_time = inference_end - inference_start

I am getting 0.008 s, so around 125 FPS.

We push as many frames into the device as possible for a few seconds and count them.

For Hailo devices, FPS is not the inverse of latency. As soon as the first layer finishes computing the last row of an image, it can start working on the next frame. It is a true pipeline, which allows higher throughput.

This of course requires the software to work on the input and output streams independently. If your software blocks by calling infer and waiting for the result in a single thread, you will not reach the maximum throughput. Reaching it requires two independent threads.
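
To make that concrete, here is a rough sketch of the pipelined pattern using the asynchronous API of the same ConfiguredInferModel object you are already using. It assumes run_async() and AsyncInferJob.wait() from recent HailoRT Python releases, and the callback signature and create_bindings()/set_buffer() usage follow Hailo’s async examples; the make_bindings helper and the output buffer shape/dtype are placeholders you would adapt to your model, so treat it as an outline of the idea rather than a verified drop-in example.

    import time
    import numpy as np

    NUM_FRAMES = 200

    def make_bindings():
        # Hypothetical helper: build one bindings object per in-flight frame,
        # the same way you already do, so buffers are not reused while the
        # device is still writing to them. A real application would recycle a
        # small pool of bindings from the completion callback instead.
        b = configured_infer_model.create_bindings()
        b.input().set_buffer(np.array(preprocessed_image))
        # Placeholder output buffer; adjust shape/dtype to your model's output.
        b.output().set_buffer(np.empty(infer_model.output().shape, dtype=np.float32))
        return b

    def on_done(completion_info):
        # Completion callback: called once a frame has finished. Post-processing
        # (or handing the result to a queue/thread) would happen here.
        pass

    all_bindings = [make_bindings() for _ in range(NUM_FRAMES)]

    start = time.time()
    last_job = None
    for b in all_bindings:
        # run_async() queues the frame and returns almost immediately (it may
        # block briefly when the internal queue is full), so many frames are in
        # flight at once instead of waiting for each result.
        last_job = configured_infer_model.run_async([b], on_done)
    last_job.wait(10_000)  # wait (in ms) for the last queued frame to finish
    elapsed = time.time() - start
    print(f"Throughput: {NUM_FRAMES / elapsed:.1f} FPS")

The key point is that many frames are queued before the first result is read back, which is what keeps the device’s internal pipeline full.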

Many thanks, Klaus.

Which of the examples from here: GitHub - hailo-ai/Hailo-Application-Code-Examples uses that approach? I’ve analyzed the object detection (Python) example and I see two threads, queues and batch processing there. But batch processing does not necessarily mean the same thing as what you are talking about, right? How can I know that Hailo is ready to accept a new image?