Difference in performance between the HMZ yolov8s.hef and a custom-built one

Hi guys,

In our company we’ve compiled a yolov8s model and it works pretty well; however, there is a noticeable difference in performance compared to the model downloaded from the Hailo Model Zoo.

Results for the stock hmz model:

$ hailortcli run yolov8s_official.hef --batch-size 1
Running streaming inference (yolov8s_official.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov8s/yolov8s: 100% | 497 | FPS: 98.75 | ETA: 00:00:00
> Inference result:
 Network group: yolov8s
    Frames count: 497
    FPS: 98.76
    Send Rate: 970.81 Mbit/s
    Recv Rate: 964.74 Mbit/s

and here are the results for our model, after a couple of experiments with the DFC compiler options:

$ hailortcli run yolov8s_max_optimization.hef --batch-size 1
Running streaming inference (yolov8s_max_optimization.hef):
  Transform data: true
    Type:      auto
    Quantized: true
Network yolov8s/yolov8s: 100% | 464 | FPS: 92.21 | ETA: 00:00:00

> Inference result:
 Network group: yolov8s
    Frames count: 464
    FPS: 92.21
    Send Rate: 906.45 Mbit/s
    Recv Rate: 406.60 Mbit/s

As you can see, the results are worse by about 6 FPS (roughly a 7% drop).

Here is the alls file:

normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv42, sigmoid)
change_output_activation(conv53, sigmoid)
change_output_activation(conv63, sigmoid)
nms_postprocess("../../postprocess_config/yolov8s_nms_config.json", meta_arch=yolov8, engine=cpu)
performance_param(compiler_optimization_level=max)

Here is the command that I ran to compile the model:

hailomz compile --hw-arch hailo8l --ckpt yolov8s.onnx --calib-path /images/ --yaml /hailo_model_zoo/hailo_model_zoo/cfg/networks/yolov8s.yaml --classes 1

Is there any compile option that we are missing?

Thank you,

Chuck

The forum post below provides some useful guidance on the basics of analyzing performance issues:

Hailo Community - My model runs slower than expected

To help us better understand your specific case, could you please do the following:

  1. Parse both of your HEF files using:
hailortcli parse-hef model.hef
  2. Provide some information about your system:
  • host CPU (x86 or Arm)
  • PCIe number of lanes (1, 2 or 4)
  • PCIe generation (1, 2 or 3)
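
If it helps, the PCIe lane count and generation can be read from the `LnkCap`/`LnkSta` lines of `lspci -vv`. A small sketch (the device address and the inlined sample output below are just an example; on a live system you would run the command in the first comment instead):

```shell
# On a live system: sudo lspci -s 02:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
# Inlined sample output so the snippet is self-contained:
lspci_dump='LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1
LnkSta: Speed 8GT/s (ok), Width x2 (downgraded)'
printf '%s\n' "$lspci_dump" | grep -E 'LnkCap:|LnkSta:'
# LnkCap is what the endpoint supports; LnkSta is what was actually negotiated,
# so a "(downgraded)" width means the slot is running with fewer lanes.
```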

Thanks!

Hi @KlausK,

Many thanks for the prompt reply; here is the requested information:

Model from hailo_model_zoo

hailortcli parse-hef yolov8s_hmz.hef
Architecture HEF was compiled for: HAILO8L
Network group name: yolov8s, Multi Context - Number of contexts: 3
    Network name: yolov8s/yolov8s
        VStream infos:
            Input  yolov8s/input_layer1 UINT8, NHWC(640x640x3)
            Output yolov8s/yolov8_nms_postprocess FLOAT32, HAILO NMS BY CLASS(number of classes: 80, maximum bounding boxes per class: 100, maximum frame size: 160320)
            Operation:
                Op YOLOV8
                Name: YOLOV8-Post-Process
                Score threshold: 0.200
                IoU threshold: 0.70
                Classes: 80
                Max bboxes per class: 100
                Image height: 640
                Image width: 640

Our model (we have only one object class):

hailortcli parse-hef yolov8s_max_optimization.hef
Architecture HEF was compiled for: HAILO8L
Network group name: yolov8s, Multi Context - Number of contexts: 3
    Network name: yolov8s/yolov8s
        VStream infos:
            Input  yolov8s/input_layer1 UINT8, NHWC(640x640x3)
            Output yolov8s/yolov8_nms_postprocess FLOAT32, HAILO NMS BY CLASS(number of classes: 1, maximum bounding boxes per class: 100, maximum frame size: 2004)
            Operation:
                Op YOLOV8
                Name: YOLOV8-Post-Process
                Score threshold: 0.200
                IoU threshold: 0.70
                Classes: 1
                Max bboxes per class: 100
                Image height: 640
                Image width: 640

I can see a difference in the maximum frame size - I’m not sure exactly what it means.
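
If my assumption about the HAILO NMS BY CLASS output layout is right (per class: one float32 detection count, followed by up to max_bboxes_per_class boxes of 5 float32 values each - y_min, x_min, y_max, x_max, score), then the numbers from parse-hef add up exactly:

```python
FLOAT32 = 4          # bytes per float32 value
FIELDS_PER_BOX = 5   # y_min, x_min, y_max, x_max, score
MAX_BOXES = 100      # "maximum bounding boxes per class" from parse-hef

def max_frame_size(num_classes):
    # Per class: one float32 count + MAX_BOXES boxes of 5 float32 fields.
    per_class = FLOAT32 + MAX_BOXES * FIELDS_PER_BOX * FLOAT32
    return num_classes * per_class

print(max_frame_size(80))  # 160320 -> matches the hmz model (80 classes)
print(max_frame_size(1))   # 2004   -> matches our model (1 class)
```

That would also explain the much lower Recv Rate for our model: the output buffer per frame is ~80x smaller.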

Here are the machine details, although that shouldn’t matter, since I am comparing both models on the same machine:

02:00.0 Co-processor: Hailo Technologies Ltd. Hailo-8 AI Processor (rev 01)
Subsystem: Hailo Technologies Ltd. Hailo-8 AI Processor
Physical Slot: 0-2
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 16
Region 0: Memory at 380800004000 (64-bit, prefetchable) [size=16K]
Region 2: Memory at 380800008000 (64-bit, prefetchable) [size=4K]
Region 4: Memory at 380800000000 (64-bit, prefetchable) [size=16K]
Capabilities: [80] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <2us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x2 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [e0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000  Data: 0000
Capabilities: [f8] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
Capabilities: [108 v1] Latency Tolerance Reporting
Max snoop latency: 15728640ns
Max no snoop latency: 15728640ns
Capabilities: [200 v2] Advanced Error Reporting
UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 04000001 0000000f 86f80000 54f14c95
Kernel driver in use: hailo
Kernel modules: hailo_pci
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             46 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      14
  On-line CPU(s) list:       0-13
Vendor ID:                   GenuineIntel
  Model name:                Intel(R) Core(TM) Ultra 7 265K
    CPU family:              6
    Model:                   198
    Thread(s) per core:      1
    Core(s) per socket:      14
    Socket(s):               1
    Stepping:                2
    BogoMIPS:                7756.43
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcn
                             t tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rd
                             seed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni wbnoinvd arat umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid bus_lock_detect movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization features:     
  Virtualization:            VT-x
Caches (sum of all):         
  L1d:                       448 KiB (14 instances)
  L1i:                       448 KiB (14 instances)
  L2:                        56 MiB (14 instances)
  L3:                        16 MiB (1 instance)
NUMA:                        
  NUMA node(s):              1
  NUMA node0 CPU(s):         0-13
Vulnerabilities:             
  Gather data sampling:      Not affected
  Indirect target selection: Not affected
  Itlb multihit:             Not affected
  L1tf:                      Not affected
  Mds:                       Not affected
  Meltdown:                  Not affected
  Mmio stale data:           Not affected
  Reg file data sampling:    Not affected
  Retbleed:                  Not affected
  Spec rstack overflow:      Not affected
  Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS Not affected; BHI BHI_DIS_S
  Srbds:                     Not affected
  Tsa:                       Not affected
  Tsx async abort:           Not affected
  Vmscape:                   Not affected

Maybe the difference is in the YOLOv8 export itself? Could you tell me how to generate the profiler results that are included in the Hailo Model Zoo? Then I could compare the networks and look for differences there.

Best regards,

Chuck

I’ve exported the network information for both models via the Hailo profiler and compared them; they are identical except for the postprocess.

Here is some additional information that I can provide:

[info] Partition to contexts finished successfully
[info] Partitioner finished after 135 iterations, Time it took: 5m 26s 304ms                                                                                                                                                                                                       
[info] Applying selected partition to 4 contexts...
[info] yolov8s_context_0 utilization:                                                                                                                                                                                                                                              
[info] +-----------+---------------------+---------------------+--------------------+
[info] | Cluster   | Control Utilization | Compute Utilization | Memory Utilization |                        
[info] +-----------+---------------------+---------------------+--------------------+
[info] | cluster_0 | 93.8%               | 87.5%               | 84.4%              |
[info] | cluster_1 | 75%                 | 76.6%               | 89.8%              | 
[info] | cluster_4 | 100%                | 98.4%               | 78.1%              |                                                                                                                                                                                              
[info] | cluster_5 | 75%                 | 98.4%               | 71.9%              |                                                                                                                                                                                              
[info] +-----------+---------------------+---------------------+--------------------+
[info] | Total     | 85.9%               | 90.2%               | 81.1%              |
[info] +-----------+---------------------+---------------------+--------------------+
[info] yolov8s_context_1 utilization:        
[info] +-----------+---------------------+---------------------+--------------------+
[info] | Cluster   | Control Utilization | Compute Utilization | Memory Utilization |
[info] +-----------+---------------------+---------------------+--------------------+
[info] | cluster_0 | 75%                 | 71.9%               | 56.3%              |
[info] | cluster_1 | 100%                | 89.1%               | 95.3%              |
[info] | cluster_4 | 87.5%               | 100%                | 60.2%              |
[info] | cluster_5 | 93.8%               | 100%                | 68%                |
[info] +-----------+---------------------+---------------------+--------------------+
[info] | Total     | 89.1%               | 90.2%               | 69.9%              |
[info] +-----------+---------------------+---------------------+--------------------+
[info] yolov8s_context_2 utilization: 
[info] +-----------+---------------------+---------------------+--------------------+
[info] | Cluster   | Control Utilization | Compute Utilization | Memory Utilization |
[info] +-----------+---------------------+---------------------+--------------------+
[info] | cluster_0 | 81.3%               | 84.4%               | 97.7%              |
[info] | cluster_1 | 87.5%               | 95.3%               | 71.9%              |
[info] | cluster_4 | 68.8%               | 92.2%               | 67.2%              |
[info] | cluster_5 | 100%                | 87.5%               | 98.4%              |
[info] +-----------+---------------------+---------------------+--------------------+
[info] | Total     | 84.4%               | 89.8%               | 83.8%              |
[info] +-----------+---------------------+---------------------+--------------------+