Poor Performance | Hailo8L Slower Than iGPU

I have the following setup:

  • Hailo8L
  • MobileNetV4 Conv Large model (~40M parameters)
  • Intel Iris Xe iGPU

I measured the following performance with a batch size of 1 and an input size of 320x320x1:

  • ONNX float32 on Intel Iris Xe iGPU: ~50 FPS
  • uint8 HEF on Hailo8L: ~35 FPS

I checked the PCIe throughput and got ~13,000 Mbit/s.
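To sanity-check that number (a rough back-of-the-envelope sketch; the ~35 FPS and frame size are from my measurements above, everything else is plain arithmetic):

```python
# Rough PCIe utilization at the measured Hailo8L frame rate.
# Assumes one uint8 320x320x1 input frame transferred per inference.
link_bytes_per_s = 13_000e6 / 8      # ~13,000 Mbit/s -> ~1.6 GB/s
frame_bytes = 320 * 320 * 1          # 102,400 bytes (~100 KiB) per frame
fps = 35

utilization = fps * frame_bytes / link_bytes_per_s
print(f"PCIe utilization: {utilization:.2%}")   # ~0.22% -> transfer is not the bottleneck
```

So the PCIe link is nowhere near saturated; the bottleneck must be elsewhere.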

It surprises me that the Hailo8L running a quantized uint8 model is slower than the iGPU running a full-precision float32 ONNX model. It seems odd that a dedicated AI accelerator loses to an old iGPU at all, let alone with that precision advantage.

What could be the explanation for this?

Starting Measurements...
Measuring FPS in HW-only mode
Network model/model: 100% | 529 | FPS: 35.23 | ETA: 00:00:00
Measuring FPS (and Power on supported platforms) in streaming mode
Network model/model: 100% | 544 | FPS: 36.26 | ETA: 00:00:00
Measuring HW Latency
Network model/model: 100% | 518 | HW Latency: 28.28 ms | ETA: 00:00:00

=======
Summary
=======
FPS     (hw_only)                 = 35.2285
        (streaming)               = 36.2566
Latency (hw)                      = 28.2813 ms
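One thing that stands out in these numbers (simple arithmetic, nothing Hailo-specific): the hw-only FPS is almost exactly the reciprocal of the HW latency, which suggests frames are being processed strictly one after another with no pipelining:

```python
hw_latency_ms = 28.2813              # measured HW latency from the summary above
measured_fps = 35.2285               # measured hw_only FPS

fps_if_serial = 1000.0 / hw_latency_ms
print(round(fps_if_serial, 2))       # ~35.36, nearly identical to the measured FPS
```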

The model does not fit into a single Hailo-8L and is therefore divided into multiple contexts. Run the following command to confirm:

hailortcli parse-hef model.hef

With multi-context models you can increase throughput at the cost of latency by running batches of images; this reduces the context-switching overhead. For example:

hailortcli run model.hef --batch-size 2
hailortcli run model.hef --batch-size 4
hailortcli run model.hef --batch-size 8
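To illustrate why this helps, here is a toy throughput model (the per-frame compute time and per-switch overhead below are invented illustrative numbers, not measurements): each pass through the N contexts pays the switching cost, but a batch of B frames pays it once per context instead of once per frame:

```python
def toy_fps(batch, n_contexts=10, compute_ms_per_frame=1.0, switch_ms=2.0):
    """Toy model: one pass runs `batch` frames through all contexts,
    paying the context-switch overhead once per context per pass."""
    pass_ms = n_contexts * (batch * compute_ms_per_frame + switch_ms)
    return 1000.0 * batch / pass_ms

for b in (1, 2, 4, 8):
    print(f"batch={b}: ~{toy_fps(b):.0f} FPS")   # throughput rises with batch size
```

Note that in this model the time per pass also grows with batch size, which is exactly the throughput-vs-latency trade-off described above.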

Let me know what results you get.

This post might be an interesting read as well.

Hailo Community - My model runs slower than expected

The model indeed consists of 10 contexts.

To get better performance, should I thus use a smaller model, 4-bit quantization, or a smaller input size?

Any other tips on how to reduce the number of contexts?

Network group name: model, Multi Context - Number of contexts: 10
    Network name: model/model
        VStream infos:
            Input  model/input_layer1 UINT8, NHWC(320x320x1)
            Output model/conv68 UINT8, NHWC(40x40x1)
            Output model/conv70 UINT8, NHWC(1x40x1)
            Output model/conv69 UINT8, NHWC(40x40x1)

What is the number of contexts based on? Is it purely the number of parameters, or are there other factors involved?

In general, how would you choose a computer vision backbone to fit in a single context?

How can I force the compiler to produce a single-context, minimum-latency model that will only ever be used with a batch size of 1?

hailo compiler --hw-arch hailo8l model_optimized.har

The number of contexts is based on the three resources (compute, memory and control) we have on the Hailo device.

You can find this information at the end of the compiler output and the profiler report.

The compiler will try this by default. However, in Performance Mode the compiler will try harder; this takes significantly longer to complete. Please check the performance_param section of the Hailo Dataflow Compiler User Guide.

Setting the compiler_optimization_level to max helps a lot.

performance_param(compiler_optimization_level=max)
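For reference, a sketch of where that line lives (I am assuming the standard .alls model-script flow described in the Dataflow Compiler User Guide; the file name model.alls is just an example):

```
# model.alls -- model script holding optimization/compilation directives;
# applied during the optimization step, before running hailo compiler
performance_param(compiler_optimization_level=max)
```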

It does, however, still optimize for maximum FPS, not minimum latency.

[info] Resources optimization guidelines: Strategy -> GREEDY Objective -> MAX_FPS

How would you set the objective to MIN_LATENCY?