Poor Performance | Hailo8L Slower Than iGPU

I have the following setup:

  • Hailo8L
  • MobileNetV4 Conv Large model (~40M parameters)
  • Intel Iris Xe iGPU

I measured the following performance with a batch size of 1 and an input size of 320x320x1:

  • ONNX float32 on Intel Iris Xe iGPU: ~50 FPS
  • uint8 HEF on Hailo8L: ~35 FPS

I checked the PCIe throughput and got ~13,000 Mbit/s.
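To sanity-check that number (a rough back-of-the-envelope sketch; the ~35 FPS and frame size are from my measurements above, everything else is plain arithmetic):

```python
# Rough PCIe utilization at the measured Hailo8L frame rate.
# Assumes one uint8 320x320x1 input frame transferred per inference.
link_bytes_per_s = 13_000e6 / 8      # ~13,000 Mbit/s -> ~1.6 GB/s
frame_bytes = 320 * 320 * 1          # 102,400 bytes (~100 KiB) per frame
fps = 35

utilization = fps * frame_bytes / link_bytes_per_s
print(f"PCIe utilization: {utilization:.2%}")   # ~0.22% -> transfer is not the bottleneck
```

So the PCIe link is nowhere near saturated; the bottleneck must be elsewhere.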

It surprises me that the Hailo8L running a quantized uint8 model is slower than the iGPU running a full-precision float32 ONNX model. It seems odd that a dedicated AI accelerator loses to an old iGPU at all, let alone with that precision advantage.

What could be the explanation for this?

Starting Measurements...
Measuring FPS in HW-only mode
Network model/model: 100% | 529 | FPS: 35.23 | ETA: 00:00:00
Measuring FPS (and Power on supported platforms) in streaming mode
Network model/model: 100% | 544 | FPS: 36.26 | ETA: 00:00:00
Measuring HW Latency
Network model/model: 100% | 518 | HW Latency: 28.28 ms | ETA: 00:00:00

=======
Summary
=======
FPS     (hw_only)                 = 35.2285
        (streaming)               = 36.2566
Latency (hw)                      = 28.2813 ms
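One thing that stands out in these numbers (simple arithmetic, nothing Hailo-specific): the hw-only FPS is almost exactly the reciprocal of the HW latency, which suggests frames are being processed strictly one after another with no pipelining:

```python
hw_latency_ms = 28.2813              # measured HW latency from the summary above
measured_fps = 35.2285               # measured hw_only FPS

fps_if_serial = 1000.0 / hw_latency_ms
print(round(fps_if_serial, 2))       # ~35.36, nearly identical to the measured FPS
```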

The model does not fit into a single Hailo-8L and is therefore divided into multiple contexts. Run the following command to confirm:

hailortcli parse-hef model.hef

With multi-context models you can increase throughput at the cost of latency by running batches of images; this reduces the context-switching overhead. For example:

hailortcli run model.hef --batch-size 2
hailortcli run model.hef --batch-size 4
hailortcli run model.hef --batch-size 8
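To illustrate why this helps, here is a toy throughput model (the per-frame compute time and per-switch overhead below are invented illustrative numbers, not measurements): each pass through the N contexts pays the switching cost, but a batch of B frames pays it once per context instead of once per frame:

```python
def toy_fps(batch, n_contexts=10, compute_ms_per_frame=1.0, switch_ms=2.0):
    """Toy model: one pass runs `batch` frames through all contexts,
    paying the context-switch overhead once per context per pass."""
    pass_ms = n_contexts * (batch * compute_ms_per_frame + switch_ms)
    return 1000.0 * batch / pass_ms

for b in (1, 2, 4, 8):
    print(f"batch={b}: ~{toy_fps(b):.0f} FPS")   # throughput rises with batch size
```

Note that in this model the time per pass also grows with batch size, which is exactly the throughput-vs-latency trade-off described above.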

Let me know what results you get.

This post might be an interesting read as well.

Hailo Community - My model runs slower than expected

The model indeed consists of 10 contexts.

To get better performance, should I thus use a smaller model, 4-bit quantization, or a smaller input size?

Any other tips on how to reduce the number of contexts?

Network group name: model, Multi Context - Number of contexts: 10
    Network name: model/model
        VStream infos:
            Input  model/input_layer1 UINT8, NHWC(320x320x1)
            Output model/conv68 UINT8, NHWC(40x40x1)
            Output model/conv70 UINT8, NHWC(1x40x1)
            Output model/conv69 UINT8, NHWC(40x40x1)

What is the number of contexts based on? Is it purely the number of parameters, or are there other factors involved?

In general, how would you choose a computer vision backbone to fit in a single context?

How can I force the compiler to produce a single-context, minimum-latency model that will only ever be used with a batch size of 1?

hailo compiler --hw-arch hailo8l model_optimized.har

The number of contexts is based on the three resources (compute, memory and control) we have on the Hailo device.

You can find this information at the end of the compiler output and the profiler report.

The compiler will try this by default. However, in Performance Mode the compiler will try harder; this takes significantly longer to complete. Please check the performance_param section of the Hailo Dataflow Compiler User Guide.

Setting the compiler_optimization_level to max helps a lot.

performance_param(compiler_optimization_level=max)
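For reference, a sketch of where that line lives (I am assuming the standard .alls model-script flow described in the Dataflow Compiler User Guide; the file name model.alls is just an example):

```
# model.alls -- model script holding optimization/compilation directives;
# applied during the optimization step, before running hailo compiler
performance_param(compiler_optimization_level=max)
```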

It does, however, still optimize for maximum FPS, not minimum latency.

[info] Resources optimization guidelines: Strategy -> GREEDY Objective -> MAX_FPS

How would you set the objective to MIN_LATENCY?