ConvNeXt on Hailo-8: Non-Deterministic HEF Results

Non-Deterministic HEF Results on Hardware Despite Perfect HAR Emulation

I’m experiencing catastrophic non-deterministic failures when running a quantized ConvNeXt model on Hailo-8 hardware, even though both the FP32 and quantized HAR files produce correct, deterministic results in SDK emulation. The same input image produces completely different predictions on each run, which suggests memory corruption or uninitialized buffers in the HEF runtime.

System Information

Development Machine (Compilation):

  • Hailo Dataflow Compiler: v3.31.0
  • Python: 3.10
  • OS: Ubuntu 24.04 LTS

Target Hardware:

  • Hailo-8 AI Module (26 TOPS)
  • HailoRT: v4.19.0
  • Platform: Raspberry Pi 5
  • Interface: PCIe

Model:

  • Architecture: ConvNeXt Small with SE blocks
  • Input: 224x224x3 RGB images
  • Output: 67 classes
  • Framework: PyTorch → ONNX (opset 11) → Hailo

The Issue

:white_check_mark: HAR Emulation Works Perfectly

Both FP32 and quantized HAR files produce deterministic, correct results in SDK emulation:

FP32 HAR (SDK_NATIVE):

Run 1: Class 1 at 99.95%
Run 2: Class 1 at 99.95%
Run 3: Class 1 at 99.95%

Quantized HAR (SDK_QUANTIZED):

Run 1: Class 1 at 99.97%
Run 2: Class 1 at 99.97%
Run 3: Class 1 at 99.97%

:cross_mark: HEF on Hardware Completely Fails

The compiled HEF produces non-deterministic, incorrect results on Hailo-8 hardware:

Same image, multiple runs:

Run 1: Class 17 at 19.0%
Run 2: Class 0 at 3.5%
Run 3: Class 2 at 7.4%
Run 4: Class 34 at 58.4%
Run 5: Class 19 at 10.5%

Expected result: Class 1 at ~99.97% (matches ONNX and HAR emulation)

Reproducible Test Code

HAR Emulation Test (Works :white_check_mark:)

import numpy as np
import cv2
from hailo_sdk_client import ClientRunner, InferenceContext

# Preprocess image
image = cv2.imread('test_image.jpg')
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
resized = cv2.resize(rgb, (224, 224))
input_data = np.expand_dims(resized.astype(np.float32), axis=0)

# Test quantized HAR
runner = ClientRunner(har='model_quantized.har')
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    output = runner.infer(ctx, input_data, batch_size=1)

# Results: Class 1 at 99.97% ✅

HEF Hardware Test (Fails :cross_mark:)

import numpy as np
import cv2
import hailo_platform as hpf

# Preprocess image
image = cv2.imread('test_image.jpg')
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
resized = cv2.resize(rgb, (224, 224))
input_image = resized.astype(np.float32)  # [0-255] for model script normalization

# Load HEF
hef = hpf.HEF('model_quantized.hef')

with hpf.VDevice() as target:
    configure_params = hpf.ConfigureParams.create_from_hef(
        hef, interface=hpf.HailoStreamInterface.PCIe
    )
    network_group = target.configure(hef, configure_params)[0]
    network_group_params = network_group.create_params()

    input_vstream_info = hef.get_input_vstream_infos()[0]
    output_vstream_info = hef.get_output_vstream_infos()[0]

    input_vstreams_params = hpf.InputVStreamParams.make_from_network_group(
        network_group, quantized=False, format_type=hpf.FormatType.FLOAT32
    )
    output_vstreams_params = hpf.OutputVStreamParams.make_from_network_group(
        network_group, quantized=False, format_type=hpf.FormatType.FLOAT32
    )

    with network_group.activate(network_group_params):
        with hpf.InferVStreams(network_group, input_vstreams_params, output_vstreams_params) as infer_pipeline:
            input_data = {input_vstream_info.name: np.expand_dims(input_image, axis=0)}
            results = infer_pipeline.infer(input_data)
            output = results[output_vstream_info.name]

            # Results: Non-deterministic random classes ❌

Model Script Configuration

I’m using a model script for on-chip ImageNet normalization:

imagenet_normalization.alls:

# ImageNet normalization converted to Hailo format
# PyTorch: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
# Hailo: mean * 255, std * 255
input_normalization = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])
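
For reference, a quick sanity check that these values are just the PyTorch ImageNet constants scaled by 255:

import numpy as np
print(np.array([0.485, 0.456, 0.406]) * 255)  # → 123.675, 116.28, 103.53
print(np.array([0.229, 0.224, 0.225]) * 255)  # → 58.395, 57.12, 57.375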

Compilation command:

hailo parser onnx model.onnx --har model.har
hailo optimize model.har --calib-set-path calibration.npy \
    --model-script imagenet_normalization.alls \
    --output-har-path model_quantized.har
hailo compiler model_quantized.har

Calibration data: Raw [0-255] FLOAT32 pixels (1500 samples)
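
For completeness, a sketch of how a calibration set like this can be assembled (the calib_images/ folder is a hypothetical path; the point is raw [0-255] RGB pixels in NHWC with no normalization):

import glob
import numpy as np
import cv2

samples = []
for path in sorted(glob.glob('calib_images/*.jpg'))[:1500]:
    img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)            # BGR → RGB
    samples.append(cv2.resize(img, (224, 224)).astype(np.float32))     # keep raw [0-255]

np.save('calibration.npy', np.stack(samples))  # shape (1500, 224, 224, 3)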

What I’ve Ruled Out

:white_check_mark: API usage - Tested with proper context managers and network_group_params
:white_check_mark: VStream parameters - Tested all combinations of quantized=True/False and FLOAT32/UINT8 (the UINT8 variant is sketched after this list)
:white_check_mark: Model script - Build logs confirm input_normalization layer is compiled and connected
:white_check_mark: Calibration data - Verified correct [0-255] range, proper shape (1500, 224, 224, 3)
:white_check_mark: ONNX model - Produces perfect deterministic results on CPU
:white_check_mark: Quantization - HAR emulation shows quantization is correct
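
For illustration, the UINT8 input variant referenced above looks roughly like this (a sketch reusing network_group, input_vstream_info, and resized from the hardware test above):

input_vstreams_params = hpf.InputVStreamParams.make_from_network_group(
    network_group, quantized=True, format_type=hpf.FormatType.UINT8
)
output_vstreams_params = hpf.OutputVStreamParams.make_from_network_group(
    network_group, quantized=False, format_type=hpf.FormatType.FLOAT32
)
# Feed raw uint8 pixels; the on-chip input_normalization layer is expected
# to handle the ImageNet scaling.
input_data = {input_vstream_info.name: np.expand_dims(resized.astype(np.uint8), axis=0)}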

Build Logs

No errors or warnings during compilation. The input_normalization layer is present in the compiled HEF:

training_convnext_small_se_balanced_context_0:
  input_layer1, input_normalization, conv1, ...

Layer connections:
  input_layer1 → input_normalization → conv1_defuse_width_feature_reshape

Questions

  1. Why does HAR emulation work perfectly but HEF hardware fails? The quantized HAR produces deterministic Class 1 at 99.97% in SDK_QUANTIZED emulation, but the compiled HEF produces random, non-deterministic results on actual hardware.

  2. Is this a known issue with ConvNeXt models or LayerNorm layers? The model uses LayerNorm extensively, which might have compatibility issues.

  3. Could this be a DFC v3.31.0 bug?

  4. Is model script normalization actually executing during HEF inference? Despite being compiled into the HEF, diagnostic tests suggest the normalization layer may not be executing on hardware.

  5. What debugging steps can I take? Is there a way to inspect HEF execution at runtime or verify that preprocessing layers are actually running?

Additional Evidence

I created comprehensive diagnostic scripts that prove:

  • ONNX baseline: Deterministic Class 1 at 99.97% :white_check_mark:
  • FP32 HAR emulation: Deterministic Class 1 at 99.95% :white_check_mark:
  • Quantized HAR emulation: Deterministic Class 1 at 99.97% :white_check_mark:
  • HEF on hardware: Non-deterministic random classes :cross_mark:

The non-determinism (different results on identical inputs) strongly suggests memory corruption or uninitialized buffers in the HEF runtime.
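
A minimal sketch of the ONNX baseline check (assuming onnxruntime and that the exported model expects ImageNet-normalized NCHW float32 input):

import numpy as np
import cv2
import onnxruntime as ort

# Same image and resize as the Hailo tests, plus ImageNet normalization and NCHW layout
image = cv2.imread('test_image.jpg')
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
resized = cv2.resize(rgb, (224, 224)).astype(np.float32) / 255.0
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
nchw = np.transpose((resized - mean) / std, (2, 0, 1))[None, ...]

session = ort.InferenceSession('model.onnx')
input_name = session.get_inputs()[0].name
logits = session.run(None, {input_name: nchw})[0]

# Softmax + argmax to get the predicted class and confidence
probs = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
print(f"Class {int(probs.argmax())} at {float(probs.max()) * 100:.2f}%")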

Thank you for your help!

Hailo software products are only validated against each other at specific version combinations. Your Hailo Dataflow Compiler and HailoRT version combination is not validated. Please have a look at the Hailo AI Software Suite version compatibility matrix. I do not know whether this is causing the issue, but you should rule it out.

Developer Zone - Hailo AI Software Suite Versions compatibility

Yes, it should.

The HEF file is more like an FPGA image than code running on a CPU. The layers all work independently and at the same time, and data flows from one layer to another. There is no decision about whether a layer runs or not. You can simply test your hardware using some ready-made examples from Raspberry Pi or Hailo. If a device had memory corruption, it would be defective and you would see it when running other HEF files as well.
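
For example, any HEF (your own or a pretrained one from the Hailo Model Zoo) can be exercised from the command line with the hailortcli tool that ships with HailoRT; it runs the network with generated input data and reports throughput, which is a quick way to confirm the device and driver are healthy independent of host-side preprocessing:

hailortcli run model_quantized.hef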

Thanks for the software compatibility overview; I had overlooked that. I would expect more explicit errors in the case of incompatible packages, but it’s worth ruling this out as a source of error.

Tested with latest versions - bug persists

  • DFC 3.33.0 (upgraded from 3.31.0)
  • HailoRT 4.23.0 (upgraded from 4.19.0)
  • Embedded ONNX normalization (no model script) - see the export sketch after this list
  • LayerNorm Decomposition optimization enabled
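
For context, “embedded ONNX normalization” means folding the ImageNet normalization into the exported graph instead of applying it via a model script. A minimal sketch of how that can be done at export time (assuming a trained PyTorch model object named model and inputs scaled to [0, 1]):

import torch
import torch.nn as nn

class NormalizedModel(nn.Module):
    """Wraps the trained network so ImageNet normalization is part of the ONNX graph."""
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.register_buffer('mean', torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer('std', torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

    def forward(self, x):
        # x is expected in [0, 1]; normalization happens inside the exported graph
        return self.model((x - self.mean) / self.std)

wrapped = NormalizedModel(model).eval()
dummy = torch.zeros(1, 3, 224, 224)
torch.onnx.export(wrapped, dummy, 'model_embedded_norm.onnx', opset_version=11)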

I don’t think it’s specific to LayerNorm; I’ve experienced this with fast-plate-ocr and also YOLOv8 (with YOLOv8 the differences aren’t large enough to affect the end results, but they are not the small floating-point differences described in the documentation).

You can see the difference with fast-plate-ocr here. My comment also links to two other topics reporting the same issue (YOLOv5 and an unidentified network).

Did you find a solution, or are you still waiting for one?

Hasn’t been resolved yet.

@KlausK I’ve aligned the whole software stack with the compatibility matrix you shared. However, the results remain the same: all over the place.

A few points worth checking:

  1. Use a calibration dataset with un-normalized uint8 images.
  2. Use a newer API for inference. I still suspect some issues with FLOAT/INT vstreams. I’d recommend using this wrapper:

Here’s an example of how to use it:
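
For reference, a minimal sketch along the same lines using the HailoRT InferModel API directly (names per the hailo_platform Python package in recent HailoRT releases; assumes a preprocessed 224x224x3 float32 frame named input_frame, adapt to your installed version):

import numpy as np
from hailo_platform import VDevice, HailoSchedulingAlgorithm, FormatType

timeout_ms = 10000
params = VDevice.create_params()
params.scheduling_algorithm = HailoSchedulingAlgorithm.ROUND_ROBIN

with VDevice(params) as vdevice:
    infer_model = vdevice.create_infer_model('model_quantized.hef')
    infer_model.input().set_format_type(FormatType.FLOAT32)
    infer_model.output().set_format_type(FormatType.FLOAT32)

    with infer_model.configure() as configured_infer_model:
        bindings = configured_infer_model.create_bindings()
        bindings.input().set_buffer(np.ascontiguousarray(input_frame))
        output_buffer = np.empty(infer_model.output().shape, dtype=np.float32)
        bindings.output().set_buffer(output_buffer)

        # Blocking single-frame inference; result is written into the output buffer
        configured_infer_model.run([bindings], timeout_ms)
        result = bindings.output().get_buffer()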