reconciling different outputs between quantized HAR and compiled HEF

Hi,

I’m trying to compile an ONNX network into an HEF model for the 8L. I’m mostly following the tutorials, specifically “DFC_2_Model_Optimization_Tutorial”.

I can successfully run inference both in emulation and on hardware. However, I can’t seem to reconcile the network output of the .har archive (run with the ClientRunner) with the output of the .hef file on the chip.

The issue is that I can’t seem to run the ClientRunner inference in a way which returns the same uint8 values as the HEF. The ClientRunner always returns float32s, irrespective of whether I set the InferenceContext to SDK_QUANTIZED or not. Is this by design?

The ONNX model accepts a single input tensor (a 1 channel depth image), and has two outputs.

Here are the steps I’ve taken.

ONNX-to-HAR conversion:

runner = ClientRunner(hw_arch="hailo8l")
runner.translate_onnx_model("model.onnx", ...)
runner.save_har("hailo8l.har")

Optimization:

# calib_dset is a float32 numpy array of shape (2000, 192, 256, 1), normalized to [0, 1]
calib_dset = load_calibration_dataset()
runner = ClientRunner(har=str(har_path))

# I skip adding a model script
# runner.load_model_script(alls_str)

runner.optimize(calib_dset)
runner.save_har("hailo8l_quant.har")

Compilation:

from pathlib import Path

def compile_har(model_path: Path, out: Path):
    print(f"Compiling HAR at {model_path}")

    runner = ClientRunner(har=str(model_path))
    hef = runner.compile()

    out.write_bytes(hef)

    return out
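Called like this, for example (paths here are just placeholders):

compile_har(Path("hailo8l_quant.har"), Path("model.hef"))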

Emulation Inference:

# NOTE: `img` is a single image from the calibration dataset: (1,192,256,1)

runner = ClientRunner(har="hailo8l_quant.har")
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    conv_out, cls_out = runner.infer(ctx, img)

For the actual HEF inference, I copied the code almost exactly from “DFC_4_Inference_Tutorial”.

I’ve tried various permutations of InferenceContext settings in emulation, normalization params in optimization, and the VStreamParams format_type in the HEF inference.

I’m new to quantized networks, so I don’t know if this is something I’m missing or if it’s intended behavior of the system. If it’s intended, I’m curious how I can validate the actual accuracy degradation of the quantization process, since the emulated outputs have a completely different domain from the actual outputs; can I just log-softmax them and calculate the loss that way?

Hey @Eric_Schwarzenbach,

Welcome to the community!

The behavior you’re seeing is totally expected. Let me break down why it happens and how to make sure the outputs from ClientRunner line up with what you get when running the actual .hef on the Hailo-8L.

The ClientRunner always returns float32, and this is by design:

  • Even if you use InferenceContext.SDK_QUANTIZED, the infer() output is still in dequantized float32, not raw uint8.

  • That’s because SDK_QUANTIZED tells the SDK to simulate quantized inference internally — but it converts the output back to float32 using the quantization params (scale and zero-point).

  • This makes it much easier to check accuracy, which is what you’re trying to do.


To make a fair comparison between what you get from the emulator and the hardware:

  1. Dequantize the hardware output (which is uint8).
  2. Or quantize the emulator’s float32 output so they match byte-for-byte.

Here’s how to do both.


Option A: Dequantize the Hardware Output

When you run inference on the .hef (using hailort, TAPPAS, etc.), the output is in raw uint8.

To make it comparable:

import numpy as np

def dequantize_uint8_output(output_uint8, scale, zero_point):
    return (output_uint8.astype(np.float32) - zero_point) * scale

To get scale and zero_point, grab the quantization info:

runner = ClientRunner(har='hailo8l_quant.har')
quant_info = runner.get_quantization_info()
print(quant_info)
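If it’s easier, you can also let HailoRT do the dequantization for you by requesting FLOAT32 output vstreams, which is the format_type knob you mentioned trying. A rough sketch based on the DFC_4 inference tutorial pattern (device interface, file names, and input layout are placeholders you’ll need to adapt):

import numpy as np
from hailo_platform import (HEF, VDevice, ConfigureParams, HailoStreamInterface,
                            InputVStreamParams, OutputVStreamParams, FormatType,
                            InferVStreams)

hef = HEF("model.hef")
with VDevice() as device:
    configure_params = ConfigureParams.create_from_hef(hef, interface=HailoStreamInterface.PCIe)
    network_group = device.configure(hef, configure_params)[0]
    network_group_params = network_group.create_params()

    input_params = InputVStreamParams.make(network_group, format_type=FormatType.FLOAT32)
    # FLOAT32 output vstreams make HailoRT return dequantized floats instead of raw uint8
    output_params = OutputVStreamParams.make(network_group, format_type=FormatType.FLOAT32)

    input_name = hef.get_input_vstream_infos()[0].name
    img = np.zeros((1, 192, 256, 1), dtype=np.float32)  # placeholder input image

    with network_group.activate(network_group_params):
        with InferVStreams(network_group, input_params, output_params) as pipeline:
            results = pipeline.infer({input_name: img})  # dict keyed by output vstream name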

Option B: Quantize the Emulated Output

If you’d rather make the emulator’s float output match the hardware uint8:

def quantize_float32_output(output_float, scale, zero_point):
    return np.clip(np.round(output_float / scale + zero_point), 0, 255).astype(np.uint8)

Same thing — pull scale and zero_point from runner.get_quantization_info().
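As a quick sanity check (the scale and zero_point values below are made up purely for illustration; use the ones reported for your own outputs), quantizing and then dequantizing should recover the floats to within about half a quantization step:

import numpy as np

scale, zero_point = 0.05, 128  # illustrative values only

float_out = np.random.uniform(-5.0, 5.0, size=(1, 192, 256, 2)).astype(np.float32)
q = quantize_float32_output(float_out, scale, zero_point)
roundtrip = dequantize_uint8_output(q, scale, zero_point)

print(np.max(np.abs(roundtrip - float_out)))  # <= scale / 2 for values inside the representable range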


To check the accuracy drop, here’s what works best: run both SDK_FLOAT and SDK_QUANTIZED emulations, then compare.

with runner.infer_context(InferenceContext.SDK_FLOAT) as ctx:
    float_outs = runner.infer(ctx, calib_dset)

with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    quant_outs = runner.infer(ctx, calib_dset)
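Then a simple per-output comparison (a minimal sketch, assuming infer() returns one array per output head) gives you a feel for the quantization-only degradation:

import numpy as np

for name, f, q in zip(["conv_out", "cls_out"], float_outs, quant_outs):
    print(f"{name}: mean |float - quant| = {np.mean(np.abs(f - q)):.4f}")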

Thanks for your reply!

Dequantizing the hardware output was exactly what I needed. I’ve now gotten to the point where both the emulated output and the hardware output are float32, so I can compare them directly.

However, now I’m seeing a pretty considerable accuracy drop from my SDK_QUANTIZED emulated output to the .hef hardware output.

For context: to measure the accuracy drop, I’m taking both the raw logits and the softmaxed outputs from the emulated and HEF runs and diffing each against the original ONNX model’s logits/softmax (run via onnxruntime). I’m using 100 arbitrary input images (and their training labels) from a validation dataset for this comparison.
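For reference, here’s roughly what the ONNX side of that comparison looks like (a sketch; I’m assuming the ONNX model takes NCHW float32 input, so the NHWC tensors I feed the Hailo pipelines need a transpose first):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name

# img_nhwc: one validation image, (1, 192, 256, 1) float32, same preprocessing as the Hailo runs
img_nchw = np.transpose(img_nhwc, (0, 3, 1, 2))
onnx_conv_out, onnx_cls_out = sess.run(None, {input_name: img_nchw})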

hardware inference error:
  Logits vs ONNX logits:    0.8343
  Softmax vs ONNX softmax:  0.0103
  Softmax vs label:         0.0152

emulated quantized inference error:
  Logits vs ONNX logits:    0.214
  Softmax vs ONNX softmax:  0.0017
  Softmax vs label:         0.0077

emulated floating-point inference error:
  Logits vs ONNX logits:    0.0013
  Softmax vs ONNX softmax:  0.0
  Softmax vs label:         0.0074

For the error above, I’m doing a simple mean error calculation:

logit_error = np.mean(np.abs(base_out - comparison_out))
softmax_error = np.mean(np.abs(softmax(base_out) - softmax(comparison_out)))
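(For completeness, softmax here can be any standard implementation; a numerically stable numpy version over the class axis, for example:)

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)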

From my understanding, the emulated floating-point and emulated quantized outputs are behaving as expected: effectively no degradation from the onnxruntime output to FP emulated, and an expected decrease in accuracy once quantized. What’s surprising me is that the hardware inference is performing substantially worse than in emulation.

Is this expected? And are there any good ways to debug this and/or improve the .hef’s accuracy? The emulated quantized accuracy is currently within what our use case needs, but the HEF is not, so if we can close that gap we’d be all set.

Thanks again,
Eric