reconciling different outputs between quantized HAR and compiled HEF

Hi,

I’m trying to compile an ONNX network into an HEF model for the 8L. I’m mostly following the tutorials, specifically “DFC_2_Model_Optimization_Tutorial”.

I can successfully run inference both in emulation and on hardware. However, I can’t seem to reconcile the network output of the .har archive (run with the ClientRunner) with the output of the .hef file on the chip.

The issue is that I can’t seem to run the ClientRunner inference in a way which returns the same uint8 values as the HEF. The ClientRunner always returns float32s, irrespective of whether I set the InferenceContext to SDK_QUANTIZED or not. Is this by design?

The ONNX model accepts a single input tensor (a 1 channel depth image), and has two outputs.

Here are the steps I’ve taken.

ONNX-to-HAR conversion:

runner = ClientRunner(hw_arch="hailo8l")
runner.translate_onnx_model("model.onnx", ...)
runner.save_har("hailo8l.har")

Optimization:

# calib_dset is a float32 numpy array of shape (2000, 192, 256, 1), normalized to [0, 1]
calib_dset = load_calibration_dataset()
runner = ClientRunner(har=str(har_path))

# I skip adding a model script
# runner.load_model_script(alls_str)

runner.optimize(calib_dset)
runner.save_har("hailo8l_quant.har")

Compilation:

from pathlib import Path

def compile_har(model_path: Path, out: Path):
    print(f"Compiling HAR at {model_path}")

    runner = ClientRunner(har=str(model_path))
    hef = runner.compile()

    out.write_bytes(hef)

    return out
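Called like this, for example (paths here are just placeholders):

compile_har(Path("hailo8l_quant.har"), Path("model.hef"))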

Emulation Inference:

# NOTE: `img` is a single image from the calibration dataset: (1,192,256,1)

runner = ClientRunner(har="hailo8l_quant.har")
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    conv_out, cls_out = runner.infer(ctx, img)

For the actual HEF inference, I copied the code almost exactly from “DFC_4_Inference_Tutorial”.

I’ve tried various permutations of InferenceContext settings in emulation, normalization params in optimization, and the VStreamParams format_type in the HEF inference.

I’m new to quantized networks, so I don’t know if this is something I’m missing or if it’s intended behavior of the system. If it’s intended, I’m curious how I can validate the actual accuracy degradation of the quantization process, since the emulated outputs have a completely different domain from the actual outputs; can I just log-softmax them and calculate the loss that way?

Hey @Eric_Schwarzenbach,

Welcome to the community!

The behavior you’re seeing is totally expected. Let me break down why it happens and how to make sure the outputs from ClientRunner line up with what you get when running the actual .hef on the Hailo-8L.

The ClientRunner always returns float32, and this is by design:

  • Even if you use InferenceContext.SDK_QUANTIZED, the infer() output is still in dequantized float32, not raw uint8.

  • That’s because SDK_QUANTIZED tells the SDK to simulate quantized inference internally — but it converts the output back to float32 using the quantization params (scale and zero-point).

  • This makes it much easier to check accuracy, which is what you’re trying to do.


To make a fair comparison between what you get from the emulator and the hardware:

  1. Dequantize the hardware output (which is uint8).
  2. Or quantize the emulator’s float32 output so they match byte-for-byte.

Here’s how to do both.


Option A: Dequantize the Hardware Output

When you run inference on the .hef (using hailort, TAPPAS, etc.), the output is in raw uint8.

To make it comparable:

import numpy as np

def dequantize_uint8_output(output_uint8, scale, zero_point):
    return (output_uint8.astype(np.float32) - zero_point) * scale

To get scale and zero_point, grab the quantization info:

runner = ClientRunner(har='hailo8l_quant.har')
quant_info = runner.get_quantization_info()
print(quant_info)
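If it’s easier, you can also let HailoRT do the dequantization for you by requesting FLOAT32 output vstreams, which is the format_type knob you mentioned trying. A rough sketch based on the DFC_4 inference tutorial pattern (device interface, file names, and input layout are placeholders you’ll need to adapt):

import numpy as np
from hailo_platform import (HEF, VDevice, ConfigureParams, HailoStreamInterface,
                            InputVStreamParams, OutputVStreamParams, FormatType,
                            InferVStreams)

hef = HEF("model.hef")
with VDevice() as device:
    configure_params = ConfigureParams.create_from_hef(hef, interface=HailoStreamInterface.PCIe)
    network_group = device.configure(hef, configure_params)[0]
    network_group_params = network_group.create_params()

    input_params = InputVStreamParams.make(network_group, format_type=FormatType.FLOAT32)
    # FLOAT32 output vstreams make HailoRT return dequantized floats instead of raw uint8
    output_params = OutputVStreamParams.make(network_group, format_type=FormatType.FLOAT32)

    input_name = hef.get_input_vstream_infos()[0].name
    img = np.zeros((1, 192, 256, 1), dtype=np.float32)  # placeholder input image

    with network_group.activate(network_group_params):
        with InferVStreams(network_group, input_params, output_params) as pipeline:
            results = pipeline.infer({input_name: img})  # dict keyed by output vstream name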

Option B: Quantize the Emulated Output

If you’d rather make the emulator’s float output match the hardware uint8:

def quantize_float32_output(output_float, scale, zero_point):
    return np.clip(np.round(output_float / scale + zero_point), 0, 255).astype(np.uint8)

Same thing — pull scale and zero_point from runner.get_quantization_info().
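As a quick sanity check (the scale and zero_point values below are made up purely for illustration; use the ones reported for your own outputs), quantizing and then dequantizing should recover the floats to within about half a quantization step:

import numpy as np

scale, zero_point = 0.05, 128  # illustrative values only

float_out = np.random.uniform(-5.0, 5.0, size=(1, 192, 256, 2)).astype(np.float32)
q = quantize_float32_output(float_out, scale, zero_point)
roundtrip = dequantize_uint8_output(q, scale, zero_point)

print(np.max(np.abs(roundtrip - float_out)))  # <= scale / 2 for values inside the representable range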


To check the accuracy drop, here’s what works best: run both SDK_FLOAT and SDK_QUANTIZED emulations, then compare.

with runner.infer_context(InferenceContext.SDK_FLOAT) as ctx:
    float_outs = runner.infer(ctx, calib_dset)

with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    quant_outs = runner.infer(ctx, calib_dset)
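Then a simple per-output comparison (a minimal sketch, assuming infer() returns one array per output head) gives you a feel for the quantization-only degradation:

import numpy as np

for name, f, q in zip(["conv_out", "cls_out"], float_outs, quant_outs):
    print(f"{name}: mean |float - quant| = {np.mean(np.abs(f - q)):.4f}")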

Thanks for your reply!

Dequantizing the hardware output was exactly what I needed. I’ve now gotten to the point where both the emulated output and the hardware output are float32, so I can compare them directly.

However, now I’m seeing a pretty considerable accuracy drop from my SDK_QUANTIZED emulated output to the .hef hardware output.

For context: to measure the accuracy drop, I’m taking both the raw logits and the softmaxed outputs from the emulated and HEF runs and diffing each against the original ONNX model’s logits/softmax (run via onnxruntime). I’m using 100 arbitrary input images (and their training labels) from a validation dataset for this comparison.
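For reference, here’s roughly what the ONNX side of that comparison looks like (a sketch; I’m assuming the ONNX model takes NCHW float32 input, so the NHWC tensors I feed the Hailo pipelines need a transpose first):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name

# img_nhwc: one validation image, (1, 192, 256, 1) float32, same preprocessing as the Hailo runs
img_nchw = np.transpose(img_nhwc, (0, 3, 1, 2))
onnx_conv_out, onnx_cls_out = sess.run(None, {input_name: img_nchw})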

hardware inference error:
  Logits vs ONNX logits:    0.8343
  Softmax vs ONNX softmax:  0.0103
  Softmax vs label:         0.0152

emulated quantized inference error:
  Logits vs ONNX logits:    0.214
  Softmax vs ONNX softmax:  0.0017
  Softmax vs label:         0.0077

emulated floating-point inference error:
  Logits vs ONNX logits:    0.0013
  Softmax vs ONNX softmax:  0.0
  Softmax vs label:         0.0074

For the error above, I’m doing a simple mean error calculation:

logit_error = np.mean(np.abs(base_out - comparison_out))
softmax_error = np.mean(np.abs(softmax(base_out) - softmax(comparison_out)))
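(For completeness, softmax here can be any standard implementation; a numerically stable numpy version over the class axis, for example:)

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)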

From my understanding, the emulated floating-point and emulated quantized outputs are behaving as expected: effectively no degradation from the onnxruntime output to FP emulated, and an expected decrease in accuracy once quantized. What’s surprising me is that the hardware inference is performing substantially worse than in emulation.

Is this expected? And are there any good ways to debug this and/or improve the .hef’s accuracy? The emulated quantized accuracy is currently within what our use case needs, but the HEF is not, so if we can close that gap we’d be all set.

Thanks again,
Eric