Hi,
I’m trying to compile an ONNX network into an HEF model for the 8L. I’m mostly following the tutorials, specifically “DFC_2_Model_Optimization_Tutorial”.
I can successfully run inference both in emulation and on hardware. However, I can't seem to reconcile the network output of the .har archive (run with the ClientRunner) with that of the .hef file on the chip.
The issue is that I can't seem to run ClientRunner inference in a way that returns the same uint8 values as the HEF. The ClientRunner always returns float32s, regardless of whether I set the InferenceContext to SDK_QUANTIZED or not. Is this by design?
The ONNX model accepts a single input tensor (a 1-channel depth image) and has two outputs.
Here are the steps I've taken.
ONNX-to-HAR conversion:
runner = ClientRunner(hw_arch="hailo8l")
runner.translate_onnx_model("model.onnx", ...)
runner.save_har("hailo8l.har")
Optimization:
# calib_dset is a float32 numpy array of shape (2000, 192, 256, 1), normalized to [0, 1]
calib_dset = load_calibration_dataset()
runner = ClientRunner(har=str(har_path))
# I skip adding a model script
# runner.load_model_script(alls_str)
runner.optimize(calib_dset)
runner.save_har("hailo8l_quant.har")
Compilation:
def compile_har(model_path: Path, out: Path):
    print(f"Compiling HAR at {model_path}")
    runner = ClientRunner(har=str(model_path))
    hef = runner.compile()
    out.write_bytes(hef)
    return out
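which I call with the quantized HAR (paths are just illustrative):

hef_path = compile_har(Path("hailo8l_quant.har"), Path("hailo8l.hef"))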
Emulation Inference:
# NOTE: `img` is a single image from the calibration dataset: (1, 192, 256, 1)
runner = ClientRunner(har="hailo8l_quant.har")
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    conv_out, cls_out = runner.infer(ctx, img)
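To show what I mean about the dtypes, and one of the InferenceContext permutations mentioned below: the quantized-emulation outputs come back as float32, and so do the native-emulation ones; only the on-device HEF outputs are uint8. Roughly:

# Both emulation contexts give me float32 outputs; only the HEF on hardware
# returns uint8 (with the default vstream format).
print(conv_out.dtype, cls_out.dtype)  # float32 float32

# Native (pre-quantization) emulation of the same image, for comparison.
with runner.infer_context(InferenceContext.SDK_NATIVE) as ctx:
    native_conv_out, native_cls_out = runner.infer(ctx, img)
print(native_conv_out.dtype)  # float32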
For the actual HEF inference, I copied the code almost exactly from “DFC_4_Inference_Tutorial”.
I've tried various permutations of InferenceContext values in emulation, normalization params in optimization, and the VStreamParams format_type in the HEF inference.
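For reference, the on-device side is essentially this (condensed from the tutorial, so some details may be slightly off; the format_type arguments are where I've been experimenting):

from hailo_platform import (HEF, VDevice, ConfigureParams, HailoStreamInterface,
                            InferVStreams, InputVStreamParams, OutputVStreamParams,
                            FormatType)

hef = HEF("hailo8l.hef")
with VDevice() as target:
    configure_params = ConfigureParams.create_from_hef(hef, interface=HailoStreamInterface.PCIe)
    network_group = target.configure(hef, configure_params)[0]

    # Feed float32 in; toggling the output format between UINT8 and FLOAT32 is
    # one of the permutations I've tried.
    input_params = InputVStreamParams.make(network_group, format_type=FormatType.FLOAT32)
    output_params = OutputVStreamParams.make(network_group, format_type=FormatType.UINT8)

    input_info = hef.get_input_vstream_infos()[0]
    with InferVStreams(network_group, input_params, output_params) as pipeline:
        with network_group.activate(network_group.create_params()):
            # `img` is the same (1, 192, 256, 1) float32 image used above.
            results = pipeline.infer({input_info.name: img})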
I'm new to quantized networks, so I don't know whether I'm missing something or whether this is the intended behavior of the system. If it's intended, how can I validate the actual accuracy degradation from quantization, given that the emulated outputs have a completely different domain from the actual on-chip outputs? Can I just log-softmax both outputs and compute the loss that way?
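My current guess for bridging the two domains is to dequantize the on-chip uint8 outputs with the per-output scale and zero-point and then compare against the emulated float32s, along the lines of the sketch below. The quant_info.qp_scale / qp_zp fields are my reading of the HailoRT Python API, so please correct me if that's not the right place to get them from (or if SDK_QUANTIZED emulation already accounts for this and I'm overcomplicating it).

import numpy as np

# Sketch: dequantize one on-device output (uint8) back into float using its
# scale / zero-point, then compare against the SDK_QUANTIZED emulation output.
# `results` is the dict returned by InferVStreams.infer() above; `conv_out` is
# the corresponding emulated output.
out_info = hef.get_output_vstream_infos()[0]
raw = results[out_info.name].astype(np.float32)
dequantized = (raw - out_info.quant_info.qp_zp) * out_info.quant_info.qp_scale

print(np.abs(dequantized - conv_out).max())  # hoping this is ~0 if quantization is faithful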