I did some personal digging as well. I compiled the HEF and also ran emulation for the native model, the optimized floating-point model, and the quantized model.
The native model and the optimized floating-point model give the same output probabilities, and post-processing them yields the same answer as ONNX inference. The quantized emulation does not give exactly the same probabilities, but post-processing it still yields the same (correct) answer as ONNX inference.
However, inference using the HEF does not give the same output as the emulated quantized model. They are not even close (I understand an exact match isn't expected, but the difference is far larger than that).
Here is part of the output to demonstrate the issue (the last "slot" of the OCR model):
SDK_QUANTIZED Emulation
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0.9960784 ]
HEF
[0.1607843 , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0.02745098,
0. , 0.76470584, 0.04313725, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ]
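To quantify the mismatch, here is a small self-contained NumPy sketch built from the two dumps above. The index positions are my reading of the printed layout (five values per row), so treat them as an assumption. It shows that even the argmax disagrees (36 for emulation vs. 26 for the HEF), and that every nonzero value sits on a 1/255 grid, consistent with dequantized uint8 outputs:

```python
import numpy as np

# Last OCR "slot" from the two runs above (37 class scores each).
# Index positions inferred from the printed layout (assumption).
emu = np.zeros(37, dtype=np.float32)
emu[36] = 0.9960784  # ~254/255

hef = np.zeros(37, dtype=np.float32)
hef[0] = 0.1607843    # ~41/255
hef[24] = 0.02745098  # 7/255
hef[26] = 0.76470584  # ~195/255
hef[27] = 0.04313725  # 11/255

print("emu argmax:", emu.argmax())                   # 36
print("hef argmax:", hef.argmax())                   # 26
print("max abs diff:", np.abs(emu - hef).max())      # ~0.996

# Both dumps look like dequantized uint8 tensors (scale ~= 1/255):
grid = np.concatenate([emu, hef]) * 255
print("on 1/255 grid:", np.allclose(grid, np.round(grid), atol=1e-3))
```

So this is not a small rounding discrepancy: the two runs assign the peak probability to entirely different classes.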
It seems this discrepancy between emulation and the HEF is not specific to this model:
Both of these threads report differences between quantized emulation and HEF inference.