Where to quantize inputs

I’ve got a custom model trained in PyTorch that I want to run on the Hailo-8L on a Raspberry Pi using C++. I think I’m very close to getting it working, but I’m stuck on the format of the inputs.

I’ve followed the async_infer_basic_example. When I load my HEF, the input format type is HAILO_FORMAT_TYPE_UINT8 and the format flags are HAILO_FORMAT_FLAGS_NONE. When I parsed and optimised the model, I did it with float32s. My confusion is that the flags suggest HailoRT will perform the quantization for me, but if that were the case the format type should be float32. When I try to pass in float32s, I get an error that the input buffer is too large (by 4x, of course).

But I’ve also tried doing the quantization myself using the quant_info on the inputs. The model runs and I get a result, but it’s way off from what it should be (far more than a simple rounding error).

So my guess is that I either (a) need to configure something so that I can pass float32 to the model inputs, or (b) change things so that the model expects me to do the quantization first and then pass in uint8. I haven’t figured out how to do either of these, and I’m not sure whether it happens at the DFC stage or the HailoRT stage. I’ve been looking through the docs, the examples and this forum, but haven’t figured it out yet. Would appreciate any tips! Thanks!

To add some more detail here: I’ve also verified that the quantized model works well using the Dataflow Compiler’s inference tools, so I think it’s something I’m not understanding about the C++ API rather than an issue generating the HEF.

Hey @geoff

Welcome to the Hailo Community!

The mismatch you’re seeing is likely due to the input format defined in your HEF file. If it’s set to HAILO_FORMAT_TYPE_UINT8, the API expects quantized uint8 data, not float32. Here are a few ways to address this:

1. Automatic Quantization: If you prefer HailoRT to handle quantization, request FLOAT32 as the user buffer format when setting up your virtual streams:

// input_params is the hailo_vstream_params_t for your input
// (e.g. obtained from ConfiguredNetworkGroup::make_input_vstream_params)
input_params.user_buffer_format.type = HAILO_FORMAT_TYPE_FLOAT32;

This allows you to pass float32 data directly, and HailoRT will quantize it using the qp_scale and qp_zp values from your HEF file. Since you’re following async_infer_basic_example, there’s also an InferModel-based sketch after this list.
2. Manual Quantization: If you want to quantize inputs yourself, use the qp_scale and qp_zp parameters from your HEF file:

// input_buffer holds your float32 values; qp_scale and qp_zp come from the
// input's quant_info (hailo_quant_info_t) in the HEF
for (size_t i = 0; i < buffer_size; i++) {
    quantized_buffer[i] = (uint8_t)((input_buffer[i] / qp_scale) + qp_zp);
}
3. Output Dequantization: If you need float32 results from a quantized output:

// reverse of the input transform: subtract the zero point, then multiply by the scale
for (size_t i = 0; i < output_size; i++) {
    dequantized_output[i] = (float32_t)(quantized_output[i] - qp_zp) * qp_scale;
}
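
Since you mentioned async_infer_basic_example, here’s a rough sketch of the same idea using the InferModel (async) API instead of virtual streams. Treat it as a sketch rather than a drop-in: it assumes a single-input, single-output model, the create_infer_model / set_format_type / configure flow from the async examples, and "model.hef" is just a placeholder path:

#include "hailo/hailort.hpp"

using namespace hailort;

int main() {
    // Create a virtual device and load the HEF into an InferModel
    auto vdevice = VDevice::create().expect("Failed to create vdevice");
    auto infer_model = vdevice->create_infer_model("model.hef").expect("Failed to create infer model");

    // Ask HailoRT to accept float32 input buffers and quantize them internally;
    // optionally request float32 on the output too, so it dequantizes for you
    infer_model->input()->set_format_type(HAILO_FORMAT_TYPE_FLOAT32);
    infer_model->output()->set_format_type(HAILO_FORMAT_TYPE_FLOAT32);

    // Configure as usual; from here on, the buffers you bind must be float32-sized
    auto configured_infer_model = infer_model->configure().expect("Failed to configure infer model");

    // ... create bindings, set your buffers and run inference exactly as in
    // async_infer_basic_example ...
    return 0;
}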

Remember, when using async inference, ensure your buffer format matches the expected type (uint8 or float32).
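
If you want a quick sanity check on that (continuing from the infer_model in the sketch above, and assuming get_frame_size() reflects the user buffer format you configured, which is my understanding from the async examples), compare your buffer size against what HailoRT expects:

// After set_format_type(HAILO_FORMAT_TYPE_FLOAT32), the reported frame size
// should be 4x the uint8 size, i.e. the size of the float32 buffer you pass in
size_t expected_input_bytes = infer_model->input()->get_frame_size();
std::cout << "Expected input buffer size: " << expected_input_bytes << " bytes" << std::endl;  // needs <iostream>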

The key is to either let HailoRT handle the float32-to-uint8 conversion for you (via the vstream params or the InferModel format type), or do the quantization yourself and pass uint8 buffers that match what the HEF expects.

If you need more details on C++ API usage or setting up virtual streams, feel free to ask. Good luck with your implementation!

Thank you so much, that all makes sense and I’ve got it working now. This is very cool, inference on my model has gone from about 1 second with TorchScript on the CPU to 20-30 ms on the Hailo… so happy :slight_smile: