16-bit quantization

Hey everyone,

I’m working on a project where I need to quantize a neural network to 16-bit. My workflow itself works well, but I’m running into an issue: when I use precision mode to quantize the model to 16-bit, the weights and activations end up in the range 0 to 32,000 (even though I expect them to be in –32,000 to 32,000).
When I use the default INT8 optimization, everything is correctly represented in the range –128 to 127.

Can someone explain why this happens or how I can fix it? Does this have something to do with the zero-point?

Thank you very much,
Lars

Hey @Lars_Hofmann ,

Good question! What you’re seeing—quantized values in the 0 to ~32,000 range when using 16-bit—is expected with Hailo’s quantization. It’s not an issue with the 16-bit mode itself.

Hailo uses unsigned integers for all quantized values. The actual (real) values are recovered using a scale and zero-point (qp_scale, qp_zp). So even though the stored ints are non-negative, the real values can still be negative or symmetric around zero.

Key points:

  • Seeing only positive values in histograms is expected.
  • The real dynamic range depends on the scale and zero-point.
  • You can constrain real output ranges using force_range_out, e.g.:
quantization_param(conv1, force_range_out=[-1.0, 1.0])

This behavior is part of how Hailo’s quantization works—not a bug.

Hope that helps!

Hello @Omria,

Thank you for your quick response.
Our inputs and outputs are as follows:

Inputs:
19 float32 inputs in the range –1 to 1.
Outputs:
Output 1: float32 in the range 0 to 1.1
Output 2: float32 in the range –1.1 to 1.1

Before sending the data, we manually quantized the inputs into a range of 0 to 255, meaning:
0 corresponds to –1
127 corresponds to 0
255 corresponds to 1

After inference, we dequantized the outputs as follows:
Output 1:
0 corresponds to 0
255 corresponds to 1.1

Output 2:
0 corresponds to –1.1
127 corresponds to 0
255 corresponds to 1.1

We want the model to run in 8-bit and ideally also in 16-bit precision.

For running our HEF, we currently use a single input stream and a single output stream. The challenge is that we have two outputs with different value ranges. Does this mean we need separate output streams for each output? If so, how should this be set up?
For calibration, we used realistic simulation data. Based on this, are the zero-point and scaling factors derived automatically?

Below is the architecture of the network:

Hey @Lars_Hofmann,

You do not need to manually handle the 0/127/255 mappings or create multiple vstreams solely because your outputs have different ranges. Here is how it works:

How Hailo Handles Quantization

When you compile your model, the Dataflow Compiler automatically assigns quantization parameters (qp_scale and qp_zp) to each input and output layer in the HEF. The transformations operate as follows:

  • Input (host → device): x_device = x_host / qp_scale + qp_zp
  • Output (device → host): y_host = (y_device - qp_zp) * qp_scale

Handling Different Output Ranges

Each output layer receives its own quantization parameters, so having different ranges such as:

  • Output 1: [0, 1.1]
  • Output 2: [-1.1, 1.1]

is fully supported. The compiler derives separate qp_scale and qp_zp values for each output during calibration. You need separate vstreams only if your HEF contains multiple output tensors—not merely because the ranges differ.

Recommended Approach for Inference

Instead of performing manual mapping, configure your vstreams to use the FLOAT32 format:

hailort::InputVStreamParams in_params;
in_params.format.type = HAILO_FORMAT_TYPE_FLOAT32;

hailort::OutputVStreamParams out_params;
out_params.format.type = HAILO_FORMAT_TYPE_FLOAT32;

This allows you to work with floating-point values in their natural ranges, while HailoRT automatically applies the appropriate quantization parameters.

Optional: Forcing Specific Ranges

If you want to enforce a specific output range during quantization, you can use a model-script command:

quantization_param(output_layer_name, force_range_out=[-1.1, 1.1])

If you would like, share the output of hailortcli parse-hef your_model.hef and the results of get_*_vstream_infos()—particularly the quant_info section—so we can examine what may be occurring during inference.

Hope this helps!

Hey @omria,

here is the information you asked for.

hailortcli parse-hef SAC_Agent_8Bit.hef gives:

The quant info is:

Input stream size: 1
Input frame size: 24
Input quantization info:
Scale: 0.00976785
Zero point: 114
Output frame size: 8
Output quantization info:
Scale: 0.0326291
Zero point: 130

The functions relevant for inference are the following; at the moment we are using raw streams:

// Initialize Hailo
hailo_status Init()
{
    // Create connection to the Hailo chip
    auto device = Device::create();
    if (!device) {
        std::cerr << "Failed to create device " << device.status() << std::endl;
        return device.status();
    }

    // Load the HEF file onto the Hailo chip
    auto network_group = this->configure_network_group(*device.value());
    if (!network_group) {
        std::cerr << "Failed to configure network group " << HEF_FILE << std::endl;
        return network_group.status();
    }

    // Get input and output streams
    infer_inputs = network_group->get()->get_input_streams();
    infer_outputs = network_group->get()->get_output_streams();

    std::cout << "Input stream size: " << infer_inputs.size() << std::endl;

    auto &input0 = infer_inputs[0].get(); // .get() fetches the referenced InputStream instance

    auto input_info = input0.get_info();
    std::cout << "Input frame size: " << input0.get_frame_size() << std::endl;

    auto quant_info_in = input0.get_quant_infos();
    std::cout << "Input quantization info:" << std::endl;
    for (const auto &info : quant_info_in) {
        std::cout << "Scale: " << info.qp_scale << std::endl;
        std::cout << "Zero point: " << info.qp_zp << std::endl;
    }

    auto &output0 = infer_outputs[0].get();

    auto output_info = output0.get_info();
    std::cout << "Output frame size: " << output0.get_frame_size() << std::endl;

    auto quant_info_out = output0.get_quant_infos();
    std::cout << "Output quantization info:" << std::endl;
    for (const auto &info : quant_info_out) {
        std::cout << "Scale: " << info.qp_scale << std::endl;
        std::cout << "Zero point: " << info.qp_zp << std::endl;
    }

    // Enable communication with the Hailo chip
    auto activated_network_group = network_group.value()->activate();
    if (!activated_network_group) {
        std::cerr << "Failed to activate network group " << activated_network_group.status();
        return activated_network_group.status();
    }

    return HAILO_SUCCESS;
}

void write_all(InputStream &input, hailo_status &status)
{
    // Creates a translator between the input data in host memory and the data format the chip needs
    auto transform_context = InputTransformContext::create(input.get_info(), {}, FORMAT_TYPE);
    if (!transform_context) {
        status = transform_context.status();
        return;
    }

    std::vector<uint8_t> host_data(transform_context.value()->get_src_frame_size());
    std::vector<uint8_t> hw_data(input.get_frame_size());

    // Fill the input data
    host_data.at(0) = wkappa_q;
    host_data.at(1) = v_q;
    host_data.at(2) = ey_q;
    host_data.at(3) = epsi_q;
    host_data.at(4) = wx_q;
    host_data.at(5) = leadobs_q.at(0).at(0);
    host_data.at(6) = leadobs_q.at(0).at(1);
    host_data.at(7) = leadobs_q.at(0).at(2);
    host_data.at(8) = leadobs_q.at(0).at(3);
    host_data.at(9) = leadobs_q.at(1).at(0);
    host_data.at(10) = leadobs_q.at(1).at(1);
    host_data.at(11) = leadobs_q.at(1).at(2);
    host_data.at(12) = leadobs_q.at(1).at(3);
    host_data.at(13) = delayed_actions_q.at(0).at(0);
    host_data.at(14) = delayed_actions_q.at(0).at(1);
    host_data.at(15) = delayed_actions_q.at(1).at(0);
    host_data.at(16) = delayed_actions_q.at(1).at(1);
    host_data.at(17) = delayed_actions_q.at(2).at(0);
    host_data.at(18) = delayed_actions_q.at(2).at(1);

    for (size_t i = 0; i < FRAMES_COUNT; i++) {
        status = transform_context.value()->transform(MemoryView(host_data.data(), host_data.size()),
            MemoryView(hw_data.data(), hw_data.size()));
        if (HAILO_SUCCESS != status) {
            return;
        }

        status = input.write(MemoryView(hw_data.data(), hw_data.size()));
        if (HAILO_SUCCESS != status) {
            return;
        }
    }

    return;
}

void read_all(OutputStream &output, hailo_status &status)
{
    auto transform_context = OutputTransformContext::create(output.get_info(), {}, FORMAT_TYPE);
    if (!transform_context) {
        status = transform_context.status();
        return;
    }

    std::vector<uint8_t> hw_data(output.get_frame_size());
    std::vector<uint8_t> host_data(transform_context.value()->get_dst_frame_size());

    std::cout << "Bytes per output frame in host format: " << transform_context.value()->get_dst_frame_size() << std::endl;
    std::cout << "Bytes per output frame in Hailo chip format: " << output.get_frame_size() << std::endl;

    for (size_t i = 0; i < FRAMES_COUNT; i++) {
        status = output.read(MemoryView(hw_data.data(), hw_data.size()));
        if (HAILO_SUCCESS != status) {
            return;
        }

        std::cout << "Output data before transformation:" << std::endl;
        for (size_t j = 0; j < hw_data.size(); j++) {
            std::cout << static_cast<int>(hw_data[j]) << " ";
        }
        std::cout << std::endl;

        status = transform_context.value()->transform(MemoryView(hw_data.data(), hw_data.size()),
            MemoryView(host_data.data(), host_data.size()));
        if (HAILO_SUCCESS != status) {
            return;
        }
    }

    std::cout << "Output data after transformation:" << std::endl;
    for (size_t i = 0; i < host_data.size(); i++) {
        std::cout << static_cast<int>(host_data[i]) << " ";
    }
    std::cout << std::endl;

    //std::cout << "Output[0]: " << static_cast<int>(host_data[0]) << std::endl;
    //std::cout << "Output[1]: " << static_cast<int>(host_data[1]) << std::endl;

    // Save results from the Hailo chip
    pedals = host_data.at(0);
    steering = host_data.at(1);

    // Safety clamp
    if (pedals < 0) pedals = 0;
    if (pedals > 255) pedals = 255;
    if (steering < 0) steering = 0;
    if (steering > 255) steering = 255;

    delayed_actions.at(2).at(0) = delayed_actions.at(1).at(0);
    delayed_actions.at(2).at(1) = delayed_actions.at(1).at(1);
    delayed_actions.at(1).at(0) = delayed_actions.at(0).at(0);
    delayed_actions.at(1).at(1) = delayed_actions.at(0).at(1);
    delayed_actions.at(0).at(0) = pedals;
    delayed_actions.at(0).at(1) = steering;

    return;
}

hailo_status infer(InputStreamRefVector &input_streams, OutputStreamRefVector &output_streams)
{
    hailo_status status = HAILO_SUCCESS; // success oriented
    hailo_status input_status[MAX_LAYER_EDGES] = {HAILO_UNINITIALIZED};
    hailo_status output_status[MAX_LAYER_EDGES] = {HAILO_UNINITIALIZED};
    std::unique_ptr<std::thread> input_threads[MAX_LAYER_EDGES];
    std::unique_ptr<std::thread> output_threads[MAX_LAYER_EDGES];
    size_t input_thread_index = 0;
    size_t output_thread_index = 0;

    // Input threads
    for (input_thread_index = 0; input_thread_index < input_streams.size(); input_thread_index++) {
        input_threads[input_thread_index] = std::make_unique<std::thread>(
            [this](InputStream &is, hailo_status &st) {
                this->write_all(is, st);
            },
            std::ref(input_streams[input_thread_index]),
            std::ref(input_status[input_thread_index])
        );
    }

    // Output threads
    for (output_thread_index = 0; output_thread_index < output_streams.size(); output_thread_index++) {
        output_threads[output_thread_index] = std::make_unique<std::thread>(
            [this](OutputStream &os, hailo_status &st) {
                this->read_all(os, st);
            },
            std::ref(output_streams[output_thread_index]),
            std::ref(output_status[output_thread_index])
        );
    }

    // Join write threads
    for (size_t i = 0; i < input_thread_index; i++) {
        input_threads[i]->join();
        if (HAILO_SUCCESS != input_status[i]) {
            status = input_status[i];
        }
    }

    // Join read threads
    for (size_t i = 0; i < output_thread_index; i++) {
        output_threads[i]->join();
        if (HAILO_SUCCESS != output_status[i]) {
            status = output_status[i];
        }
    }

    if (HAILO_SUCCESS == status) {
        std::cout << "Inference finished successfully" << std::endl;
    }

    return status;
}

void step()
{
    calculate_SAC_inputs();

    // Quantization of inputs
    quantizeInputs();

    // Inference
    auto status = this->infer(infer_inputs, infer_outputs);

    // Dequantization of outputs
    dequantizeOutputs();

    // Send carinputs message
    std::cout << "Publish carinputs message!" << std::endl;
    mbmadmsgs::msg::CarInputs carInputsMsg;

    carInputsMsg.carid = static_cast<uint8_t>(CarParameters::p()->carid);
    carInputsMsg.opmode = opModeFsm.getStateId();
    carInputsMsg.pedals = pedals;
    carInputsMsg.steering = steering;

    if (cpSeq.empty() == false) {
        // avoid invalid deadline measurements when there is no camera data (i.e. during startup)
        cpSeq.update(CheckpointGraph::Checkpoint::CtrlPublish);
        carInputsMsg.cpseq = cpSeq.message();
    }

#ifdef XXXXXXXX // too restrictive
    if (cpSeq.health().health() == false) {
        // emergency halt
        carInputsMsg.cmd = carInputsMsg.CMD_HALT;
        carInputsMsg.pedals = 0.0F;
        // heal immediately
        cpSeq.health().recover();
        BOOST_LOG_TRIVIAL(error) << "[ERROR] no health -> emergency halt";
    }
#endif
    pubInputs->publish(carInputsMsg);

    mbmadmsgs::msg::CtrlReference ctrlReferenceMsg;
    //ctrlReferenceMsg.carid = static_cast<uint8_t>(CarParameters::p()->carid);
    //ctrlReferenceMsg.s.at(0) = pathController.ws.at(0);
    //ctrlReferenceMsg.s.at(1) = pathController.ws.at(1);
    //ctrlReferenceMsg.psi = pathController.wpsi;
    ctrlReferenceMsg.v = vref;
    //ctrlReferenceMsg.x = pathController.wx;
    ctrlReferenceMsg.crashctr = lapCounter.getCrashCtr();
    ctrlReferenceMsg.lapctr = lapCounter.getLapCtr();
    ctrlReferenceMsg.laptime = lapCounter.getLapTime();
    ctrlReferenceMsg.avgspeed = lapCounter.getAvgSpeed();
    ctrlReferenceMsg.currentlaptime = lapCounter.getCurrentLapTime();
    ctrlReferenceMsg.prob = prob;
    pubCtrlReference->publish(ctrlReferenceMsg);

    std::cout << "Published carinputs messages!" << std::endl;

    std::cout << "Step function finished!" << std::endl;
}

Is it okay for us to use raw streams in our case? If not, how do we need to change it so that it works with vstreams?

Thanks for your help
Lars