Quantization of CNN with multiple Outputs and Fusion Layer

DFC : v.3.32.0
HailoRT: v4.22.0

Hi Hailo Community,

For weeks I have been trying to solve the following problem: I am trying to compile a customized version of XFeat (a feature extractor). The original architecture is not directly suitable for Hailo quantization, but I managed to adapt it with the following steps in my Python toolchain:

  • Input: image of dim 1x608x800x1 where H,W are fixed
  • Output:
    • Feature tensor (1x76x100x64) feats → feeds into heatmap
    • Keypoints tensor (1x76x100x65) kpts (fed directly from the input)
    • Heatmap (1x76x100x1)
  1. Sanitize ONNX (add kernel shape attributes to conv layers)
  2. Export the sanitized ONNX to HAR and compare to the original ONNX using test data → looks good
  3. Floating-point optimization of the HAR, adding a normalization layer because I want to feed uint8 images directly into the model (results are good):
model_script_commands = [
    "normalization1 = normalization([0.0], [255.0])\n",
]
self._client_runner.load_model_script("".join(model_script_commands))
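The normalization([0.0], [255.0]) command makes the compiled model compute (x - mean) / std on-chip, so the uint8 input ends up in [0, 1] exactly as host-side float preprocessing would produce. A minimal numpy sketch of that equivalence (illustrative only, not DFC code):

```python
import numpy as np

# uint8 input image as it would be fed to the HEF
img_u8 = np.array([[0, 64, 128, 255]], dtype=np.uint8)

# what the injected normalization layer computes: (x - mean) / std
mean, std = 0.0, 255.0
on_chip = (img_u8.astype(np.float32) - mean) / std

# equivalent host-side float preprocessing
host = img_u8.astype(np.float32) / 255.0

assert np.allclose(on_chip, host)
print(on_chip)  # values in [0, 1]
```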
  4. Optimize (quantize the model) with:
Model script commands:
model_optimization_flavor(optimization_level=4, compression_level=0, batch_size=8)
quantization_param(accelerated_features_hailo_inferred_sanitized/output_layer1, precision_mode=a16_w16)
quantization_param(accelerated_features_hailo_inferred_sanitized/output_layer2, precision_mode=a16_w16)
quantization_param(accelerated_features_hailo_inferred_sanitized/output_layer3, precision_mode=a16_w16)
quantization_param(accelerated_features_hailo_inferred_sanitized/conv6, force_range_out=[-10.0, 10.0])
quantization_param(accelerated_features_hailo_inferred_sanitized/avgpool4, force_range_out=[-10.0, 10.0])
quantization_param(accelerated_features_hailo_inferred_sanitized/ew_add1, force_range_out=[-10.0, 10.0])
quantization_param(accelerated_features_hailo_inferred_sanitized/conv13, force_range_out=[-10.0, 10.0])
quantization_param(accelerated_features_hailo_inferred_sanitized/conv17, force_range_out=[-10.0, 10.0])
quantization_param(accelerated_features_hailo_inferred_sanitized/ew_add2, force_range_out=[-10.0, 10.0])
quantization_param(accelerated_features_hailo_inferred_sanitized/conv16, force_range_out=[-10.0, 10.0])
quantization_param(accelerated_features_hailo_inferred_sanitized/conv21, force_range_out=[-10.0, 10.0])
quantization_param(accelerated_features_hailo_inferred_sanitized/ew_add3, force_range_out=[-10.0, 10.0])
model_optimization_config(globals, output_encoding_vector=enabled)
allocator_param(enable_muxer=False)
quantization_param({conv*}, precision_mode=a16_w16)
model_optimization_config(calibration, batch_size=8, calibset_size=256)
pre_quantization_optimization(layer_norm_decomposition, mode=nn_core, eq_consumer=False)
pre_quantization_optimization(activation_clipping, layers={*}, mode=percentile, clipping_values=[0.01, 99.99])
post_quantization_optimization(finetune, policy=enabled, learning_rate=1.0e-4, batch_size=8, epochs=4, dataset_size=2048)

Note that this is a first attempt; force_range_out=[-10.0, 10.0] was chosen for layers that feed into element-wise add operations, see this post here. Also, I chose per-channel scales via model_optimization_config(globals, output_encoding_vector=enabled).
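The activation_clipping command in the script clips each layer's calibration range to the given percentiles before the quantization scales are derived, so rare outliers do not blow up the dynamic range. A toy numpy illustration of the idea (not the DFC implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy activation sample with a few extreme outliers
acts = np.concatenate([rng.normal(0.0, 1.0, 100_000), [50.0, -80.0]])

# restrict the calibration range to the [0.01, 99.99] percentiles,
# discarding the outliers that would otherwise dominate the scale
lo, hi = np.percentile(acts, [0.01, 99.99])
clipped = np.clip(acts, lo, hi)

print(f"raw range:     [{acts.min():.2f}, {acts.max():.2f}]")
print(f"clipped range: [{clipped.min():.2f}, {clipped.max():.2f}]")
```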

  5. Compare the original ONNX to the quantized HAR → kind of OK (?)

Here is the output of the first 5 values, plus the mean and max error over the entire output tensors:

+-----------+----------------------------------------------------+----------------------------------------------------+-----------+------------+
| Tensor    | ONNX first 5                                       | Hailo first 5                                      | Max diff  | Mean diff  |
+-----------+----------------------------------------------------+----------------------------------------------------+-----------+------------+
| features  | [ 1.79381, -0.31311,  0.31845,  0.27129,  0.2671 ] | [ 1.56761, -0.38633,  0.5146 ,  0.25054,  0.4213 ] | 11.8055   | 0.092895   |
| keypoints | [-1.84859, -3.11717, -1.93641, -2.12238, -1.73115] | [-1.79705, -2.98431, -1.7902 , -1.99766, -1.72035] | 3.40494   | 0.198497   |
| heatmap   | [0.14378, 0.14145, 0.13122, 0.11162, 0.09499]      | [0.14063, 0.13501, 0.11869, 0.10364, 0.09165]      | 0.0654779 | 0.00188323 |
+-----------+----------------------------------------------------+----------------------------------------------------+-----------+------------+
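For reference, the max/mean-diff columns above are computed elementwise over the whole output tensors, along these lines (hypothetical helper, numpy):

```python
import numpy as np

def compare(onnx_out: np.ndarray, hailo_out: np.ndarray):
    """Elementwise absolute difference of two equally shaped output tensors."""
    diff = np.abs(onnx_out - hailo_out)
    return diff.max(), diff.mean()

# toy example using the first three "features" values from the table
a = np.array([1.79381, -0.31311, 0.31845])
b = np.array([1.56761, -0.38633, 0.5146])
max_d, mean_d = compare(a, b)
print(max_d, mean_d)
```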

Quantization noise of output layers:

Output layers signal-to-noise ratio (SNR): measures the quantization noise (higher is better)
	accelerated_features_hailo_inferred_sanitized/keypoints SNR:	22.02 dB
	accelerated_features_hailo_inferred_sanitized/heatmap   SNR:	31.8 dB
	accelerated_features_hailo_inferred_sanitized/features  SNR:	21.42 dB

I think all outputs are still problematic, but my focus is on the features. The heatmap possibly looks better than it actually is, since its output passes through a sigmoid activation and is therefore confined to [0.0, 1.0].
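For context, SNR figures like the ones above can be reproduced from the float reference and the quantized output. Assuming the usual power definition SNR = 10·log10(‖signal‖² / ‖signal − quantized‖²) (the exact convention the DFC uses is not documented here), a sketch:

```python
import numpy as np

def snr_db(reference: np.ndarray, quantized: np.ndarray) -> float:
    """SNR in dB: ratio of signal power to quantization-noise power."""
    noise = reference - quantized
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

ref = np.array([1.0, -2.0, 0.5, 3.0])
q = ref + np.array([0.05, -0.03, 0.02, 0.04])  # small quantization error
print(f"{snr_db(ref, q):.1f} dB")  # low noise -> high SNR (~34 dB here)
```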

Even more problematic: when I run inference with my HEF on actual hardware in C++, similarly to the vstreams C++ example (vstreams_example.cpp), I see a significant mismatch between the dequantized output and the quantized HAR output (I leave dequantization to HailoRT, as I request FLOAT32 in NHWC output format):

HAR (expected)              :  1.567613 -0.386329  0.514596  0.250543  0.421296
HEF feat (first 5 NHWC flat): -0.907798 -0.0109657 0.377234 -0.0336807 0.33477 
HEF feat (first 5 NHWC flat): -0.912227 -0.0194008 0.465392 -0.0357428 0.33477 
HEF feat (first 5 NHWC flat): -0.910013 -0.0151832 0.393635 -0.0329934 0.339921 

HAR (expected)              : -1.797053 -2.984307 -1.790201 -1.997657 -1.720350
HEF kpts (first 5 NHWC flat): -1.79228  -2.97407  -1.77757  -1.98365  -1.71702 
HEF kpts (first 5 NHWC flat): -1.79228  -2.97407  -1.77757  -1.98365  -1.71702 
HEF kpts (first 5 NHWC flat): -1.79228  -2.97407  -1.77757  -1.98365  -1.71702 

HAR (expected)                 : 0.140629 0.135014 0.118686 0.103641 0.091647
HEF Heatmap (first 5 NHWC flat): 0.146641 0.140934 0.140721 0.139775 0.139439
HEF Heatmap (first 5 NHWC flat): 0.14655  0.140477 0.140477 0.139592 0.13947
HEF Heatmap (first 5 NHWC flat): 0.146062 0.140843 0.140507 0.139592 0.139286

My questions:

  • Would you deem the quantization results OK as a first shot? Am I doing anything significantly wrong?
  • I would expect very similar results when comparing HEF vs quantized HAR outputs - independent of whether the HAR quantization itself is good. Side info: the uint16 outputs look similar overall, but on each inference run the HEF uint16 values are slightly different for the heatmap and features - funnily, not for the keypoints. Is this normal?

In case it helps: output of hailortcli parse-hef:

Architecture HEF was compiled for: HAILO8L
Network group name: accelerated_features_hailo_inferred_sanitized, Multi Context - Number of contexts: 4
    Network name: accelerated_features_hailo_inferred_sanitized/accelerated_features_hailo_inferred_sanitized
        VStream infos:
            Input  accelerated_features_hailo_inferred_sanitized/input_layer1 UINT8, NHWC(608x800x1)
            Output accelerated_features_hailo_inferred_sanitized/conv9 UINT16, FCR(76x100x65)
            Output accelerated_features_hailo_inferred_sanitized/conv27 UINT16, FCR(76x100x1)
            Output accelerated_features_hailo_inferred_sanitized/conv24 UINT16, NHWC(76x100x64)

conv9 ~ keypoints (Format FCR?)
conv27 ~ heatmap (Format FCR?)
conv24 ~ features (Format NHWC?)

Sorry for the long post, but I have the feeling this should be possible. As I am not an expert on quantization topics, I may have missed some important details, and I am definitely at a point where I need help.

Cheers,
Konrad

Hey @Ko_Si,

Welcome to the Hailo Community!

Let's address the issues:

Are your results reasonable?

Yes, your configuration looks good:

  • DFC 3.32.0 with HailoRT 4.22.0
  • a16_w16 quantization with per-channel output scales
  • SNR values of 21–32 dB are acceptable for early-stage custom models

There’s no strict “good” SNR threshold for custom models, so your results don’t indicate any compilation or runtime issues.
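As a rough intuition (assuming the amplitude convention SNR = 20·log10(‖signal‖ / ‖noise‖)), you can translate an SNR in dB into a relative RMS noise level:

```python
# relative RMS noise level implied by an SNR value in dB
def rel_noise(snr_db: float) -> float:
    return 10 ** (-snr_db / 20.0)

# the three reported output SNRs
for snr in (21.42, 22.02, 31.8):
    print(f"{snr:5.2f} dB -> ~{100 * rel_noise(snr):.1f}% relative noise")
```

So ~21 dB corresponds to roughly 8-9% relative noise on the features/keypoints, while ~32 dB on the heatmap is closer to 2-3%.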

Understanding FCR format in your outputs

Your parse-hef shows mixed formats:

Output conv9:  UINT16, FCR(76x100x65)
Output conv27: UINT16, FCR(76x100x1)
Output conv24: UINT16, NHWC(76x100x64)

FCR is a Hailo-specific layout optimized for internal dataflow. It follows [N, H, W, C] ordering, but with the feature (channel) dimension padded to 8-byte boundaries on outputs. This is normal and expected; the compiler automatically chooses FCR for optimal performance. You don't need to recompile.
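If you do end up reading the raw buffer yourself, stripping the padding on the host is just a slice. The padded channel count below is purely illustrative (the real value depends on the FCR padding rule and the element size), but the recovery pattern is the same:

```python
import numpy as np

H, W, C = 76, 100, 65   # logical output shape (keypoints head)
C_PADDED = 68           # hypothetical padded channel count (illustration only)

# raw FCR buffer as read from the device: [H, W, C_padded]
raw = np.zeros((H, W, C_PADDED), dtype=np.uint16)

# drop the padding channels to recover the logical tensor
logical = raw[:, :, :C]
print(logical.shape)  # (76, 100, 65)
```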

How to handle FCR outputs

Recommended approach (for debugging):

Configure your output vstreams with:

  • format.type = HAILO_FORMAT_TYPE_FLOAT32 → HailoRT handles dequantization with correct per-channel parameters
  • format.order = HAILO_FORMAT_ORDER_NHWC → HailoRT converts FCR to NHWC automatically, handling padding

This gives you clean FLOAT32 NHWC outputs that match HAR directly.

Alternative (manual handling):

If using HAILO_FORMAT_TYPE_AUTO:

  • You get raw UINT16 data
  • Must call vstream.get_quant_infos() for per-channel parameters
  • Apply dequantization: (raw_value - qp_zp) * qp_scale for each channel
  • Account for padded width when indexing FCR tensors
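The per-channel dequantization step above can be sketched as follows (names are illustrative; the actual scales and zero points come from vstream.get_quant_infos()):

```python
import numpy as np

def dequantize_per_channel(raw: np.ndarray,
                           scales: np.ndarray,
                           zero_points: np.ndarray) -> np.ndarray:
    """raw: uint16 tensor [H, W, C]; one (scale, zero_point) pair per channel."""
    return (raw.astype(np.float32) - zero_points[None, None, :]) * scales[None, None, :]

# toy example: 1x1 spatial, 3 channels
raw = np.array([[[100, 200, 300]]], dtype=np.uint16)
scales = np.array([0.1, 0.02, 0.5], dtype=np.float32)
zps = np.array([50.0, 0.0, 100.0], dtype=np.float32)
print(dequantize_per_channel(raw, scales, zps))  # channel values 5.0, 4.0, 100.0
```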

Why HEF and HAR outputs don’t match

Two common issues:

  1. Per-channel quantization mismatch: With output_encoding_vector enabled, each output channel has its own scale/zero-point. If your comparison code assumes a shared scale/zero-point, or indexes the per-channel parameters incorrectly, values will diverge.

  2. Format interpretation: If your host code assumes NHWC but reads FCR tensors without accounting for padding, you'll get misaligned data.

Recommended next steps

✓ Verify HAR uses per-channel quantization (one QP per channel, correctly indexed)
✓ Use HAILO_FORMAT_TYPE_FLOAT32 + HAILO_FORMAT_ORDER_NHWC for clean comparisons
✓ If issues persist, try compiling without output_encoding_vector as a diagnostic test
✓ Minor run-to-run variations with 4 contexts are normal for feature tensors

Let me know if you need clarification on any of these points!

Many thanks @omria for your input. It seems like I am making progress.

I removed the per-channel scales. I still see some quantization accuracy issues from ONNX → HAR, but I guess I can address these (I used a lower optimization level of 2).

I am also using on-chip dequantization as you suggested (requesting float32). Although the HEF on-chip results are now much closer to the quantized HAR output, there is still a gap, and my gut feeling tells me this gap is too big.

ONNX vs HAR vs HEF yields:
———————————-
ONNX features 1.79381, -0.31311, 0.31845, 0.27129, 0.2671
HAR features 1.14987, -0.4099 , 0.53287, 0.29556, 0.29556
HEF features 1.09594 -0.420685 0.569543 0.284772 0.304188
max diff (ONNX vs HAR) 13.0424
mean diff (ONNX vs HAR) 0.211032

ONNX keypoints -1.84859, -3.11717, -1.93641, -2.12238, -1.73115
HAR keypoints -1.85608, -3.0851 , -1.99582, -2.18573, -1.83995
HEF keypoints -1.79875 -2.88445 -1.62138 -1.81846 -1.73246
max diff (ONNX vs HAR) 3.38995
mean diff (ONNX vs HAR) 0.25468

ONNX heatmap 0.14378, 0.14145, 0.13122, 0.11162, 0.09499
HAR heatmap 0.14112, 0.13639, 0.12235, 0.10791, 0.10086
HEF heatmap 0.144475 0.136387 0.123722 0.107608 0.0987579
max diff (ONNX vs HAR) 0.0895682
mean diff (ONNX vs HAR) 0.00568469

Is this HAR vs HEF gap expected?