Quantization Approach on Partitioned Model

Good day,

I am currently deploying a custom speech recognition model to Hailo-8, and I am encountering significant accuracy degradation after quantization. The model is a customized RNN architecture based on Mamba (state-space model). Since the compiler does not support the original stateful formulation, I rewrote the layers to be stateless. I manage the recurrent state tensors externally and pass them as inputs to each layer. Because of this design, I had to export each layer separately and compile them into individual HAR/HEF files.
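To make the restructuring concrete, here is a minimal toy sketch (not my actual Mamba layer; shapes and weights are made up for illustration) of how I rewrote a stateful recurrent step into a stateless function, with the state tensor managed on the host and passed in and out explicitly:

```python
import numpy as np

def stateless_ssm_step(x, h, A, B, C):
    """One recurrence step, stateless form: h' = A @ h + B @ x, y = C @ h'.
    The state h is an explicit input/output instead of an internal buffer,
    so the exported graph itself contains no recurrence."""
    h_next = A @ h + B @ x
    y = C @ h_next
    return y, h_next  # caller stores h_next and feeds it back next step

# Host-side driver: the recurrent state lives entirely outside the model.
rng = np.random.default_rng(0)
d_state, d_in = 4, 3
A = rng.normal(size=(d_state, d_state)) * 0.1
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_in, d_state))

h = np.zeros(d_state)                      # initial state, managed externally
for t in range(5):                         # time loop runs on the host
    x_t = rng.normal(size=d_in)
    y_t, h = stateless_ssm_step(x_t, h, A, B, C)
print(y_t.shape)  # (3,)
```

In the real pipeline each such function corresponds to one separately exported ONNX graph, and the host loop shuttles the state tensors between the compiled segments.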

I have completed the full inference pipeline, and functionally everything runs correctly. However, the quantized model produces severely degraded ASR outputs compared to the FP32 ONNX version, so the issue appears to be quantization accuracy rather than a bug in the pipeline itself.

For calibration:

  • Since each layer depends on the previous layer’s output, I generated the calibration dataset by running inference through the original FP32 ONNX models layer by layer.

  • I used the raw output of each layer as the calibration input for the next layer.

  • I did not apply any additional scaling or normalization.

  • The layers use RMSNorm.

This means that:

  • Each layer receives inputs with different value ranges.

  • No explicit range normalization was applied before calibration.
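The calibration flow above looks roughly like this sketch, where a plain numpy function stands in for the per-layer FP32 ONNX sessions (in the real pipeline each `run_layer` call would be an onnxruntime `InferenceSession.run`; shapes and layer count here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a per-layer FP32 ONNX session.
def run_layer(weights, x):
    return np.tanh(x @ weights)

layer_weights = [rng.normal(size=(8, 8)) for _ in range(3)]   # 3 toy "layers"
audio_batches = [rng.normal(size=(16, 8)) for _ in range(4)]  # calibration inputs

# One calibration set per compiled segment: layer k is calibrated on the
# raw FP32 outputs of layer k-1, with no rescaling or normalization.
calib_sets = [[] for _ in layer_weights]
for x in audio_batches:
    for k, w in enumerate(layer_weights):
        calib_sets[k].append(x)   # input exactly as layer k will see it
        x = run_layer(w, x)       # raw output becomes the next layer's input

# Each list is then stacked and passed to that segment's optimize/quantize step.
print([len(s) for s in calib_sets])  # [4, 4, 4]
```

This is why each segment ends up being calibrated on inputs with whatever dynamic range the previous FP32 layer happened to emit.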

Questions

  1. Is using raw per-layer outputs (with varying ranges) appropriate for calibration in this multi-HEF stateless setup?

  2. Are there recommended best practices for calibrating models that are split into multiple compiled segments?

Any guidance on improving quantization stability in this setup would be greatly appreciated.