Yes, it is possible to apply 4-bit quantization to the multi-head attention layers of transformer models such as DETR using the Hailo Dataflow Compiler (DFC). The model-script command `quantization_param(conv3, precision_mode=a8_w4)` works for transformer models as well, enabling 4-bit quantization of part of the model's weights. This can be especially beneficial for reducing the memory footprint and improving performance on constrained devices.
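For example, assuming the model has already been parsed into a HAR, a model script with per-layer `quantization_param` commands can be loaded through the DFC's Python API. This is only a minimal sketch: the layer names (`detr/conv3`, `detr/conv7`) and the HAR path are placeholders, not the real layer names of your parsed DETR model, so check them against the profiler report or the parsed graph first.

```python
# Minimal sketch: apply a8_w4 (8-bit activations, 4-bit weights) to selected layers.
# Layer names and the HAR path below are hypothetical placeholders.
from hailo_sdk_client import ClientRunner

runner = ClientRunner(har="detr.har")  # previously parsed model (path is a placeholder)

# One quantization_param line per layer that should use 4-bit weights.
# Attention projections typically appear as conv/dense layers in the parsed graph.
model_script = """
quantization_param(detr/conv3, precision_mode=a8_w4)
quantization_param(detr/conv7, precision_mode=a8_w4)
"""
runner.load_model_script(model_script)
```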
However, it’s important to note that:
- By default, the Hailo DFC applies 4-bit quantization to 20% of the model weights for large models.
- This is part of the optimization step; for transformer models, and the multi-head attention layers in particular, 4-bit quantization should be configured explicitly in the model's quantization settings.
- Hailo's model optimization tools let you further control this behavior, so the model maintains accuracy while still benefiting from the more aggressive weight quantization strategy (see the sketch after this list).
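As an illustration, a full optimization run that combines a global 4-bit weight ratio with an explicit per-layer override might look like the sketch below. Treat the `model_optimization_config(compression_params, auto_4bit_weights_ratio=...)` command, the layer name, and the calibration input shape as assumptions to be verified against your DFC version's user guide and your parsed model.

```python
import numpy as np
from hailo_sdk_client import ClientRunner

runner = ClientRunner(har="detr.har")  # hypothetical path to the parsed model

# Global compression target plus an explicit per-layer override.
# auto_4bit_weights_ratio=0.2 mirrors the 20% default mentioned above;
# verify the exact command and arguments in your DFC version.
model_script = """
model_optimization_config(compression_params, auto_4bit_weights_ratio=0.2)
quantization_param(detr/conv3, precision_mode=a8_w4)
"""
runner.load_model_script(model_script)

# Calibration data: random values are only a stand-in; use real,
# preprocessed images matching the model's input shape (NHWC assumed here).
calib_data = np.random.rand(64, 800, 800, 3).astype(np.float32)

runner.optimize(calib_data)
runner.save_har("detr_a8w4.har")
```

After optimizing, it is worth comparing accuracy against the fully 8-bit baseline before compiling, since aggressive weight quantization on attention layers can be more sensitive than on plain convolutions.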
Let me know if you need further help with specific configuration steps!