Yes, it is possible to apply 4-bit quantization to the multi-head attention layers of transformer models such as DETR using the Hailo Dataflow Compiler (DFC). The model-script command `quantization_param(conv3, precision_mode=a8_w4)` works for transformer models as well, enabling 4-bit quantization of part of the model's weights. This can be especially beneficial for reducing the memory footprint and improving performance on constrained devices.
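For example, assuming the model has already been parsed into a HAR, a model script with per-layer `quantization_param` commands can be loaded through the DFC's Python API. This is only a minimal sketch: the layer names (`detr/conv3`, `detr/conv7`) and the HAR path are placeholders, not the real layer names of your parsed DETR model, so check them against the profiler report or the parsed graph first.

```python
# Minimal sketch: apply a8_w4 (8-bit activations, 4-bit weights) to selected layers.
# Layer names and the HAR path below are hypothetical placeholders.
from hailo_sdk_client import ClientRunner

runner = ClientRunner(har="detr.har")  # previously parsed model (path is a placeholder)

# One quantization_param line per layer that should use 4-bit weights.
# Attention projections typically appear as conv/dense layers in the parsed graph.
model_script = """
quantization_param(detr/conv3, precision_mode=a8_w4)
quantization_param(detr/conv7, precision_mode=a8_w4)
"""
runner.load_model_script(model_script)
```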
However, it’s important to note that:
- By default, the Hailo DFC applies 4-bit quantization to 20% of the model weights for large models.
- This is part of the optimization step; for transformer models, and the multi-head attention layers in particular, 4-bit quantization should be configured explicitly in the model's quantization settings.
- Hailo's model optimization tools let you further control this behavior, so the model maintains accuracy while still benefiting from the more aggressive weight quantization strategy (see the sketch after this list).
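As an illustration, a full optimization run that combines a global 4-bit weight ratio with an explicit per-layer override might look like the sketch below. Treat the `model_optimization_config(compression_params, auto_4bit_weights_ratio=...)` command, the layer name, and the calibration input shape as assumptions to be verified against your DFC version's user guide and your parsed model.

```python
import numpy as np
from hailo_sdk_client import ClientRunner

runner = ClientRunner(har="detr.har")  # hypothetical path to the parsed model

# Global compression target plus an explicit per-layer override.
# auto_4bit_weights_ratio=0.2 mirrors the 20% default mentioned above;
# verify the exact command and arguments in your DFC version.
model_script = """
model_optimization_config(compression_params, auto_4bit_weights_ratio=0.2)
quantization_param(detr/conv3, precision_mode=a8_w4)
"""
runner.load_model_script(model_script)

# Calibration data: random values are only a stand-in; use real,
# preprocessed images matching the model's input shape (NHWC assumed here).
calib_data = np.random.rand(64, 800, 800, 3).astype(np.float32)

runner.optimize(calib_data)
runner.save_har("detr_a8w4.har")
```

After optimizing, it is worth comparing accuracy against the fully 8-bit baseline before compiling, since aggressive weight quantization on attention layers can be more sensitive than on plain convolutions.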
Let me know if you need further help with specific configuration steps!