Big SNR drop after precision change right before attention matmul

Hello Hailo community

I am trying to run a LightGlue model on Hailo-8, and although I managed to parse it by splitting it into 3 chunks (or even at the level of the self- and cross-attention blocks), I am now having issues with accuracy. This topic is related to this previous one.

Currently the final descriptors at the 9th (final) transformer layer have an SNR of approximately 7 dB. It is difficult to analyze the whole network, or even one of the 3 chunks I split it into, in the profiler due to the large number of layers, but at this level (or at the level of individual attention blocks) I identified a big drop in SNR after a precision_change layer, right after the softmax and before the matmul that combines the V values. In the alls script I specified that the matmul layers should run with a16_w16 precision, so I am not sure why this change to 8 bits happens. According to the Hailo profiler, around 91% of the weights are in 16-bit precision.
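For reference, the 16-bit request in my alls script is along these lines (the `{matmul*}` wildcard is my shorthand for the matmul layer names; the exact names in my script differ):

```
quantization_param({matmul*}, precision_mode=a16_w16)
```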

Here are the two immediate paths, with their SNRs, feeding this matmul layer in the first self-attention block:

```
ew_mult_softmax2 (34 dB) → precision_change6 (26.63 dB) ──┐
                                                          ├─→ matmul6 (30 dB)
conv_feature_splitter1_2 (36.33 dB) ──────────────────────┘
```

Since 18 of these attention blocks are chained sequentially in the whole network, the final accuracy suffers substantially. Can you recommend any solutions?
I think I will try computing the softmax and the following matmul on the host CPU in high precision, to check whether that at least recovers most of the original accuracy. What do you think? Any other ideas?
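To make the host-fallback idea concrete, here is a minimal NumPy sketch of what I have in mind: pull the raw attention scores off the device and finish the softmax + V matmul in float32 on the host. Function names and shapes are illustrative, not Hailo API calls:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_tail_on_host(scores: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Finish an attention block in float32 on the host:
    softmax over the key axis, then the matmul with V."""
    attn = softmax(scores.astype(np.float32), axis=-1)
    return attn @ v.astype(np.float32)

# Toy shapes: (heads, queries, keys) scores and (heads, keys, dim) values.
scores = np.random.default_rng(1).standard_normal((4, 16, 16))
v = np.random.default_rng(2).standard_normal((4, 16, 64))
out = attention_tail_on_host(scores, v)
print(out.shape)  # (4, 16, 64)
```

The trade-off is the extra device-to-host transfer per attention block, so this is mainly a diagnostic to see how much accuracy the 8-bit softmax path is costing.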

Thanks a lot in advance

Hi,

There’s a caveat about 16-bit usage in the Hailo architecture: elementwise multiplication is constrained to 8-bit, hence the forced precision changes (their exact location varies somewhat with internal implementation details). Recommendations:

  • Evaluate bottom-line task performance. 7 dB sounds low, but many applications still perform decently in practice, because de-facto pooling or peak-extraction ops in the postprocessing stage act as noise reduction.
  • Experiment with the calibration set size. Attention layers often have wide activation ranges, which, depending on the case, may be better served by fully accommodating ranges or, conversely, by outlier-clipping ranges.
  • If performance is still unsatisfactory, experiment with higher “optimization levels” (requires a GPU and some patience :)).
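As a side note, the per-layer SNR figures discussed in this thread can be reproduced offline by comparing a float reference against the corresponding quantized output. A minimal sketch (synthetic data standing in for real layer activations):

```python
import numpy as np

def snr_db(reference: np.ndarray, quantized: np.ndarray) -> float:
    """Signal-to-noise ratio in dB: 10*log10(signal power / noise power),
    where noise is the difference between reference and quantized tensors."""
    noise = reference - quantized
    signal_power = np.mean(reference.astype(np.float64) ** 2)
    noise_power = np.mean(noise.astype(np.float64) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)

rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)
# Simulate quantization noise at roughly -30 dB relative power.
noisy = ref + 0.0316 * rng.standard_normal(1000)
print(round(snr_db(ref, noisy), 1))
```

Running this on the actual layer outputs (native vs. quantized emulation) lets you track where the degradation accumulates across the 18 attention blocks.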

Good luck!

1 Like