LightGlue self-attention block - parsed model produces different outputs

Hi Hailo Community

I am trying to run the LightGlue model on Hailo 8 accelerators.

I had to leave aside the positional encoding and some final layers related to match filtering, but I managed to parse the model after changing how the keypoint encodings are passed.

However, when I compared the parsed HAR model with the associated ONNX model, the outputs differ quite a lot for the same input when I load the pre-trained weights; when the model is randomly initialized, the outputs are much closer.

I isolated the problem to the self-attention block of LightGlue, and below is the code to reproduce these output discrepancies.

Environment:

HailoRT v4.23.0
Hailo Dataflow Compiler v3.33.0

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import onnxruntime as ort
from hailo_sdk_client import ClientRunner, InferenceContext

# ==========================================
# Model Definitions (Unchanged Logic)
# ==========================================

class Attention(nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, q, k, v) -> torch.Tensor:
        return F.scaled_dot_product_attention(q, k, v)

class SelfBlock(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, bias: bool = True) -> None:
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.Wqkv = nn.Linear(embed_dim, 3 * embed_dim, bias=bias)
        self.inner_attn = Attention()
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.ffn = nn.Sequential(
            nn.Linear(2 * embed_dim, 2 * embed_dim),
            nn.LayerNorm(2 * embed_dim, elementwise_affine=True),
            nn.GELU(),
            nn.Linear(2 * embed_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor, sines: torch.Tensor, cosines: torch.Tensor) -> torch.Tensor:
        cosines = cosines.unsqueeze(1)
        sines = sines.unsqueeze(1)
        encodings = torch.stack([cosines, sines], 0)
        return self.forward_original(x, encodings)

    def forward_original(self, x: torch.Tensor, encodings: torch.Tensor) -> torch.Tensor:
        batch = x.shape[0]
        qkv: torch.Tensor = self.Wqkv(x)
        qkv = qkv.reshape(batch, -1, self.num_heads, self.head_dim, 3)
        qkv = qkv.transpose(1, 2)
        q, k, v = qkv[..., 0], qkv[..., 1], qkv[..., 2]
        q = self.apply_cached_rotary_emb(encodings, q)
        k = self.apply_cached_rotary_emb(encodings, k)
        context = self.inner_attn(q, k, v)
        context = context.transpose(1, 2)
        context = context.reshape(batch, -1, self.embed_dim)
        message = self.out_proj(context)
        return x + self.ffn(torch.cat((x, message), -1))

    def rotate_half(self, t: torch.Tensor) -> torch.Tensor:
        batch = t.shape[0]
        t = t.reshape(batch, self.num_heads, -1, self.head_dim // 2, 2)
        t = torch.stack((-t[..., 1], t[..., 0]), dim=-1)
        t = t.reshape(batch, self.num_heads, -1, self.head_dim)
        return t

    def apply_cached_rotary_emb(
        self, freqs: torch.Tensor, t: torch.Tensor
    ) -> torch.Tensor:
        return (t * freqs[0]) + (self.rotate_half(t) * freqs[1])

# ==========================================
# Helper Functions
# ==========================================

def load_lightglue_weights(model: nn.Module) -> None:
    """Downloads LightGlue weights and loads the first SelfBlock layer into the model."""
    url = "https://github.com/cvg/LightGlue/releases/download/{}/{}_lightglue.pth"
    version = "v0.1_arxiv"
    features = "superpoint"
    fname = f"lightglue_{version.replace('.', '-')}_{features}.pth"  # only replace dots in the version, keep the .pth extension
    
    state_dict = torch.hub.load_state_dict_from_url(
        url.format(version, features), file_name=fname
    )

    prefix = "self_attn.0."
    self_block_state_dict = {
        k[len(prefix):]: v 
        for k, v in state_dict.items() 
        if k.startswith(prefix)
    }

    load_info = model.load_state_dict(self_block_state_dict, strict=False)

    print("\n=== LOAD SUMMARY FOR SELFBLOCK ===")
    print(f"Total layers in SelfBlock: {len(model.state_dict())}")
    print(f"Successfully loaded:       {len(model.state_dict()) - len(load_info.missing_keys)}")
    print(f"Missing keys:              {len(load_info.missing_keys)}")
    print("==================================\n")

# ==========================================
# Main Execution
# ==========================================

def main():
    # --- Configurations ---
    load_weights = True
    embed_dim = 256
    num_heads = 4
    head_dim = embed_dim // num_heads
    batch_size = 3
    seq_len = 200
    
    onnx_path = "self_attn.onnx"
    hailo_model_har_name = "self_attn.har"
    model_name = "self_attn"
    input_names = ["desc", "sines", "cosines"]
    output_names = ["new_desc"]

    # --- Initialize & Load Model ---
    self_attn = SelfBlock(embed_dim, num_heads)
    if load_weights:
        load_lightglue_weights(self_attn)
    self_attn.eval()

    # --- Generate Dummy Data ---
    desc = torch.randn((batch_size, seq_len, embed_dim))
    sines = torch.rand((batch_size, seq_len, head_dim)) * 2 - 1
    cosines = torch.rand((batch_size, seq_len, head_dim)) * 2 - 1

    # --- Export to ONNX ---
    torch.onnx.export(
        self_attn,
        (desc, sines, cosines),
        onnx_path,
        opset_version=17,
        do_constant_folding=True,
        input_names=input_names,
        output_names=output_names,
        dynamic_axes=None,
    )

    # --- Hailo Translation ---
    runner = ClientRunner(hw_arch="hailo8")
    hn, npz = runner.translate_onnx_model(
        onnx_path,
        model_name,
        start_node_names=input_names,
        end_node_names=output_names,
    )
    output_names = ["new_desc"]  # translate_onnx_model overwrote the list, so restore it for onnxruntime
    runner.save_har(hailo_model_har_name)

    # --- ONNX Runtime Inference ---
    onnx_input_data = {
        "desc": desc.numpy(),
        "sines": sines.numpy(),
        "cosines": cosines.numpy(),
    }
    
    session = ort.InferenceSession(onnx_path, providers=['CPUExecutionProvider'])
    onnx_outputs = session.run(output_names, onnx_input_data)

    # --- Hailo Map Inputs ---
    hn_model = runner.get_hn_model()
    hailo_input_data = {}
    
    for layer in hn_model.get_input_layers():
        onnx_name = layer.original_names[0]
        if onnx_name in onnx_input_data:
            # Hailo expects an extra dimension for spatial data
            hailo_input_data[layer.name] = np.expand_dims(onnx_input_data[onnx_name], axis=1)
            print(f"Mapped ONNX '{onnx_name}' -> Hailo '{layer.name}'")
        else:
            print(f"WARNING: Unknown input requirement: {onnx_name}")

    # --- Hailo Inference ---
    with runner.infer_context(InferenceContext.SDK_NATIVE) as ctx:
        hailo_outputs = runner.infer(ctx, hailo_input_data)

    # --- Comparison ---
    # Squeeze the singleton axis that Hailo adds before comparing with the ONNX output
    error = np.abs(hailo_outputs.squeeze(1) - onnx_outputs[0])
    
    print(f"\nError Max:  {np.max(error):.6f}")
    print(f"Error Mean: {np.mean(error):.6f}")

if __name__ == "__main__":
    main()

When I don’t load the weights, I get these discrepancies:

Error Max: 0.016217
Error Mean: 0.002340

And when I load them, these ones:
Error Max: 4.696563
Error Mean: 0.490858

Interestingly, when I change the number of heads from 4 to 1, I can still load the weights without problems and the discrepancies are nearly zero.

What could be the reason for these discrepancies? Is the math of the parsed graph different from the original one, or is it just a matter of approximated functions that can be mitigated later with calibration data?
Any help that allows me to run this LightGlue model on Hailo 8 or newer accelerators would be very much appreciated. Thanks in advance!

Hi, Alex here.

This is very likely a solvable pre-processing or similar consistency issue. Ideally, for simplicity, split the ONNX into pre-processing / post-processing / neural parts and parse the neural part end-to-end (without start/end nodes), so that the HAR corresponds directly to a specific ONNX.

Please verify that:

(A) The HAR has the same structure and I/O as the ONNX it replaces (view both in Netron)

(B) The input is the same up to the appropriate transpose (ONNX is BCHW, Hailo is BHWC)

(C) Normalize the error by the signal magnitude for a fair comparison
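For (C), something along these lines (a minimal sketch; `normalized_error` is just an illustrative name):

```python
import numpy as np

def normalized_error(ref: np.ndarray, test: np.ndarray) -> float:
    # L2 error normalized by the reference signal's L2 norm, so the metric
    # is comparable across tensors with very different magnitudes.
    return float(np.linalg.norm(test - ref) / (np.linalg.norm(ref) + 1e-12))
```

An absolute max error of 4.7 means little without knowing the scale of the activations; the normalized value tells you how much of the signal is actually lost.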


Hi Alex,

Thanks for the help. It was difficult to spot structural differences in Netron, but the discrepancies started to appear around “apply_cached_rotary_emb”, at least in the tensor shapes.

In the end I managed to produce an equivalent HAR model by using only 3D tensors. At some point the model was using 4D and even 5D tensors, and I suspect the Hailo DFC may have difficulties with operations on 5D tensors, at least with reshapes and the like.

The custom self-attention block below can load the original weights, and its parsed model is equivalent to the original one.

class LinearRotateHalf(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        self.embed_dim = embed_dim
        
        # Pre-compute the rotation matrix M
        mask = torch.zeros((embed_dim, embed_dim))
        
        for i in range(0, embed_dim, 2):
            mask[i+1, i] = -1.0
            mask[i, i+1] = 1.0
            
        # Register as a buffer so it's saved with the model but not trained
        self.register_buffer('rotation_matrix', mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x is [B, N, D]
        return torch.matmul(x, self.rotation_matrix)
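As a quick self-contained check (with illustrative sizes), the precomputed matrix reproduces the pairwise (-x1, x0) interleaving of the original rotate_half:

```python
import torch

# Build the same rotation matrix as LinearRotateHalf, for a small embed_dim.
embed_dim = 8
M = torch.zeros(embed_dim, embed_dim)
for i in range(0, embed_dim, 2):
    M[i + 1, i] = -1.0
    M[i, i + 1] = 1.0

x = torch.randn(2, 5, embed_dim)
y_matmul = x @ M  # the 3D matmul formulation

# Reference: the stack-based rotate_half, pairing adjacent channels as (-x1, x0).
pairs = x.reshape(2, 5, embed_dim // 2, 2)
y_stack = torch.stack((-pairs[..., 1], pairs[..., 0]), dim=-1).reshape(2, 5, embed_dim)

assert torch.allclose(y_matmul, y_stack)
```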


class CustomSelfBlock(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, bias: bool = True) -> None:
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.Wqkv = nn.Linear(embed_dim, 3 * embed_dim, bias=bias)
        self.inner_attention = CustomAttention(num_heads=num_heads)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.ffn = nn.Sequential(
            nn.Linear(2 * embed_dim, 2 * embed_dim),
            nn.LayerNorm(2 * embed_dim, elementwise_affine=True),
            nn.GELU(),
            nn.Linear(2 * embed_dim, embed_dim),
        )
        self.rotate_half_mat = LinearRotateHalf(embed_dim)

    def forward(self, x: torch.Tensor, s: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        W = self.Wqkv.weight.view(self.num_heads, self.head_dim, 3, self.embed_dim)
        b = self.Wqkv.bias.view(self.num_heads, self.head_dim, 3)
        
        Wq = W[:, :, 0, :].reshape(self.embed_dim, self.embed_dim)
        Wk = W[:, :, 1, :].reshape(self.embed_dim, self.embed_dim)
        Wv = W[:, :, 2, :].reshape(self.embed_dim, self.embed_dim)
        
        bq = b[:, :, 0].reshape(self.embed_dim)
        bk = b[:, :, 1].reshape(self.embed_dim)
        bv = b[:, :, 2].reshape(self.embed_dim)
        
        # Now apply three separate linear projections. 
        q = F.linear(x, Wq, bq) # [B, N, D]
        k = F.linear(x, Wk, bk) # [B, N, D]
        v = F.linear(x, Wv, bv) # [B, N, D]

        q = self.apply_cached_rotary_emb(s, c, q)
        k = self.apply_cached_rotary_emb(s, c, k)
        context = self.inner_attention(q, k, v)
        message = self.out_proj(context)
        
        return x + self.ffn(torch.cat((x, message), -1))

    def apply_cached_rotary_emb(
        self, s: torch.Tensor, c: torch.Tensor, t: torch.Tensor
    ) -> torch.Tensor:
        return (t * s) + (self.rotate_half_mat(t) * c)
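As a sanity check that the weight splitting in forward is correct, this self-contained sketch (with illustrative sizes) verifies that the standalone Q projection reproduces the fused Wqkv route:

```python
import torch
import torch.nn.functional as F

embed_dim, num_heads = 8, 2
head_dim = embed_dim // num_heads
lin = torch.nn.Linear(embed_dim, 3 * embed_dim)
x = torch.randn(3, 5, embed_dim)

# Original route: fused projection, then per-head slicing (as in SelfBlock),
# with the heads flattened back onto the channel axis for comparison.
qkv = lin(x).reshape(3, -1, num_heads, head_dim, 3).transpose(1, 2)
q_ref = qkv[..., 0].transpose(1, 2).reshape(3, -1, embed_dim)

# Split route: carve the Q weight and bias out of the fused parameters,
# exactly as CustomSelfBlock.forward does.
W = lin.weight.view(num_heads, head_dim, 3, embed_dim)
b = lin.bias.view(num_heads, head_dim, 3)
Wq = W[:, :, 0, :].reshape(embed_dim, embed_dim)
bq = b[:, :, 0].reshape(embed_dim)
q_split = F.linear(x, Wq, bq)  # [B, N, D]

assert torch.allclose(q_ref, q_split, atol=1e-6)
```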

By doing something similar for the cross-attention block, I could finally parse the whole LightGlue model, although only in 3 chunks due to another parsing error.

However, the problem I have now is with the optimization and compilation phases.
With optimization level 4, the optimization alone took around 5 hours, with a dataset of 237 images on CPU. But the big problem is the compilation: after 19 hours it still hadn’t finished. I stopped the compilation and reran the two phases with this ALLS script (taking inspiration from the Swin-Small ALLS script):

    model_script = [
        "post_quantization_optimization(finetune, policy=enabled, batch_size=4)",
        "pre_quantization_optimization(ew_add_fusing, policy=disabled)",
        "model_optimization_flavor(optimization_level=0, compression_level=0)",
        "performance_param(compiler_optimization_level=0)"
    ]

Even with this fast optimization level, the compilation phase has already been running for 36 minutes without finishing. Do you recommend a different ALLS script for this kind of model?

Any help towards a well-quantized and quickly compiled model would be very much appreciated.

Kind regards,
Fernando

In case it is of any help, I paste here the logs of the compilation phase, which I killed after 19 hours; only the beginning and the end, since it is a very large file.

[2026-03-04 13:57:49.189] [default] [info] Loading network parameters
[2026-03-04 13:57:49.971] [default] [info] Starting Hailo allocation and compilation flow
[2026-03-04 13:57:49.993] [default] [info] Model name: matcher_chunk1
[2026-03-04 13:58:06.851] [default] [info] Building optimization options for network layers...
[2026-03-04 13:58:15.881] [default] [info] Successfully built optimization options - 9s 29ms
[2026-03-04 13:58:15.886] [default] [info] Trying to compile the network in a single context
[2026-03-04 13:58:15.886] [default] [info] Trying to solve in single context
[2026-03-04 13:58:15.961] [default] [info] Single context flow failed: Recoverable single context error
[2026-03-04 13:58:16.003] [default] [info] Building optimization options for network layers...
[2026-03-04 13:58:28.863] [default] [info] Successfully built optimization options - 12s 860ms
[2026-03-04 13:58:28.864] [default] [info] Using Multi-context flow
[2026-03-04 13:58:28.864] [default] [info] Resources optimization params: max_control_utilization=60%, max_compute_utilization=60%, max_compute_16bit_utilization=60%, max_memory_utilization (weights)=60%, max_input_aligner_utilization=60%, max_apu_utilization=60%
[2026-03-04 13:58:28.866] [default] [info] Finding the best partition to contexts...
[2026-03-04 13:58:42.753] [default] [info] Iteration failed on: Too many inputs/outputs for matcher_chunk1_context_0, try to reduce number of inputs or outputs
Number of DDRs: 0
Number of inputs: 6
Number of outputs: 37

[2026-03-04 13:58:43.212] [default] [info] Iteration failed on: Automri finished with too many resources on context_1
[2026-03-04 13:58:44.887] [default] [info] Iteration failed on: Automri finished with too many resources on context_2
[2026-03-04 13:58:45.365] [default] [info] Iteration failed on: Automri finished with too many resources on context_3
[2026-03-04 13:58:45.808] [default] [info] Iteration failed on: Automri finished with too many resources on context_4
[2026-03-04 13:58:46.251] [default] [info] Iteration failed on: Automri finished with too many resources on context_5
[2026-03-04 13:58:46.595] [default] [info] Iteration failed on: Automri finished with too many resources on context_6
[2026-03-04 13:58:47.026] [default] [info] Iteration failed on: Automri finished with too many resources on context_7
[2026-03-04 13:58:47.453] [default] [info] Iteration failed on: Automri finished with too many resources on context_8
[2026-03-04 13:58:47.894] [default] [info] Iteration failed on: Automri finished with too many resources on context_9
[2026-03-04 13:58:48.351] [default] [info] Iteration failed on: Automri finished with too many resources on context_10
[2026-03-04 13:58:57.759] [default] [info] Iteration failed on: Too many inputs/outputs for matcher_chunk1_context_0, try to reduce number of inputs or outputs
Number of DDRs: 0
Number of inputs: 6
Number of outputs: 38

[2026-03-04 13:58:58.246] [default] [info] Iteration failed on: Automri finished with too many resources on context_1
[2026-03-04 13:58:58.694] [default] [info] Iteration failed on: Automri finished with too many resources on context_2
[2026-03-04 13:58:59.151] [default] [info] Iteration failed on: Automri finished with too many resources on context_3
[2026-03-04 13:58:59.564] [default] [info] Iteration failed on: Automri finished with too many resources on context_4
[2026-03-04 13:59:00.006] [default] [info] Iteration failed on: Automri finished with too many resources on context_5
[2026-03-04 13:59:00.352] [default] [info] Iteration failed on: Automri finished with too many resources on context_6
[2026-03-04 13:59:00.822] [default] [info] Iteration failed on: Automri finished with too many resources on context_7
[2026-03-04 13:59:01.282] [default] [info] Iteration failed on: Automri finished with too many resources on context_8
[2026-03-04 13:59:01.757] [default] [info] Iteration failed on: Automri finished with too many resources on context_9
[2026-03-04 13:59:02.262] [default] [info] Iteration failed on: Automri finished with too many resources on context_10
[2026-03-04 13:59:12.094] [default] [info] Iteration failed on: Too many inputs/outputs for matcher_chunk1_context_0, try to reduce number of inputs or outputs
Number of DDRs: 0
Number of inputs: 6
Number of outputs: 39

[2026-03-04 13:59:13.139] [default] [info] Iteration failed on: Automri finished with too many resources on context_1
[2026-03-04 13:59:13.595] [default] [info] Iteration failed on: Automri finished with too many resources on context_2
[2026-03-04 13:59:14.034] [default] [info] Iteration failed on: Automri finished with too many resources on context_3
[2026-03-04 13:59:14.453] [default] [info] Iteration failed on: Automri finished with too many resources on context_4
[2026-03-04 13:59:14.916] [default] [info] Iteration failed on: Automri finished with too many resources on context_5
[2026-03-04 13:59:15.265] [default] [info] Iteration failed on: Automri finished with too many resources on context_6
[2026-03-04 13:59:15.699] [default] [info] Iteration failed on: Automri finished with too many resources on context_7
[2026-03-04 13:59:16.129] [default] [info] Iteration failed on: Automri finished with too many resources on context_8
[2026-03-04 13:59:16.580] [default] [info] Iteration failed on: Automri finished with too many resources on context_9
[2026-03-04 13:59:17.031] [default] [info] Iteration failed on: Automri finished with too many resources on context_10
[2026-03-04 13:59:26.171] [default] [info] Iteration failed on: Too many inputs/outputs for matcher_chunk1_context_0, try to reduce number of inputs or outputs
Number of DDRs: 0
Number of inputs: 6
Number of outputs: 40

[2026-03-04 13:59:26.644] [default] [info] Iteration failed on: Automri finished with too many resources on context_1
[2026-03-04 13:59:27.113] [default] [info] Iteration failed on: Automri finished with too many resources on context_2
[2026-03-04 13:59:27.554] [default] [info] Iteration failed on: Automri finished with too many resources on context_3
[2026-03-04 13:59:27.965] [default] [info] Iteration failed on: Automri finished with too many resources on context_4
[2026-03-04 13:59:28.424] [default] [info] Iteration failed on: Automri finished with too many resources on context_5
[2026-03-04 13:59:28.775] [default] [info] Iteration failed on: Automri finished with too many resources on context_6
[2026-03-04 13:59:29.221] [default] [info] Iteration failed on: Automri finished with too many resources on context_7
[2026-03-04 13:59:29.661] [default] [info] Iteration failed on: Automri finished with too many resources on context_8
[2026-03-04 13:59:30.132] [default] [info] Iteration failed on: Automri finished with too many resources on context_9
[2026-03-04 13:59:30.640] [default] [info] Iteration failed on: Automri finished with too many resources on context_10
[2026-03-04 13:59:40.098] [default] [info] Iteration failed on: Too many inputs/outputs for matcher_chunk1_context_0, try to reduce number of inputs or outputs
Number of DDRs: 0
Number of inputs: 6
Number of outputs: 40

[2026-03-04 13:59:41.194] [default] [info] Iteration failed on: Automri finished with too many resources on context_1
[2026-03-04 13:59:41.691] [default] [info] Iteration failed on: Automri finished with too many resources on context_2
[2026-03-04 13:59:42.168] [default] [info] Iteration failed on: Automri finished with too many resources on context_3
[2026-03-04 13:59:42.632] [default] [info] Iteration failed on: Automri finished with too many resources on context_4
[2026-03-04 13:59:43.112] [default] [info] Iteration failed on: Automri finished with too many resources on context_5
[2026-03-04 13:59:43.501] [default] [info] Iteration failed on: Automri finished with too many resources on context_6
[2026-03-04 13:59:44.632] [default] [info] Iteration failed on: Automri finished with too many resources on context_7
[2026-03-04 13:59:44.990] [default] [info] Iteration failed on: Automri finished with too many resources on context_8
[2026-03-04 13:59:45.336] [default] [info] Iteration failed on: Automri finished with too many resources on context_9
[2026-03-04 13:59:45.692] [default] [info] Iteration failed on: Automri finished with too many resources on context_10
[2026-03-04 13:59:55.033] [default] [info] Iteration failed on: Too many inputs/outputs for matcher_chunk1_context_0, try to reduce number of inputs or outputs
Number of DDRs: 0
Number of inputs: 6
Number of outputs: 39

[2026-03-04 13:59:55.412] [default] [info] Iteration failed on: Automri finished with too many resources on context_1
[2026-03-04 13:59:55.764] [default] [info] Iteration failed on: Automri finished with too many resources on context_2
[2026-03-04 13:59:56.107] [default] [info] Iteration failed on: Automri finished with too many resources on context_3
[2026-03-04 13:59:56.414] [default] [info] Iteration failed on: Automri finished with too many resources on context_4
[2026-03-04 13:59:56.787] [default] [info] Iteration failed on: Automri finished with too many resources on context_5
[2026-03-04 13:59:57.170] [default] [info] Iteration failed on: Automri finished with too many resources on context_6
[2026-03-04 13:59:57.539] [default] [info] Iteration failed on: Automri finished with too many resources on context_7
[2026-03-04 13:59:57.920] [default] [info] Iteration failed on: Automri finished with too many resources on context_8
[2026-03-04 13:59:58.296] [default] [info] Iteration failed on: Automri finished with too many resources on context_9
[2026-03-04 13:59:58.684] [default] [info] Iteration failed on: Automri finished with too many resources on context_10
[2026-03-04 14:00:07.974] [default] [info] Iteration failed on: Too many inputs/outputs for matcher_chunk1_context_0, try to reduce number of inputs or outputs
Number of DDRs: 0
Number of inputs: 6
Number of outputs: 39

[OMITTED PART]

[2026-03-05 09:16:17.687] [default] [info] Iteration failed on: Automri finished with too many resources on context_37
[2026-03-05 09:16:19.589] [default] [info] Iteration failed on: Automri finished with too many resources on context_38
[2026-03-05 09:16:21.484] [default] [info] Iteration failed on: Automri finished with too many resources on context_39
[2026-03-05 09:17:01.815] [default] [info] Iteration failed on: Too many inputs/outputs for matcher_chunk1_context_36, try to reduce number of inputs or outputs
Number of DDRs: 0
Number of inputs: 25
Number of outputs: 16

[2026-03-05 09:17:03.907] [default] [info] Iteration failed on: Automri finished with too many resources on context_37
[2026-03-05 09:17:05.835] [default] [info] Iteration failed on: Automri finished with too many resources on context_38
[2026-03-05 09:17:07.731] [default] [info] Iteration failed on: Automri finished with too many resources on context_39
[2026-03-05 09:17:44.281] [default] [info] Iteration failed on: Too many inputs/outputs for matcher_chunk1_context_36, try to reduce number of inputs or outputs
Number of DDRs: 0
Number of inputs: 25
Number of outputs: 17

[2026-03-05 09:17:46.218] [default] [info] Iteration failed on: Automri finished with too many resources on context_37
[2026-03-05 09:17:48.026] [default] [info] Iteration failed on: Automri finished with too many resources on context_38
[2026-03-05 09:17:49.860] [default] [info] Iteration failed on: Automri finished with too many resources on context_39
[2026-03-05 09:18:27.299] [default] [info] Iteration failed on: Too many inputs/outputs for matcher_chunk1_context_36, try to reduce number of inputs or outputs
Number of DDRs: 0
Number of inputs: 25
Number of outputs: 18

[2026-03-05 09:18:29.326] [default] [info] Iteration failed on: Automri finished with too many resources on context_37
[2026-03-05 09:18:31.195] [default] [info] Iteration failed on: Automri finished with too many resources on context_38
[2026-03-05 09:18:33.054] [default] [info] Iteration failed on: Automri finished with too many resources on context_39
[2026-03-05 09:19:07.915] [default] [info] Iteration failed on: Too many inputs/outputs for matcher_chunk1_context_36, try to reduce number of inputs or outputs
Number of DDRs: 0
Number of inputs: 25
Number of outputs: 18

[2026-03-05 09:19:09.963] [default] [info] Iteration failed on: Automri finished with too many resources on context_37
[2026-03-05 09:19:11.845] [default] [info] Iteration failed on: Automri finished with too many resources on context_38
[2026-03-05 09:19:13.738] [default] [info] Iteration failed on: Automri finished with too many resources on context_39
[2026-03-05 09:19:53.477] [default] [info] Iteration failed on: Too many inputs/outputs for matcher_chunk1_context_36, try to reduce number of inputs or outputs
Number of DDRs: 0
Number of inputs: 25
Number of outputs: 17

It seems there were too many inputs/outputs between contexts and the compiler couldn’t find a solution. Finally, by compiling the individual self- and cross-attention blocks I managed to get successful compilations. I just need to combine them during inference to obtain an equivalent full model.

Also, I have a few more questions:

  1. What is the limit on inputs/outputs between contexts? I saw messages in the logs saying the current ones were too many, but I am not sure what the limit actually is.

  2. Do you think a model like LightGlue will suffer a large accuracy drop when running on Hailo 8?

  3. I am considering using some kind of attention mask to introduce dummy keypoints, so that the input shape is always the same (as required by Hailo). But I guess the large negative values introduced by this mask before the softmax will negatively impact the quantization. What is the recommended way to do this masking?
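The masking I have in mind is roughly this (a minimal sketch with illustrative shapes; the dummy keypoints get a large negative additive bias on their attention logits so the softmax assigns them near-zero weight):

```python
import torch

B, H, N, D = 1, 2, 8, 16
q, k, v = (torch.randn(B, H, N, D) for _ in range(3))
valid = torch.tensor([1., 1., 1., 1., 1., 0., 0., 0.])  # last 3 = dummy keypoints

# Standard scaled dot-product attention with an additive key mask:
logits = (q @ k.transpose(-2, -1)) / D ** 0.5   # [B, H, N, N]
logits = logits + (valid - 1.0) * 1e4           # -1e4 on dummy-key columns
attn = torch.softmax(logits, dim=-1)            # dummy keys get ~zero weight
out = attn @ v
```

My worry is that the -1e4 bias stretches the logits’ dynamic range enormously, which seems hostile to 8-bit quantization of the tensors entering the softmax.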

Hi, glad to hear you made progress with your project!

Regarding static shapes, yes, this is necessary at the moment; the usual approach is a custom export to ONNX (from the original repo) that forces fixed shapes and drops all dynamic elements. I’m not sure I follow your masking idea.

Re the accuracy drop, the bulk of the network looks similar to a ViT, so that should be a decent reference in terms of accuracy, though the specifics of the task could matter too.

Regarding compilation and inter-context I/O, I don’t think there’s a hard limit on the number of streams, but some internal constraints might show up…

Hi again,
Now I am having accuracy problems. I am doing sequential calibration, using the outputs of each quantized block to calibrate the subsequent block. The model comprises 9 transformer layers, each with two self-attention blocks running in parallel (one per image), followed by a cross-attention block.
Apart from this sequential calibration, I am trying to optimize the model with as many layers in 16-bit precision as I can, but I see significant drops in SNR at the output of some blocks:

Here is the config script I am using for each block (the blocks it applies to are indicated by the comment after each line):

model_script = [
    "post_quantization_optimization(finetune, policy=enabled, batch_size=4)", # all
    "pre_quantization_optimization(ew_add_fusing, policy=disabled)", # all
    "model_optimization_flavor(optimization_level=2, compression_level=0)", # all
    "quantization_param({*matmul*}, precision_mode=a16_w16)", # conf_16_S
    "quantization_param({*normalization*}, precision_mode=a16_w16)",  # conf_16_S
    "quantization_param({*softmax*}, precision_mode=a16_w16)",  # conf_16_S
    "quantization_param({*ew_add*}, precision_mode=a16_w16)", # conf_16_S, conf_16_C
    "quantization_param({*ew_mult*}, precision_mode=a16_w16)", # conf_16_S, conf_16_C
    "quantization_param({*output*}, precision_mode=a16_w16)", # conf_16_S, conf_16_C
    "pre_quantization_optimization(matmul_correction, layers={*matmul*}, correction_type=zp_comp_block)", # all
    "performance_param(compiler_optimization_level=2)",  # all
    "context_switch_param(allow_auto_merge_in_multicontext=True)",  # all
    "allocator_param(automatic_ddr=True)"  # all
]

and with it I got these SNRs at the output of each block. Note the substantial drop in SNR at the output of the self-attention block in layer 2:

Layer 1
SNR (self-attention_desc0, conf_16_S): 28.576 dB
SNR (cross-attention_desc0, conf_16_C): 25.678 dB

Layer 2
SNR (self-attention_desc0, conf_16_S): 19.357 dB
SNR (cross-attention_desc0, conf_16_C): 17.794 dB

Layer 3
SNR (self-attention_desc0, conf_8): 17.948 dB
SNR (cross-attention_desc0, conf_8): 15.827 dB

Layer 4
SNR (self-attention_desc0, conf_8): 17.513 dB
SNR (cross-attention_desc0, conf_8): 14.775 dB

Layer 5
SNR (self-attention_desc0, conf_8): 15.317 dB
SNR (cross-attention_desc0, conf_8): 13.707 dB

Layer 6
SNR (self-attention_desc0, conf_8): 13.711 dB
SNR (cross-attention_desc0, conf_8): 11.298 dB

Layer 7
SNR (self-attention_desc0, conf_8): 10.995 dB
SNR (cross-attention_desc0, conf_8): 9.581 dB

Layer 8
SNR (self-attention_desc0, conf_8): 8.935 dB
SNR (cross-attention_desc0, conf_8): 8.886 dB

Layer 9
SNR (self-attention_desc0, conf_8): 8.155 dB
SNR (cross-attention_desc0, conf_8): 8.743 dB

Apart from that, these initial SNRs are very similar to the ones I got when using the conf_8 config for all blocks:

Layer 1
SNR (sa_desc0, conf_8): 28.576
SNR (ca_desc0, conf_8): 25.313

Layer 2
SNR (sa_desc0, conf_8): 19.327
SNR (ca_desc0, conf_8): 17.981

Layer 3
SNR (sa_desc0, conf_8): 18.717
SNR (ca_desc0, conf_8): 15.987

Layer 4
SNR (sa_desc0, conf_8): 17.936
SNR (ca_desc0, conf_8): 14.323

Layer 5
SNR (sa_desc0, conf_8): 14.936
SNR (ca_desc0, conf_8): 12.439

Layer 6
SNR (sa_desc0, conf_8): 12.687
SNR (ca_desc0, conf_8): 10.219

Layer 7
SNR (sa_desc0, conf_8): 10.349
SNR (ca_desc0, conf_8): 8.820

Layer 8
SNR (sa_desc0, conf_8): 8.173
SNR (ca_desc0, conf_8): 8.190

Layer 9
SNR (sa_desc0, conf_8): 7.506
SNR (ca_desc0, conf_8): 8.134
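For reference, the SNR figures above are computed roughly like this (a minimal sketch assuming the usual signal-power over noise-power definition, comparing the quantized block output against the float reference):

```python
import numpy as np

def snr_db(reference: np.ndarray, quantized: np.ndarray) -> float:
    # Signal power of the float reference divided by the quantization
    # noise power, expressed in decibels.
    reference = reference.astype(np.float64)
    signal = np.sum(reference ** 2)
    noise = np.sum((quantized.astype(np.float64) - reference) ** 2)
    return 10.0 * np.log10(signal / noise)
```

So a drop from ~28 dB to ~19 dB between layer 1 and layer 2 means the noise power grows by almost an order of magnitude relative to the signal.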

Any idea what could be happening, and how to improve the SNRs?
Thanks in advance!

For the sequential calibration, I am doing:

with runner_infer.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    infer_results = runner_infer.infer(ctx, calib_dataset_dict)

but now I realize that this inference is an emulation; if there are non-negligible discrepancies between the emulation and the real hardware, the errors could compound when chaining many blocks. So I guess that, in my case, I should obtain the sequential calibration data using the actual Hailo accelerator.

What do you think?