Hailo-10H / DFC v5.3.0: a16_w16 on a transformer encoder is not a blanket allocator wall — it's a 3-stage cascade (attention crash → a16-conv nan exponent → needs super-defuse). What is the intended 16-bit path?

This is a follow-up to my earlier post “Transformer encoders do compile on DFC v5.3.0 — the LayerNorm-decomposition wall is a 3D-vs-4D calibration mismatch.” In that post I claimed the 16-bit path was blocked because “any a16_w16 layer crashes the backend allocator.” After a much deeper pass, that claim is wrong, and I want to correct it and ask Hailo for the intended path.

Short version: 16-bit is not categorically rejected — a16_w16 on a benign (non-attention) layer compiles fine. Instead, pushing a full encoder toward 16-bit activations hits three different walls in sequence, each of which I can characterize precisely. I never reached a valid 16-bit HEF, so I also cannot yet report whether 16-bit even recovers the retrieval quality the 8-bit HEF loses.

Environment

  • Raspberry Pi 5 (aarch64, 16 GB) + Hailo-10H (8 GB LPDDR4), HailoRT v5.3.0
  • Compile host: x86_64, DFC (hailo_sdk_client) v5.3.0 from hailo_ai_sw_suite_2026-04:1, TF 2.19.1, Python 3.10, CPU only
  • Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (XLM-R, 12 layers, hidden 384, seq 256) — a sentence-embedding encoder
  • Parsed HN: 225 layers — 72 conv, 50 normalization, 37 ew_add, 25 layer_normalization, 24 matmul, 12 softmax

1. Correction: 16-bit is not categorically rejected

Applying a16_w16 to a single benign conv (conv72, the last FFN output projection, far from any attention op) compiles end-to-end on a stock SDK:

[info] multilingual_..._v2 Successful Partition and Allocation (duration: 8m 05s)
[info] multilingual_..._v2 Successful Kernel Compilation (duration: 4s)
[info] multilingual_..._v2 Successful Compilation (duration: 8m 20s)
-> HEF 20.4 MB

So the allocator does not reject 16-bit per se. My earlier “any a16_w16 crashes” conclusion came about because all four of my original 16-bit configs applied 16-bit to the attention softmax. The real picture is a cascade of three walls:

| Stage | Where 16-bit is applied | Failure |
|---|---|---|
| Wall 1 | a16_w16 on attention (softmax) | BackendAllocatorException: ... unexpected crash (in compile()) |
| Wall 2 | a16 on conv layers | nan exponent in conv_low_sub_layer/act_op (in optimize()) |
| Wall 3 | 16-bit activations on full-width layers | ... needs super-defuse (in compile()) |

2. Wall 1 — a16_w16 on attention → allocator crash

With 16-bit on the attention softmax, compile() crashes after finding valid partitions:

[error] Failed to produce compiled graph
[error] BackendAllocatorException: Compilation failed with unexpected crash

This is the same failure reported for a LightGlue self-attention block in the thread “Compiling with 16-bit inputs crashes” (still unresolved). Keeping softmax and the attention matmuls at 8-bit avoids it, which is what let me get past this wall and reach Wall 2.
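The workaround above (16-bit everywhere except attention) can be generated rather than written by hand. This is a minimal sketch, not a tested DFC recipe: the `quantization_param(<layer>, precision_mode=<mode>)` line format is an assumption about the model-script syntax and should be checked against the DFC user guide, and `build_precision_script` plus the example layer names are hypothetical.

```python
# Layer types that stay at 8-bit to avoid the Wall 1 allocator crash.
ATTENTION_TYPES = {"softmax", "matmul"}


def build_precision_script(layers, mode="a16_w8"):
    """Return model-script lines that set `mode` on every non-attention layer.

    `layers` maps layer name -> layer type. Attention layers are skipped,
    so their precision is left untouched.
    """
    lines = []
    for name, layer_type in layers.items():
        if layer_type in ATTENTION_TYPES:
            continue  # leave softmax / attention matmul alone
        # Assumed command form; verify against the DFC documentation.
        lines.append(f"quantization_param({name}, precision_mode={mode})")
    return "\n".join(lines)


# Hypothetical layer names, for illustration only.
example_layers = {
    "conv1": "conv",
    "matmul1": "matmul",
    "softmax1": "softmax",
    "conv72": "conv",
}
script = build_precision_script(example_layers)
print(script)
```

Generating the lines from the parsed layer list keeps the attention exclusion in one place, which matters when the same sweep is repeated across several precision modes.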


3. Wall 2 — a16 on convs → nan exponent in the low sub-layer

Applying a16_w8 broadly (16-bit activations, 8-bit weights) dies in optimize(), first as:

File ".../acceleras/lossy_elements/quant_element.py", line 248, in a_b_factorize
    b_fac = np.arange(max(np.floor(target / max_a), 1), max_b + 1)
ValueError: arange: cannot compute length

and, if that factorization is guarded, as:

hailo_model_optimization.acceleras.utils.acceleras_exceptions.AccelerasNegativeSlopesError:
the exponent [[nan]] is not in range [ 7 22] for layer
multilingual_..._v2/conv4/conv_low_sub_layer/act_op

What’s happening: the a16 conv decomposition splits each conv into a high and a low sub-layer. For this model, the low sub-layer’s scale computes to zero → exponent nan. Instrumenting a full pass shows this is true for all 72 conv low sub-layers — i.e. every conv’s weights are representable in the high 8-bit part, so the low residual is ≈0. (A useful implication: a16_w8 here is effectively 16-bit activations + 8-bit weights.)

It is not a calibration artifact. The exact set of layers reported as degenerate shifts between calibration sets, but the failure itself persists with optimization_level=1 and with real, diverse calibration text (2,500 vault paragraphs). So richer calibration does not fix it.
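To make the "low residual is zero, so the scale is zero" reasoning concrete, here is a toy model of a high/low split. It is only an illustration under stated assumptions: the split rule, `split_high_low`, and `symmetric_scale` are hypothetical and are not the decomposition DFC actually performs.

```python
def split_high_low(values, low_bits=8):
    """Split signed integers into a high part and a low residual.

    The high part is the value rounded down to a multiple of 2**low_bits;
    the low residual is whatever is left over.
    """
    step = 1 << low_bits
    high = [(v // step) * step for v in values]
    low = [v - h for v, h in zip(values, high)]
    return high, low


def symmetric_scale(values, n_bits=8):
    """Scale that maps max |value| onto the largest signed n_bits level."""
    qmax = (1 << (n_bits - 1)) - 1
    return max((abs(v) for v in values), default=0) / qmax


# Values that are exact multiples of 256 live entirely in the high part.
high, low = split_high_low([256, -512, 1024])
low_scale = symmetric_scale(low)
print(low, low_scale)  # low residual is all zeros, so its scale is 0.0

# A value with a non-zero residual gives the low part a usable scale.
_, low_ok = split_high_low([300])
print(low_ok, symmetric_scale(low_ok))
```

A scale of exactly zero is the degenerate case: any later step that divides by it or takes its logarithm produces a non-finite value, which is consistent with the nan exponent reported above.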

I could only get past Wall 2 with a runtime monkey-patch. This is not a supported path; I mention it only because it reveals Wall 3. The patch forces the degenerate low sub-layers to contribute exactly zero: check_exp_range warns instead of raising, and nan_to_num(...) is applied to the exported weights. With that, optimize() completes cleanly with 16-bit activations:

[H] optimize OK. sanitized low-sublayers: 72
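The general shape of such a patch can be sketched without the SDK. Everything here is a stand-in: `FakeLayer` is not the real acceleras class, its `check_exp_range` signature is assumed, and `sanitize` only mimics the NaN-to-zero part of `nan_to_num` on a plain list.

```python
import functools
import math
import warnings


def warn_instead_of_raise(func, exc_type):
    """Wrap `func` so that `exc_type` is reported as a warning, not raised."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except exc_type as err:
            warnings.warn(f"{func.__name__} suppressed: {err}")
            return None
    return wrapper


def sanitize(weights):
    """Replace NaN entries with 0.0 so the sub-layer contributes nothing."""
    return [0.0 if math.isnan(w) else w for w in weights]


class FakeLayer:
    """Stand-in for a quantized layer; not the real acceleras class."""

    def check_exp_range(self, exponent):
        if math.isnan(exponent) or not 7 <= exponent <= 22:
            raise ValueError(f"the exponent {exponent} is not in range [7, 22]")


# Patch the class attribute, as a runtime monkey-patch would.
FakeLayer.check_exp_range = warn_instead_of_raise(
    FakeLayer.check_exp_range, ValueError
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    FakeLayer().check_exp_range(float("nan"))  # now warns instead of raising

print(len(caught), sanitize([0.5, float("nan"), -1.0]))
```

Downgrading the check to a warning keeps a record of every sub-layer that was sanitized, which is how a count like the 72 above can be collected.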

4. Wall 3 — 16-bit activations → needs super-defuse on norm / elementwise

With optimize() done, compile() then fails:

[error] Failed to optimize bucket_0 with error: Assignment of normalization1 needs super-defuse
[error] BackendAllocatorException: Compilation failed: Assignment of normalization1 needs super-defuse

Setting the normalization layers back to a8_w8 just moves the same failure to the next wide layer:

[error] BackendAllocatorException: Compilation failed: Assignment of mul_and_add1 needs super-defuse

Interpretation: a full-width 16-bit activation tensor ([SEQ=256, HIDDEN=384]) exceeds a single resource assignment and needs to be super-defused. But DefuseType (.../script_parser/commands.py) only exposes conv variants — SUPER_CONV, SUPER_DW, SUPER_DECONV — so normalization / mul_and_add / elementwise layers appear to have no applicable super-defuse, and the allocator errors out instead of splitting them. Excluding one just exposes the next.
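The raw size of the tensor in question is easy to work out. The sketch below is plain arithmetic on the shapes given above; it assumes nothing about where the allocator's actual threshold lies, which is not known, and `activation_bytes` is a hypothetical helper.

```python
def activation_bytes(seq_len, hidden, bits):
    """Size in bytes of one [seq_len, hidden] activation tensor."""
    return seq_len * hidden * bits // 8


HIDDEN = 384
for seq_len in (256, 128, 64):
    a8 = activation_bytes(seq_len, HIDDEN, 8)
    a16 = activation_bytes(seq_len, HIDDEN, 16)
    print(f"SEQ={seq_len}: a8={a8 // 1024} KiB, a16={a16 // 1024} KiB")
```

At SEQ=256 the tensor is 96 KiB in 8-bit and 192 KiB in 16-bit, so halving SEQ to 128 brings the 16-bit tensor back to the size the 8-bit build already handles. Whether that is enough to avoid the super-defuse requirement is untested.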

Net result: 16-bit activations do not compile for this encoder at SEQ=256 on DFC v5.3.0. I never obtained a valid 16-bit HEF, so the open quality question remains unmeasured: does 16-bit recover the retrieval quality that the 8-bit HEF loses (70.8% top-1 for the 8-bit HEF vs 100% for FP32)?


5. What I tried (14 compiles, summarized)

  • a16_w16 on one benign conv → compiles (§1).
  • a16_w16 / a16_w8 on attention → Wall 1.
  • a16_w8 global, optimization_level 0 and 1, synthetic and real vault calibration → Wall 2 (nan), calibration-independent.
  • Per-layer exclusion of degenerate convs → whack-a-mole; the set shifts with calibration.
  • Runtime sanitize of degenerate low sub-layers → optimize() passes → Wall 3.
  • Exclude normalization from 16-bit → Wall 3 moves to mul_and_add.

6. Questions / requests for Hailo

  1. What is the intended way to compile 16-bit activations for a transformer encoder on the Hailo-10H? Is a16_w16/a16_w8 on the attention softmax/matmul expected to crash the allocator (Wall 1), and is there a supported config that lets 16-bit attention through — or is that only fixed in a later DFC?
  2. The a16 conv low sub-layer nan exponent (Wall 2): is a zero-valued low residual (weights fully representable in the high 8-bit part) a known degenerate case? Is there a supported directive to make such a low sub-layer contribute zero, rather than a user-side patch?
  3. needs super-defuse on normalization / mul_and_add (Wall 3): is there a supported way to (super-)defuse wide 16-bit non-conv layers — e.g. an allocator_param / compilation_param to enable automatic super-defuse — or is reducing SEQ the only option to bring the 16-bit activation footprint under the threshold?
  4. Is any of the above resolved in a DFC newer than v5.3.0 (hailo_ai_sw_suite_2026-04)?



If you need more information, please let me know.

Thanks!