This is a follow-up to my earlier post “Transformer encoders do compile on DFC v5.3.0 — the LayerNorm-decomposition wall is a 3D-vs-4D calibration mismatch.” In that post I claimed the 16-bit path was blocked because “any a16_w16 layer crashes the backend allocator.” After a much deeper pass, that claim is wrong, and I want to correct it and ask Hailo for the intended path.
Short version: 16-bit is not categorically rejected — a16_w16 on a benign (non-attention) layer compiles fine. Instead, pushing a full encoder toward 16-bit activations hits three different walls in sequence, each of which I can characterize precisely. I never reached a valid 16-bit HEF, so I also cannot yet report whether 16-bit even recovers the retrieval quality the 8-bit HEF loses.
Environment
- Raspberry Pi 5 (aarch64, 16 GB) + Hailo-10H (8 GB LPDDR4), HailoRT v5.3.0
- Compile host: x86_64, DFC (hailo_sdk_client) v5.3.0 from hailo_ai_sw_suite_2026-04:1, TF 2.19.1, Python 3.10, CPU only
- Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (XLM-R, 12 layers, hidden 384, seq 256), a sentence-embedding encoder
- Parsed HN: 225 layers, including 72 conv, 50 normalization, 37 ew_add, 25 layer_normalization, 24 matmul, 12 softmax
1. Correction: 16-bit is not categorically rejected
Applying a16_w16 to a single benign conv (conv72, the last FFN output projection, far from any attention op) compiles end-to-end on a stock SDK:
[info] multilingual_..._v2 Successful Partition and Allocation (duration: 8m 05s)
[info] multilingual_..._v2 Successful Kernel Compilation (duration: 4s)
[info] multilingual_..._v2 Successful Compilation (duration: 8m 20s)
-> HEF 20.4 MB
So the allocator does not reject 16-bit per se. My earlier “any a16_w16 crashes” conclusion came from a sampling problem: all four of my original 16-bit configs applied 16-bit to the attention softmax. The real picture is a cascade:
| Stage | Where 16-bit is applied | Failure |
|---|---|---|
| Wall 1 | a16_w16 on attention (softmax) | BackendAllocatorException: ... unexpected crash (in compile()) |
| Wall 2 | a16 on conv layers | nan exponent in conv_low_sub_layer/act_op (in optimize()) |
| Wall 3 | 16-bit activations on full-width layers | ... needs super-defuse (in compile()) |
2. Wall 1 — a16_w16 on attention → allocator crash
With 16-bit on the attention softmax, compile() crashes after finding valid partitions:
[error] Failed to produce compiled graph
[error] BackendAllocatorException: Compilation failed with unexpected crash
This is the same failure reported for a LightGlue self-attention block in the thread “Compiling with 16-bit inputs crashes” (still unresolved). Keeping softmax and the attention matmuls at 8-bit avoids it, which is what let me get past this wall and reach Wall 2.
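A minimal sketch of the "keep attention at 8-bit" configuration, written as a generator for model-script lines. It assumes the model script accepts `quantization_param(<layer>, precision_mode=<mode>)`; `precision_lines` is a hypothetical helper, and apart from conv72 the layer names in the usage example are placeholders for whatever the parsed network actually uses.

```python
# Sketch: emit one precision line per layer, keeping attention ops at 8-bit.
# Assumption: the model script accepts
#   quantization_param(<layer>, precision_mode=<mode>)
# Layer names must match the parsed network; the ones below are examples.

ATTENTION_PREFIXES = ("softmax", "matmul")  # ops that hit Wall 1 at 16-bit


def precision_lines(layer_names, wide_mode="a16_w16", attention_mode="a8_w8"):
    """Return one quantization_param line per layer name."""
    lines = []
    for name in layer_names:
        is_attention = name.startswith(ATTENTION_PREFIXES)
        mode = attention_mode if is_attention else wide_mode
        lines.append(f"quantization_param({name}, precision_mode={mode})")
    return lines


if __name__ == "__main__":
    for line in precision_lines(["conv72", "softmax1", "matmul1"]):
        print(line)
```

Emitting one line per layer avoids relying on any wildcard syntax and makes it easy to diff two configurations when a compile fails.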
3. Wall 2 — a16 on convs → nan exponent in the low sub-layer
Applying a16_w8 broadly (16-bit activations, 8-bit weights) dies in optimize(), first as:
File ".../acceleras/lossy_elements/quant_element.py", line 248, in a_b_factorize
b_fac = np.arange(max(np.floor(target / max_a), 1), max_b + 1)
ValueError: arange: cannot compute length
and, if that factorization is guarded, as:
hailo_model_optimization.acceleras.utils.acceleras_exceptions.AccelerasNegativeSlopesError:
the exponent [[nan]] is not in range [ 7 22] for layer
multilingual_..._v2/conv4/conv_low_sub_layer/act_op
What’s happening: the a16 conv decomposition splits each conv into a high and a low sub-layer. For this model, the low sub-layer’s scale computes to zero → exponent nan. Instrumenting a full pass shows this is true for all 72 conv low sub-layers — i.e. every conv’s weights are representable in the high 8-bit part, so the low residual is ≈0. (A useful implication: a16_w8 here is effectively 16-bit activations + 8-bit weights.)
It is not a calibration artifact. The set of degenerate layers shifts between calibration sets, but the failure itself persists with optimization_level=1 and with real, diverse calibration text (2,500 vault paragraphs). So richer calibration does not fix it.
I could only get past this with a runtime monkey-patch (not a supported path — flagging it only because it reveals Wall 3): force the degenerate low sub-layers to contribute exactly zero — check_exp_range → warn instead of raise, and nan_to_num(...) on the exported weights. With that, optimize() completes cleanly with 16-bit activations:
[H] optimize OK. sanitized low-sublayers: 72
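A minimal sketch of the two steps in that workaround, in plain Python. It is an illustration only, not the actual patch: `lenient_exp_check` and `zero_out_nans` are hypothetical stand-ins (the real `check_exp_range` signature is not reproduced here), and only the [7, 22] range is taken from the error message above.

```python
import math
import warnings

EXP_RANGE = (7, 22)  # range reported in the AccelerasNegativeSlopesError message


def lenient_exp_check(exponent, layer_name, exp_range=EXP_RANGE):
    """Hypothetical stand-in: warn instead of raising. True if in range."""
    lo, hi = exp_range
    ok = not math.isnan(exponent) and lo <= exponent <= hi
    if not ok:
        warnings.warn(
            f"exponent {exponent} not in range {exp_range} for {layer_name}"
        )
    return ok


def zero_out_nans(weights):
    """Replace NaN entries with 0.0 so a degenerate sub-layer adds nothing.

    Unlike numpy.nan_to_num, infinities are left untouched.
    """
    return [0.0 if math.isnan(w) else w for w in weights]
```

Returning a boolean from the check makes it possible to count how many sub-layers were sanitized, which is what the log line above reports.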
4. Wall 3 — 16-bit activations → needs super-defuse on norm / elementwise
With optimize() done, compile() then fails:
[error] Failed to optimize bucket_0 with error: Assignment of normalization1 needs super-defuse
[error] BackendAllocatorException: Compilation failed: Assignment of normalization1 needs super-defuse
Setting the normalization layers back to a8_w8 just moves the same failure to the next wide layer:
[error] BackendAllocatorException: Compilation failed: Assignment of mul_and_add1 needs super-defuse
Interpretation: a full-width 16-bit activation tensor ([SEQ=256, HIDDEN=384]) exceeds a single resource assignment and needs to be super-defused. But DefuseType (.../script_parser/commands.py) only exposes conv variants — SUPER_CONV, SUPER_DW, SUPER_DECONV — so normalization / mul_and_add / elementwise layers appear to have no applicable super-defuse, and the allocator errors out instead of splitting them. Excluding one just exposes the next.
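The size argument is simple arithmetic. The sketch below computes the raw footprint of one [SEQ, HIDDEN] activation tensor at 8 and 16 bits. The allocator's actual per-assignment limit is not known here, so this only shows the relative growth and how it scales with SEQ.

```python
def activation_bytes(seq, hidden, bits):
    """Raw size in bytes of one [seq, hidden] activation tensor."""
    return seq * hidden * bits // 8


HIDDEN = 384
for seq in (256, 128, 64):
    a8 = activation_bytes(seq, HIDDEN, 8)
    a16 = activation_bytes(seq, HIDDEN, 16)
    print(f"SEQ={seq}: a8={a8} B, a16={a16} B")
```

At SEQ=128 the 16-bit tensor is the same size as the 8-bit tensor at SEQ=256 (98,304 bytes). Whether that is enough to avoid the super-defuse requirement is untested.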
Net result: 16-bit activations do not compile for this encoder at SEQ=256 on DFC v5.3.0. I never obtained a valid 16-bit HEF, so the open quality question — does 16-bit recover the retrieval the 8-bit HEF loses (70.8% top-1 vs 100% FP32) — remains unmeasured.
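Once a 16-bit HEF exists, the comparison could be scored along these lines. This is a hedged sketch of one common top-1 retrieval metric (cosine similarity, one gold document per query). It is an assumption about the evaluation, not a description of how the figures above were produced, and `top1_accuracy` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top1_accuracy(query_embs, doc_embs, gold):
    """Fraction of queries whose most similar document is the gold one."""
    hits = 0
    for query, gold_idx in zip(query_embs, gold):
        best = max(range(len(doc_embs)), key=lambda i: cosine(query, doc_embs[i]))
        hits += best == gold_idx
    return hits / len(gold)
```

Running the same queries and documents through the FP32, 8-bit, and 16-bit models and comparing the three scores would answer the open question directly.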
5. What I tried (14 compiles, summarized)
- a16_w16 on one benign conv → compiles (§1).
- a16_w16 / a16_w8 on attention → Wall 1.
- a16_w8 global, optimization_level 0 and 1, synthetic and real vault calibration → Wall 2 (nan), calibration-independent.
- Per-layer exclusion of degenerate convs → whack-a-mole; the set shifts with calibration.
- Runtime sanitize of degenerate low sub-layers → optimize() passes → Wall 3.
- Exclude normalization from 16-bit → Wall 3 moves to mul_and_add.
6. Questions / requests for Hailo
- What is the intended way to compile 16-bit activations for a transformer encoder on the Hailo-10H? Is a16_w16 / a16_w8 on the attention softmax / matmul expected to crash the allocator (Wall 1), and is there a supported config that lets 16-bit attention through, or is that only fixed in a later DFC?
- The a16 conv low sub-layer nan exponent (Wall 2): is a zero-valued low residual (weights fully representable in the high 8-bit part) a known degenerate case? Is there a supported directive to make such a low sub-layer contribute zero, rather than a user-side patch?
- needs super-defuse on normalization / mul_and_add (Wall 3): is there a supported way to (super-)defuse wide 16-bit non-conv layers (e.g. an allocator_param / compilation_param to enable automatic super-defuse), or is reducing SEQ the only option to bring the 16-bit activation footprint under the threshold?
- Is any of the above resolved in a DFC newer than v5.3.0 (hailo_ai_sw_suite_2026-04)?
Environment: RPi5 (aarch64) + Hailo-10H, HailoRT 5.3.0; compile host x86_64, DFC v5.3.0 (hailo_ai_sw_suite_2026-04:1), TF 2.19.1, CPU-only. Model: paraphrase-multilingual-MiniLM-L12-v2 (XLM-R, 12 layers, seq 256).
If you need more info, please let me know.
Thanks!