Raspberry Pi AI HAT+ disappointment

Hailo-8 AI HAT: Disappointment with Limited Model Support Beyond Vision

Background

I recently purchased a Hailo-8 AI HAT for my Raspberry Pi 5 (£106 - nearly twice the cost of the Pi itself at £60) with the hope of accelerating AI workloads for an audio processing project. After spending several days attempting to compile models, I’ve discovered that the NPU appears to be exclusively designed for vision tasks, which was not clear from the marketing materials.

I’m writing this post to:

  1. Share my technical findings about what doesn’t work and why
  2. Provide feedback to Hailo about unclear product positioning
  3. Help other potential buyers understand the limitations
  4. Express hope that future products will support broader AI workloads

What I Tried to Accomplish

Project Goal

I’m building a system to detect news segments in radio recordings. My pipeline:

  1. Audio classification (YAMNet) - Identify speech vs music
  2. Speech-to-text (Whisper) - Transcribe speech regions
  3. Text classification (TinyBERT) - Detect news segment markers

I hoped the Hailo-8 could accelerate steps 1 and/or 3.
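
To make the pipeline concrete, here is a rough sketch of the CPU version (classify_audio, transcribe, and classify_text are hypothetical placeholders for the three models, not a real API):

# Sketch of the intended pipeline; helper names are hypothetical placeholders
def find_news_segments(wav_path):
    regions = classify_audio(wav_path)                     # step 1: YAMNet, speech vs music
    speech_regions = [r for r in regions if r.label == "speech"]
    transcripts = [transcribe(r) for r in speech_regions]  # step 2: Whisper
    labels = [classify_text(t) for t in transcripts]       # step 3: TinyBERT
    return [(t, l) for t, l in zip(transcripts, labels) if l != "OTHER"]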


Attempt 1: YAMNet Audio Classification

What YAMNet Is

  • Google’s audio event classifier
  • Input: 1D audio waveform (15,600 samples)
  • Output: 521 audio event classes (speech, music, etc.)
  • Model size: ~4MB
  • Architecture: CNN-based, very lightweight
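
For reference, running YAMNet on the CPU via TensorFlow Hub is trivial (a minimal sketch; the placeholder waveform is just silence):

import numpy as np
import tensorflow_hub as hub

# Load YAMNet and run it on ~0.975 s of 16 kHz audio (15,600 samples)
model = hub.load("https://tfhub.dev/google/yamnet/1")
waveform = np.zeros(15600, dtype=np.float32)  # placeholder: silence
scores, embeddings, spectrogram = model(waveform)
print(scores.shape)  # (frames, 521) class scores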

Why I Thought It Would Work

  • YAMNet uses convolutional layers (CNNs)
  • Small model (4M parameters)
  • Hailo documentation mentions “various neural network architectures”
  • TensorFlow model available (Hailo supports TensorFlow)

Compilation Attempt

# Download the SavedModel archive from TF Hub, extract, convert to ONNX
wget -O yamnet.tar.gz "https://tfhub.dev/google/yamnet/1?tf-hub-format=compressed"
mkdir yamnet_savedmodel && tar -xzf yamnet.tar.gz -C yamnet_savedmodel
python3 -m tf2onnx.convert --saved-model yamnet_savedmodel --output yamnet.onnx

# Hailo compilation failed (my wrapper script around the Dataflow Compiler)
python3 hailo_compile.py

Error Encountered

TypeError: object of type 'NoneType' has no len()

The Hailo parser failed because:

  • YAMNet uses 1D input (raw audio waveform)
  • Hailo expects 2D/3D input (images: Height × Width × Channels)
  • The compiler cannot parse 1D sequential data

Technical Root Cause

The Hailo-8 NPU architecture is fundamentally designed for spatial operations on 2D/3D tensors:

  • Expects inputs shaped like images: [batch, height, width, channels]
  • YAMNet input: [batch, 15600] (1D waveform)
  • Incompatible at the architectural level

Even though YAMNet uses CNNs (which Hailo supports), the 1D nature of audio waveforms makes it incompatible with Hailo’s spatial processing paradigm.
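
You can see the mismatch directly by inspecting the exported graph (a quick check with the onnx package, assuming the tf2onnx export above succeeded):

import onnx

# Print the graph inputs the Hailo parser has to work with
model = onnx.load("yamnet.onnx")
for inp in model.graph.input:
    dims = [d.dim_value or d.dim_param for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)  # a 1-D waveform, e.g. ['batch', 15600] - no H x W x C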


Attempt 2: TinyBERT Text Classification

What TinyBERT Is

  • Distilled BERT variant (4M parameters, ~16MB)
  • Input: Text sequences (tokenized)
  • Output: 3 classes (NEWS_START, NEWS_END, OTHER)
  • Use case: Classify transcript sentences to find news segments

Why I Thought It Might Work

  • Hailo’s website mentions “transformer support”
  • Small enough to fit in NPU memory (4M params)
  • Transformers use matrix operations (similar to CNNs)
  • Fixed-size inputs (not autoregressive like LLMs)
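
The last point is easy to demonstrate: a BERT classifier consumes a fixed-shape tensor, unlike an autoregressive LLM (a sketch using the public TinyBERT checkpoint, which may differ from the exact variant I planned to use):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
enc = tok("Headlines at six o'clock.", padding="max_length",
          max_length=128, truncation=True, return_tensors="np")
print(enc["input_ids"].shape)  # (1, 128) - fixed shape, no autoregression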

Research Findings

After searching Hailo’s Model Zoo and community forums:

Model Zoo:

  • 0 NLP models listed
  • 0 text classification models
  • 0 BERT variants
  • All models are vision-only (YOLO, ResNet, EfficientNet, etc.)

Community Forums:

  • Multiple users asked about BERT/DistilBERT - no successful implementations
  • Official Hailo response: “Language models too large for Hailo-8L”
  • For Hailo-8 text questions: Users redirected to Hailo-10H instead
  • “Transformer support” refers to Vision Transformers (ViT for images), not NLP

Missing Operators:
Checked Hailo’s supported operator list - no mention of:

  • Attention mechanisms (core of BERT)
  • Token embeddings
  • Layer normalization (different from batch norm)
  • Sequential processing primitives

Technical Root Cause

BERT architecture is incompatible with Hailo-8:

  1. Embedding Layers: BERT starts with large vocabulary lookup tables

    • 30,522 vocab × 128 hidden_dim ≈ 3.9M parameters just for embeddings
    • Hailo optimized for convolutional operations, not embedding lookups
  2. Self-Attention: Multi-head attention requires (see the sketch after this list):

    • Q, K, V projections
    • Scaled dot-product attention
    • Softmax over sequence length
    • Not in Hailo’s documented operator support
  3. Sequential vs Spatial: Text is 1D sequential data

    • Hailo designed for 2D spatial data (images)
    • Architecture mismatch at fundamental level
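
To make concrete what is missing, this is the scaled dot-product attention at the core of every BERT layer, written out in plain NumPy (a textbook sketch, not Hailo or BERT code):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Softmax over the sequence axis of Q @ K^T, then a weighted sum of V -
    # none of which appears in Hailo's documented operator list
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

seq_len, d_k = 128, 64
Q = K = V = np.random.randn(seq_len, d_k)
print(scaled_dot_product_attention(Q, K, V).shape)  # (128, 64)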

What Actually Works on Hailo-8

Based on the Model Zoo, these models are supported:

  • Object Detection: YOLO, SSD, RetinaNet
  • Image Classification: ResNet, EfficientNet, MobileNet
  • Segmentation: U-Net, DeepLab
  • Pose Estimation: PoseNet, OpenPose
  • Vision Transformers: ViT (for images, not text)

Common pattern: All inputs are 2D/3D images (H×W×C format)


My Disappointment

1. Misleading Product Positioning

The Hailo-8 is marketed as an “AI accelerator” with support for “various neural network architectures” and “transformers.”

Reality:

  • It’s specifically a vision AI accelerator
  • “Various architectures” means CNN variants (ResNet, YOLO, etc.)
  • “Transformer support” means Vision Transformers (ViT), not NLP

This should be stated clearly upfront. The product is excellent at what it does, but what it does is only vision.

2. The Compiler Appears Pointless

Hailo provides a “Dataflow Compiler” to compile custom models. This sounds promising - “bring your own model!”

Reality:

  • The compiler only works with vision models
  • If your input isn’t an image, it will fail
  • The Model Zoo has zero examples of non-vision AI
  • Community has zero success stories with audio or text models

The compiler gives the illusion of flexibility when it’s actually quite restricted. Why provide compilation tools if they only work for the same narrow category of models that already have pre-compiled versions?

3. Cost vs. Functionality

  • Raspberry Pi 5 (8GB): £60
  • Hailo-8 AI HAT: £106
  • Ratio: NPU costs 177% of the computer it plugs into

For £106, I expected broader AI acceleration capability. Instead, I have:

  • ✗ Cannot accelerate audio classification (YAMNet)
  • ✗ Cannot accelerate text classification (BERT)
  • ✗ Cannot accelerate speech-to-text (Whisper - “might work” but not available)
  • ✓ Can accelerate… camera feeds I don’t have

The NPU costs more than my computer, but I can’t use it for my use case.

4. Documentation Gaps

What would have saved me days of effort:

  • Clear statement: “Hailo-8 supports vision models only”
  • Explicit list of supported input types: “Images (H×W×C) only”
  • FAQ: “Can I run audio models?” → “No, 1D inputs not supported”
  • FAQ: “Can I run NLP models?” → “No, use Hailo-10H instead”

Instead, I had to:

  • Spend hours setting up WSL2 and compilation environment
  • Download and install 489MB compiler
  • Attempt compilations that were doomed to fail
  • Search forums to discover others had the same issues
  • Piece together that it’s vision-only from absence of counter-examples

Technical Questions for Hailo

1. Why Can’t 1D Inputs Be Supported?

A 1D CNN over an audio waveform is conceptually just a 2D CNN over an image with height=1. Could the compiler treat it as such?

Example: Reshape [15600] → [1, 15600, 1] (treat as a 1-pixel-tall image)
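
In NumPy terms (purely illustrative):

import numpy as np

waveform = np.random.randn(15600).astype(np.float32)  # stand-in for real audio
img_like = waveform.reshape(1, 15600, 1)              # H=1, W=15600, C=1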

Is this a fundamental hardware limitation, or just a compiler limitation?

2. Why No NLP Model Examples?

Vision Transformers (ViT) work on Hailo-8. These use:

  • Self-attention (same as BERT)
  • Layer normalization (same as BERT)
  • Matrix operations (same as BERT)

Question: If ViT works, why can’t BERT? Is it just the embedding layer? Could embeddings be pre-computed on CPU?
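
Pre-computing the embeddings on the CPU certainly looks feasible (a sketch using the public TinyBERT checkpoint; whether the remaining encoder layers would then compile is exactly my question):

import torch
from transformers import AutoTokenizer, AutoModel

name = "huawei-noah/TinyBERT_General_4L_312D"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

ids = tok("News headlines at six.", return_tensors="pt")["input_ids"]
with torch.no_grad():
    embeds = model.embeddings(input_ids=ids)  # CPU-side vocabulary lookup
print(embeds.shape)  # (1, seq_len, 312) - ready for the encoder stack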

3. What About Spectrograms?

Audio can be converted to spectrograms (2D images). Could Hailo-8 run:

  • Audio → mel-spectrogram (CPU) → CNN classifier (NPU)?

This would be 2D input, which Hailo supports. Any examples of this approach?
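
A minimal sketch of the CPU half of that idea, using librosa (the filename and mel settings are placeholders):

import numpy as np
import librosa

# Audio -> log-mel spectrogram on the CPU: a 2-D "image" a CNN could consume
y, sr = librosa.load("radio.wav", sr=16000)            # placeholder file
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
S_db = librosa.power_to_db(S, ref=np.max)              # shape: (64, time_frames)
print(S_db.shape)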

4. Future Roadmap?

  • Will Hailo-8 ever support non-vision models?
  • Is Hailo-10H the only option for audio/text?
  • What’s the timeline for Whisper support (mentioned in forums)?

Constructive Suggestions

For Product Marketing

  1. State clearly: “Hailo-8 is a vision AI accelerator”
  2. List supported input types explicitly: “2D/3D image tensors only”
  3. Add FAQ section: “What models won’t work?”
  4. Comparison table: Which Hailo chip for which workload?

For Documentation

  1. Add “Unsupported Use Cases” section
  2. List operator limitations upfront
  3. Provide troubleshooting for common compilation errors
  4. Add examples of what won’t compile (not just success stories)

For Future Hardware

  1. Consider audio as a first-class use case (1D CNNs)
  2. Broader operator support (embeddings, attention for NLP)
  3. Hailo-10H should be marketed as “required” for non-vision AI

Hope for the Future

I genuinely believe Hailo makes excellent hardware for vision tasks. The chip is powerful, efficient, and well-integrated with Raspberry Pi.

My hope:

  • Future Hailo products support broader AI workloads
  • Better documentation helps users understand limitations upfront
  • Audio and text AI get the same first-class support as vision
  • The compiler becomes truly flexible (not just vision-model-only)

Conclusion

Current State:

  • Hailo-8 AI HAT: £106
  • Use cases for my project: 0
  • Time spent attempting compilation: ~8 hours
  • Result: Nothing I needed works

Lesson Learned:
The Hailo-8 is a vision AI accelerator, not a general-purpose AI accelerator. This should be stated clearly in marketing materials.

Recommendation for Future Buyers:

  • ✓ Building a security camera / object detection / pose estimation? Buy it!
  • ✗ Working with audio, text, or non-vision AI? Save your money.

I’m sharing this in hopes it helps others avoid the same disappointment, and to provide feedback to Hailo about positioning and documentation.

Has anyone had success with non-vision models on Hailo-8? I’d love to be proven wrong!


My Setup:

  • Raspberry Pi 5 (8GB)
  • Hailo-8 AI HAT (M.2)
  • Windows PC with RTX 3050 (for compilation)
  • Hailo Dataflow Compiler 3.33.0

Models Attempted:

  • YAMNet (audio classification) - Failed (1D input unsupported)
  • TinyBERT (text classification) - Not attempted; research showed it would fail

Outcome:

  • CPU-based implementation works fine (just slower)
  • NPU remains unused despite £106 investment

Hi @Roger_Walker,

Thank you for the lengthy post. It’s clear that you’ve put a lot of thought into it, despite the disappointment that you’ve experienced.

I will try to be concise here - yes, most people couple the Hailo-8 with a vision sensor (a.k.a. a camera). This is the main reason our offering is tilted towards vision-based tasks. That said, we’ve seen projects that involve radar, lidar, and more.

Specifically for your project, I believe Hailo could cover the Whisper (already supported) and YAMNet models. BERT would remain out of scope for the time being.

I took a quick stab at YAMNet and was able to ascertain that it’s fully supported on Hailo. The caveat? You need to use the unquantized model from Google. Here’s what I did:

Export the model from TF:

git clone https://github.com/tensorflow/models.git

cd models/research/audioset/yamnet

python -m venv yamnet_env

. yamnet_env/bin/activate

pip install tensorflow tensorflow_hub tensorflowjs

curl -O https://storage.googleapis.com/audioset/yamnet.h5

python export.py ./yamnet.h5 models/

Convert the TFLite model to Hailo (NOTE: I’m using a random calibration set here, since this is just a proof of concept. To get meaningful results, use real data):

hailo parser tf YAMnet/models/research/audioset/yamnet/models/tflite/yamnet.tflite --start-node-names "yamnet_frames/layer1/relu/Relu;yamnet_frames/layer1/conv/bn/FusedBatchNormV3;yamnet_frames/layer1/conv/Conv2D" --end-node-names "StatefulPartitionedCall:0" "StatefulPartitionedCall:2"

hailo optimize YAMnet/yamnet.har --use-random-calib-set

hailo compiler YAMnet/yamnet_optimized.har

Thanks - the long post was put together by Claude. I will try what you have suggested. Thanks for your help.

How would I use the Whisper model?

https://github.com/hailo-ai/Hailo-Application-Code-Examples/tree/main/runtime/hailo-8/python/speech_recognition


Thanks, I will try again.

Hi @Roger_Walker,

Thanks for the detailed post. I just wanted to add a few points regarding Whisper conversion for Hailo-8, which may also be helpful when converting other models with a similar architecture.

First of all - as you pointed out - it’s important to understand that Whisper models use log-mel spectrograms as input, which are not that different from an image and essentially shift the problem from the time domain to the vision domain.

Please take a look at the hailo-whisper repository to get an overview of the steps we performed to convert the Whisper tiny / base models for Hailo-8. In the patch file, you can see that we applied some modifications to the original PyTorch code to make the model compatible with the Hailo-8 architecture, for example:

  • Replaced Conv1D with Conv2D, which is supported on Hailo-8 (see the sketch after this list)
  • Reduced the number of input audio features to make the model smaller and easier to compile (this problem was specific to this model only)
  • Fixed the maximum sequence length for the decoder (in the export script), since Hailo-8 does not support dynamic shapes.
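
To illustrate the first point, here is a small PyTorch sketch of the Conv1D → Conv2D substitution (the dimensions are illustrative, not taken from the actual patch):

import torch
import torch.nn as nn

x = torch.randn(1, 80, 3000)                       # (batch, channels, time)
conv1d = nn.Conv1d(80, 384, kernel_size=3, padding=1)

# Same computation as a 2-D conv: treat time as width, with height = 1
conv2d = nn.Conv2d(80, 384, kernel_size=(1, 3), padding=(0, 1))
with torch.no_grad():
    conv2d.weight.copy_(conv1d.weight.unsqueeze(2))  # (384, 80, 3) -> (384, 80, 1, 3)
    conv2d.bias.copy_(conv1d.bias)

out_1d = conv1d(x)                                 # (1, 384, 3000)
out_2d = conv2d(x.unsqueeze(2)).squeeze(2)         # add/remove the H axis
print(torch.allclose(out_1d, out_2d, atol=1e-5))   # True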

Some of these modifications may be useful also for other models that are not available in the Model Zoo.

This shows that even transformer-based models that are not available in the Model Zoo can be converted from scratch, but the complexity of this operation may vary depending on the model itself.