Hi, thanks for the notifications.
Are there any plans to support Hailo conversion for models larger than Base (e.g., Small, Medium, Large)?
@hyungjun_Byun support for larger Whisper models is planned for the Hailo-10H architecture, which provides the higher throughput needed to run those models.
For Hailo-8, larger models may be supported, but the FPS will of course be limited, making them less suitable for real-time applications.
You can take a look at how we converted the tiny and base models in the hailo-whisper repo. For example, you can adapt the different scripts (export, convert_whisper_encoder, …) to add support for Whisper-Small.
Do you know if this example is already supposed to work for Hailo-10H, or not yet? Trying it now with 10H yields:
[HailoRT] [error] CHECK failed - Failed to create vdevice. there are not enough free devices. requested: 1, found: 0
[HailoRT] [error] CHECK_SUCCESS failed with status=HAILO_OUT_OF_PHYSICAL_DEVICES(74)
The check_installed_packaged.sh script shows "OK" for everything, so I'm not sure what else I can do to get this working (the driver and HailoRT were built and installed from source, as I don't think binaries for the 10H are available yet).
Hi @Kieran_Coulter,
The example works with Hailo-10H as well, but the required version for that platform is 5.x, not 4.x (which is for Hailo-8/8L). For example, you can install version 5.1.0 to test the application. Apologies for the inconvenience, we will update the documentation accordingly.
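Before re-running the example, it can help to confirm which HailoRT Python bindings are actually installed. A minimal sketch using only the standard library; the distribution name `hailort` is an assumption here, so check `pip list` for the exact name on your system:

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# "hailort" is an assumed distribution name; verify with `pip list`.
hailo_ver = installed_version("hailort")
print(hailo_ver)  # expect something like "5.1.0" for Hailo-10H, "4.x" for Hailo-8
```

If this prints a 4.x version (or None), the 5.x package for Hailo-10H is not the one being picked up.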
Anyway, since you have a Hailo-10H, we strongly recommend moving to the Speech-To-Text API, which is integrated into HailoRT and allows you to use optimized models from the Model Zoo.
Thanks Pierre. I have updated to the latest version but still get the same error.
I’m happy to move to the STT API, but I can’t find a speech recognition model in HEF format in the model zoo. Should I make my own instead? I imagine I will just need to convert a Whisper model to HEF format.
I see now, there is a hailo-whisper repository a few replies above with the instructions to convert the models ourselves.
One last question. I noticed this comment in the Conversion README:
"Since the embedding operators have been removed from the model at conversion time, they must run on the host CPU."
Is this really the case, i.e. that Whisper (as an ONNX or a HEF) cannot run on the NPU? This is a potential showstopper; if it's true I won't bother continuing with the conversion. I'm curious why this would be the case.
Let me clarify the options:
The speech_recognition application was created for Hailo-8 and then made compatible with Hailo-10H. The models used in this application were created with the hailo-whisper repo. The decoder embeddings were removed at conversion time for compatibility with the Hailo-8 architecture, and must be run on the host CPU. All the other operators run on the accelerator.
Running the embeddings on the host CPU does not add much overhead.
For Hailo-10H only, we released other models in the GenAI Model Zoo. These models can be used only with the Speech-To-Text API (and the models from the application example will not run with this API).
When using the Speech-To-Text API, operations like tokenization and embeddings will be offloaded to the Hailo-10H completely.
The Whisper-Small-genai.hef from the GenAI Model Zoo v5.2.0 (hailo_model_zoo_genai/docs/MODELS.rst at v5.2.0 · hailo-ai/hailo_model_zoo_genai · GitHub) does not reliably respect the language parameter in the Speech2Text API. Even with language="de" and task=Speech2TextTask.TRANSCRIBE, the model frequently outputs English instead of German. The Whisper-Base-genai.hef does NOT have this issue.
Environment
- Hardware: Raspberry Pi 5, 8GB + AI HAT+ 2 (Hailo-10H)
- HailoRT: 5.2.0
- Python: 3.13
- HEF files: Downloaded from hailo.ai/developer-zone (GenAI Model Zoo 5.2.0)
- Whisper-Base-genai.hef (131 MB) — works correctly
- Whisper-Small-genai.hef (388 MB) — language bug
- OS: Debian 13 (Trixie), aarch64
Reproduction
```python
import numpy as np
from hailo_platform import VDevice
from hailo_platform.genai import Speech2Text, Speech2TextTask

vdevice = VDevice()
s2t = Speech2Text(vdevice, "/path/to/Whisper-Small-genai.hef")

# audio_16k: German speech as float32, 16kHz, mono
text = s2t.generate_all_text(
    audio_data=audio_16k,
    task=Speech2TextTask.TRANSCRIBE,
    language="de",
    timeout_ms=30000,
)
```
This often outputs English instead of German.
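For completeness, here is how the `audio_16k` buffer above can be prepared. A minimal sketch using only NumPy linear interpolation; this is my own helper, not part of the Hailo API, and a proper polyphase resampler (e.g. `scipy.signal.resample_poly`) would give better quality:

```python
import numpy as np

def to_16k_mono_f32(audio, sr):
    """Convert audio at sample rate `sr` to 16 kHz float32 mono.

    Stereo input of shape (samples, channels) is averaged to mono;
    resampling is simple linear interpolation via np.interp.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:                      # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    n_out = int(round(len(audio) * 16000 / sr))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio).astype(np.float32)

one_sec_8k = np.zeros(8000, dtype=np.float32)   # 1 s of 8 kHz silence
audio_16k = to_16k_mono_f32(one_sec_8k, 8000)    # 16000 samples, float32
```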
Test Results
Test audio: German TTS (Piper de_DE-thorsten-medium), resampled to 16kHz float32 mono.
Whisper-Small GenAI (language="de", task=TRANSCRIBE):
Input: “Ich moechte gerne einen Termin vereinbaren.”
Output: “I would like to have a Termin Vereinbarer.” — English/mixed, WRONG
Input: “Wann haben Sie geoeffnet?”
Output: “When did they open?” — English, WRONG
Input: “Koennen Sie mich bitte zurueckrufen?”
Output: “koenne Sie mich bitte zurueckruven?” — German but with errors, PARTIALLY OK
With task=TRANSLATE all outputs are English (expected).
Whisper-Base GenAI (same API, same parameters) — works correctly:
Input: “Ich moechte gerne einen Termin vereinbaren.”
Output: “Ich moechte gerne einen Termin vereinbaren.” — CORRECT
Input: “Wann haben Sie geoeffnet?”
Output: “Wann haben Sie geoeffnet?” — CORRECT
Whisper-Base correctly respects language="de" every time.
Analysis
According to MODELS.rst ( hailo_model_zoo_genai/docs/MODELS.rst at v5.2.0 · hailo-ai/hailo_model_zoo_genai · GitHub ), both HEFs are compiled from the multilingual HuggingFace models (openai/whisper-base and openai/whisper-small), not the .en variants.
Since the Speech2Text API offloads “tokenization and embeddings to the Hailo-10H completely” (as noted in the forum), the language token is processed on the NPU. The Whisper-Small quantization/compilation appears to have broken the language token handling, causing the model to default to English regardless of the language parameter.
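In the upstream OpenAI Whisper decoder, the language is selected by forcing a language token into the decoder prompt (e.g. `<|startoftranscript|><|de|><|transcribe|>`), so the model never gets to "choose" English at those positions. Whether the Hailo runtime implements forcing the same way is an assumption, but a toy sketch of the mechanism (with made-up token ids, not real Whisper ids) shows why a correct implementation cannot default to English:

```python
import numpy as np

# Toy token ids for illustration only; real Whisper ids differ.
SOT, LANG_DE, LANG_EN, TRANSCRIBE = 0, 1, 2, 3

def decode_step(logits, step, forced_prefix):
    """Greedy decoding with a forced prompt prefix, in the spirit of
    Whisper's forced decoder ids: while inside the prefix, the model's
    logits are ignored entirely and the forced token is emitted."""
    if step < len(forced_prefix):
        return forced_prefix[step]
    return int(np.argmax(logits))

# Even if the raw logits strongly prefer <|en|>, the forced prefix wins:
logits = np.array([0.1, 0.2, 5.0, 0.3])  # argmax alone would pick LANG_EN
prefix = [SOT, LANG_DE, TRANSCRIBE]
print(decode_step(logits, 1, prefix))  # LANG_DE (1), forced
print(decode_step(logits, 4, prefix))  # LANG_EN (2), free-running argmax
```

If the Small HEF drifts to English despite this, the language token is either not being forced at all for that model, or the quantized decoder is effectively ignoring it downstream.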
Expected Behavior
language="de" with task=Speech2TextTask.TRANSCRIBE should consistently produce German transcription, as it does with Whisper-Base GenAI.
Current Imperfect Workaround
Using Whisper-Base-genai.hef instead. On 8kHz telephone audio we measure:
- WER: 15.7% (German)
- Latency: 424ms average
- language="de" works reliably
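For anyone reproducing the WER figure above: WER is the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the reference word count. A small self-contained sketch, independent of any Hailo API:

```python
def word_error_rate(ref, hyp):
    """Word error rate via dynamic-programming edit distance over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(r)][len(h)] / max(len(r), 1)

print(word_error_rate("wann haben sie geoeffnet", "wann haben sie geoeffnet"))  # 0.0
print(word_error_rate("ich moechte einen termin", "ich mochte einen termin"))   # 0.25
```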