We are excited to share that we’ve been working on an automatic speech recognition pipeline based on the Whisper-tiny model, running it on a Raspberry Pi with our Hailo-8/8L AI accelerator.
Given how much interest this topic has generated in the community, we wanted to provide a quick status update.
Our focus so far has been on optimizing performance to make the pipeline faster and more efficient for real-time applications. The attached video showcases some of our initial results, and we’re actively refining the system to improve latency, resource usage, and overall efficiency.
This is not an official release yet, but we plan to share a working application with the Community soon. In addition, we are evaluating larger Whisper variants to explore the trade-offs between accuracy and performance.
If you have any thoughts, suggestions, or experience with similar optimizations, we’d love to hear from you! Stay tuned for more updates as we continue improving the pipeline!
Very cool - I got this working.
Are there some instructions on how to get/train/convert whisper tiny models for other languages?
Is there a repository that shows how the en hef model was created?
I am really excited for this one. Could you please provide the YAML recipe and any required assets (ONNX, checkpoints, HEF files) for the whisper_tiny_encoder_10s_15db and decoder models used in your speech_recognition demo? I feel like I may be missing something obvious.
The Whisper model is multi-language, but for this specific project we used English audio for calibration, and only English is considered in the decoding phase.
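To make the "only English is considered in the decoding phase" part concrete, here is a minimal sketch of how Whisper decoding is typically pinned to one language: the decoder is primed with a fixed prompt of special tokens, so the language is never inferred from the audio. The token IDs below are the ones commonly reported for the multilingual Whisper tokenizer; treat them as assumptions and verify them against your own tokenizer before relying on them.

```python
# Sketch: forcing English-only decoding in Whisper by fixing the
# decoder's start-of-sequence prompt. Token IDs are assumed values
# for the multilingual tokenizer -- verify against your tokenizer.

SOT = 50258            # <|startoftranscript|>
LANG_TOKENS = {        # language tokens follow SOT; subset shown
    "en": 50259,
    "de": 50261,
    "fr": 50265,
}
TRANSCRIBE = 50359     # <|transcribe|> task token
NO_TIMESTAMPS = 50363  # <|notimestamps|>

def forced_decoder_prompt(language: str = "en", timestamps: bool = False):
    """Build the decoder prompt that pins the language and task,
    instead of letting the model detect the language from audio."""
    prompt = [SOT, LANG_TOKENS[language], TRANSCRIBE]
    if not timestamps:
        prompt.append(NO_TIMESTAMPS)
    return prompt

print(forced_decoder_prompt("en"))  # [50258, 50259, 50359, 50363]
```

Feeding this prompt to the decoder at the start of generation is what restricts the output to English; swapping the language token is the decoding-side half of supporting another language.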
We are discussing internally whether to release the conversion flow. If we do, we will update this thread with an ETA.
Hello, do you plan to provide instructions in the future on how to create models for other languages, so that everyone can configure other languages themselves?
Yes, we will provide the conversion script in the coming months. Since we are already using the multilingual version of Whisper (see here), you will simply need to calibrate the model with audio from a specific language, or from multiple languages. In our example, calibration was done only on English audio.
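As a rough illustration of the calibration step, here is a hypothetical sketch of assembling a calibration set from language-specific audio for a compiler that expects a fixed input shape. It assumes you already have a feature extractor producing Whisper log-mel spectrograms (80 mel bins, ~100 frames per second), and that the encoder takes a fixed 10 s window (matching the `whisper_tiny_encoder_10s` name); the file name and helper functions are made up for the example, not part of the official flow.

```python
# Hypothetical sketch: building a fixed-shape calibration array from
# per-utterance Whisper log-mel features. Random arrays stand in for
# real features extracted from, e.g., German audio.
import numpy as np

N_MELS = 80        # Whisper uses 80 mel bins
FRAMES_10S = 1000  # 10 s at ~100 frames/s (hop 160 at 16 kHz)

def pad_or_trim(mel: np.ndarray, frames: int = FRAMES_10S) -> np.ndarray:
    """Zero-pad (silence) or trim features to a fixed frame count."""
    if mel.shape[1] >= frames:
        return mel[:, :frames]
    pad = np.zeros((mel.shape[0], frames - mel.shape[1]), dtype=mel.dtype)
    return np.concatenate([mel, pad], axis=1)

def build_calib_set(mels, out_path="calib_set.npy"):
    """Stack per-utterance features into one float32 array and save it
    in a form a quantization/calibration tool could consume."""
    batch = np.stack([pad_or_trim(m) for m in mels]).astype(np.float32)
    np.save(out_path, batch)
    return batch.shape

# Stand-in features for four utterances of varying length:
fake_mels = [np.random.randn(N_MELS, n).astype(np.float32)
             for n in (800, 1000, 1200, 600)]
print(build_calib_set(fake_mels))  # (4, 80, 1000)
```

The point is only that language support at calibration time is a data question: swap the utterances for audio in your target language(s) and the rest of the flow is unchanged.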
@Miah_Way the installation instructions on the GitHub page are also valid for the Raspberry Pi 5. The SDK is not needed, since the app downloads a precompiled version of the model.
Also, the SDK (i.e. the DataFlow Compiler) can run only on an x86 machine, not on the Pi.