Whisper Full Release - Now Available!

Hey everyone! :waving_hand:

We’re thrilled to share some big news: our Whisper automatic speech recognition pipeline is now officially available for the community! :tada:


Our Journey So Far

A couple of months back, we gave you a sneak peek at our early work on an ASR pipeline using the Whisper-tiny model on a Raspberry Pi with our Hailo-8/8L AI accelerator. Back then, our main goal was to crank up the performance for real-time applications.

The response from all of you was absolutely incredible! Your enthusiasm truly fueled us to push harder and refine the system even further. Plus, we’ve been experimenting with larger Whisper models to find that sweet spot between accuracy and speed.


What’s New in This Full Release!

After weeks of solid optimization and refinement, we’re finally ready to drop the complete package, packed with some serious improvements:

  • Dual Model Support: We’ve gone beyond just Whisper-tiny! You can now also use the Whisper-base model, which has twice the parameters for significantly better accuracy and precision. You get to choose between:

    • Whisper-tiny: If you need lightning-fast processing. :rocket:
    • Whisper-base: For those times when you need enhanced accuracy for more demanding tasks. :bullseye:
  • Streamlit GUI: We built a super intuitive, user-friendly interface that makes getting started a breeze. Seriously, just hit a button and start recording – no complicated setup required! :exploding_head:

  • Optimized Performance: All the performance tweaks we’ve been working on are baked right in, meaning you’ll get faster processing and even better resource efficiency on your Raspberry Pi + Hailo-8/8L setup. :flexed_biceps:

  • Complete Documentation: We’ve put together a comprehensive README with step-by-step instructions. :books:


Ready to Get Started?

This release includes everything you need to dive right in:

  • A ready-to-use Streamlit application
  • Support for both Whisper models
  • Handy compilation scripts for easy setup
  • And of course, comprehensive documentation

Both models are currently tuned for English, which should cover most use cases. We’re definitely thinking about adding multi-language support in future updates, so keep that feedback coming!


What’s Next for Us

This whole journey has been amazing, and seeing how excited the community is has truly motivated us. We’re already brainstorming potential improvements and would absolutely love to hear how you’re using the pipeline in your own projects.

Go on, give it a try and let us know what you think! Your feedback is what helps us make these tools even better.

Happy building, everyone! :hammer_and_wrench:


As always, if you hit any snags or have suggestions, please don’t hesitate to reach out. The community’s input has been absolutely invaluable in getting us to this point. :folded_hands:


Hello Hailo team,
Do you have an estimated timeline for releasing a compilation script that supports fine‑tuned or multilingual Whisper models?

Hey @Hojeong_Kim ,

Welcome to the Hailo Community!

Yes, we just got the timeline for the compilation scripts and flags release: it will be at the end of August!


Dear Hailo team,

I’m truly delighted to hear this news.

Thank you for all your hard work, and I hope to get my hands on a real Hailo 10H very soon.

Have a great day!

Jusung Kang.


Hey, just wanted to check in on the conversion script.

I’ve been spending quite some time trying to fix the conversion (at least of the encoder part of the Whisper model) and running into lots of issues. I was able to convert Whisper’s Conv1d layers to Conv2d and got past the initial convolution operations, but my conversion ultimately failed at the positional embedding addition step due to tensor shape incompatibilities between my 4D preprocessing approach and Hailo’s normalization layer expectations.

Would love to see how you solved the conversion from onnx!
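For reference, the shape clash described above can be reproduced in a few lines of NumPy (shapes are illustrative for a 10 s tiny-encoder input; this is a sketch of the problem, not Hailo’s code): in a 4D (N, C, 1, T) layout, Whisper’s (n_ctx, d_model) positional table has to be transposed and broadcast before the add.

```python
import numpy as np

N, C, T = 1, 384, 1000                   # batch, d_model, mel frames (10 s)
x = np.zeros((N, C, 1, T), np.float32)   # encoder activations in a 4D NCHW layout
pos = np.zeros((T, C), np.float32)       # Whisper stores positions as (n_ctx, d_model)

# A naive add fails: (1, 384, 1, 1000) + (1000, 384) is not broadcastable.
# Transposing and reshaping to (1, C, 1, T) makes the shapes line up:
pos4d = pos.T[None, :, None, :]
y = x + pos4d
print(y.shape)  # (1, 384, 1, 1000)
```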

Hi Hailo Community!

Following the many requests we received over the last few months, we are glad to finally share the hailo-whisper repository with you!
If you’re working with speech recognition on the edge, this repo shows you how to bring OpenAI’s Whisper model to Hailo AI accelerators (Hailo-8 & Hailo-10H).

:white_check_mark: Support for Whisper tiny / base (and .en variants)
:white_check_mark: Ready-to-use tools to export, convert, and evaluate models

This repo allows you to generate compiled tiny/base models that can then be integrated into our speech recognition app, but it also serves as a guideline to bring your own fine-tuned model onto the Hailo hardware.

The project is under active development, with additional improvements planned in the coming months.
Feel free to share any thoughts or recommendations for improving the repo; your input will help the community get more value from it.

@Hojeong_Kim @Katrin_Tomanek please find the conversion scripts in the repo above.

@Katrin_Tomanek Regarding your question: as you can see in the repo (convert_whisper_decoder.py), the token embedding operator has been excluded from the decoder conversion, saved as an .npy file, and executed on the host (in the application example).
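For anyone curious what that host-side embedding step can look like, here is a minimal NumPy sketch. The dummy weights below stand in for the .npy file the conversion saves; the names and shapes are illustrative, not the repo’s actual code.

```python
import numpy as np

# Dummy stand-in for the token-embedding weights that the decoder conversion
# saves as .npy (for Whisper-tiny the real table is 51865 x 384).
vocab_size, d_model = 51865, 384
rng = np.random.default_rng(0)
token_embedding = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

def embed_tokens(token_ids, table):
    # A plain gather replaces the excluded embedding operator on the host;
    # the result is what gets fed to the decoder running on the Hailo device.
    return table[np.asarray(token_ids)]

tokens = [50258, 50259, 50359]  # e.g. start-of-transcript / language / task tokens
x = embed_tokens(tokens, token_embedding)
print(x.shape)  # (3, 384)
```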

hey @pierrem thanks so much for making the conversion script available.

I took a closer look at your documentation as well as the code to understand how the conversion is done. In doing so, I noticed that you are expecting the original OpenAI Whisper models.

In my case, I am using the Hugging Face Transformers version of Whisper for fine-tuning my models, and I then export to ONNX via Optimum (using ORTModelForSpeechSeq2Seq).

I am not able to use my converted ONNX models (produced with the process described above) in your Whisper conversion script for Hailo, because I am getting shape errors like this (this is just the encoder):

hailo_sdk_common.hailo_nn.exceptions.UnsupportedModelError: Invalid kernel shape for base conv layer base_conv1 (translated from /conv1/Conv).

Either the input shape doesn’t match the kernel shape, or the calculated groups number doesn’t match the expected ratio between kernel shape and input shape.

Kernel features: 80 Input features: 3000 Groups: 37

Any recommendations on how to address this? Would you suggest first converting my HF Transformers Whisper model to an OpenAI Whisper model (how?), or is there some sort of operation I can do on the ONNX file to make it match?

Thanks

Katrin

@Katrin_Tomanek I haven’t tried the HF model, but I assume the internal architecture is quite similar to the OpenAI one.
I think the original model should be modified to make it compilable for Hailo-8, e.g. replace Conv1d with Conv2d, reduce the number of input audio features (e.g. to 1000, for 10 s of audio), and export the decoder to ONNX with a fixed output sequence.
Please look at the patch file in the repo to see the modifications we applied to the OpenAI model.
I will take a look at the HF model next week. Do you think that using the HF model would bring an advantage compared to the OpenAI one (e.g. easier fine-tuning)?
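To illustrate the Conv1d-to-Conv2d swap mentioned above (a sketch of the idea, not the repo’s actual patch): a height-1 Conv2d loaded with the Conv1d weights unsqueezed along a dummy spatial axis reproduces the original output.

```python
import torch
import torch.nn as nn

# Whisper's first encoder conv: 80 mel bins in, 384 channels out, kernel 3.
conv1d = nn.Conv1d(80, 384, kernel_size=3, padding=1)

# Equivalent Conv2d: treat the sequence as a 1-pixel-high image.
conv2d = nn.Conv2d(80, 384, kernel_size=(1, 3), padding=(0, 1))
with torch.no_grad():
    conv2d.weight.copy_(conv1d.weight.unsqueeze(2))  # (out, in, k) -> (out, in, 1, k)
    conv2d.bias.copy_(conv1d.bias)

x = torch.randn(1, 80, 1000)             # 1000 mel frames, i.e. 10 s of audio
y1 = conv1d(x)
y2 = conv2d(x.unsqueeze(2)).squeeze(2)   # add/remove the dummy height axis
print(torch.allclose(y1, y2, atol=1e-5))  # True
```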

Sent you a PM.

And yes, HF Whisper would be much preferred – easier and more common fine-tuning using HF Transformers…

I have just gotten the base Whisper model to work on my Hailo 8. I’m looking forward to better models being run on this Hailo module.

Looking forward to more additions to the GitHub repo. I haven’t seen any new changes posted there in a few months, so I’m just curious whether this is still being worked on. What can we in the community do to help with this development? I’m looking for ‘Whisper Small’-quality models running on it.