Automatic Speech Recognition Pipeline with Whisper model

HerrB92 · April 23, 2026, 11:51am

It was easier than I remembered (and @user540 already posted some tests above as well):

Installed Hailo Apps (the version before the current one)
Updated to HailoRT 5.2.0 and Model Zoo 5.2.0 (as described here)
Downloaded Whisper-Small.hef and changed the model in the config file (as described here)
Used a slightly modified version of hailo-apps/hailo_apps/python/gen_ai_apps/voice_assistant/voice_assistant.py (without the LLM).

The changed part in voice_assistant.py (LLM not actually used):

def on_audio_ready(self, audio):
    self.abort_event.clear()

    # 1. Transcribe (language forced to German)
    user_text = self.s2t.transcribe(audio, language="de")
    if not user_text:
        print("No speech detected.")
        return

    print(f"\nYou: {user_text}")

    # 2. Output directly via TTS (no LLM)
    if self.tts:
        self.tts.clear_interruption()
        self.tts.queue_text(user_text.strip())

    # 3. Handshake: wait until TTS is finished, then listen again
    if self.interaction:
        try:
            self.interaction.restart_after_tts()
        except Exception:
            # Errors during the restart are deliberately ignored here
            pass
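
The handler above only touches a few attributes (s2t, tts, interaction, abort_event), so the echo path can be dry-run without the Hailo device. The following is a minimal sketch under that assumption: FakeS2T, FakeTTS and EchoAssistant are hypothetical stand-ins written for this example and are not part of hailo-apps; the real classes may take other arguments or behave differently.

```python
import threading


class FakeS2T:
    """Hypothetical stand-in for the speech-to-text wrapper: returns canned text."""

    def __init__(self, text):
        self.text = text

    def transcribe(self, audio, language=None):
        return self.text


class FakeTTS:
    """Hypothetical stand-in for the TTS wrapper: records what would be spoken."""

    def __init__(self):
        self.queued = []

    def clear_interruption(self):
        pass

    def queue_text(self, text):
        self.queued.append(text)


class EchoAssistant:
    """Hypothetical container holding only the attributes the handler uses."""

    def __init__(self, s2t, tts=None, interaction=None):
        self.s2t = s2t
        self.tts = tts
        self.interaction = interaction
        self.abort_event = threading.Event()

    def on_audio_ready(self, audio):
        self.abort_event.clear()

        # 1. Transcribe
        user_text = self.s2t.transcribe(audio, language="de")
        if not user_text:
            print("No speech detected.")
            return

        print(f"\nYou: {user_text}")

        # 2. Echo the transcript via TTS (no LLM)
        if self.tts:
            self.tts.clear_interruption()
            self.tts.queue_text(user_text.strip())

        # 3. Restart listening; errors are ignored as in the original handler
        if self.interaction:
            try:
                self.interaction.restart_after_tts()
            except Exception:
                pass


tts = FakeTTS()
assistant = EchoAssistant(FakeS2T(" Dies ist ein Test. "), tts)
assistant.on_audio_ready(b"")  # the audio payload is ignored by the fake
print(tts.queued)
```

With stand-ins like these, the transcript handling can be checked separately from the model, which helps to rule out the modified handler as the source of the mixed-language output.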

These are the test results (menu output such as “Press SPACE to start/stop recording.” omitted):

python -m hailo_apps.python.gen_ai_apps.voice_assistant.voice_assistant
2026-04-23 13:21:07.779126749 [W:onnxruntime:Default, device_discovery.cc:325 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:92 ReadFileContents Failed to open file: "/sys/class/drm/card1/device/vendor"
Initializing AI components... (This might take a moment)
INFO | common.core | Using default model: Whisper-Small
INFO | common.core | Found HEF in resources: /usr/local/hailo/resources/models/hailo10h/Whisper-Small.hef
INFO | common.core | Found HEF in resources: /usr/local/hailo/resources/models/hailo10h/Qwen2.5-1.5B-Instruct.hef
✅ AI components ready!

==================================================
      Voice Assistant
==================================================
...

INFO | voice_processing.audio_player | Audio output stream started.
🔴 Recording started. Press SPACE to stop.
Processing... Please wait.

You:  Mal sehen, was jetzt passiert. Ich bin sehr gespannt. This is a test. Dies ist ein Test Understood.

Press SPACE to start recording.
...
Processing... Please wait.

You:  Mal sehen, was jetzt passiert. I am very curious. This is a test

In both cases, the same sentence in German was used:
“Mal sehen, was jetzt passiert. Ich bin sehr gespannt. Dies ist ein Test.”
(Translation: “Let’s see what happens. I am very curious. This is a test.”)

“Dies ist ein Test” came out as “This is a test” in both runs. In the first run it appeared twice, once in English and once in German, followed by an added “Understood” (which was not spoken).
“Ich bin sehr gespannt” was transcribed correctly in the first run and translated into English (“I am very curious”) in the second.

The result was similar for other texts in previous tests as well: a mixture of languages and sometimes also word fragments (for example, “curiou” instead of “curious”).
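
To make this kind of comparison repeatable across runs, the check can be scripted. The following is a rough sketch, not a proper word-error-rate metric: it only reports whether each spoken sentence shows up word for word in the transcript. The function names are made up for this example; the reference sentence and the two transcripts are copied from the output above.

```python
import re

REFERENCE = "Mal sehen, was jetzt passiert. Ich bin sehr gespannt. Dies ist ein Test."
RUNS = [
    "Mal sehen, was jetzt passiert. Ich bin sehr gespannt. This is a test. Dies ist ein Test Understood.",
    "Mal sehen, was jetzt passiert. I am very curious. This is a test",
]


def words(text):
    """Lowercase the text and return its words without punctuation."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(haystack, needle):
    """True if the word list `needle` occurs contiguously in `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def sentence_hits(reference, hypothesis):
    """Map each reference sentence to whether the transcript contains it verbatim."""
    hyp = words(hypothesis)
    hits = {}
    for sentence in re.split(r"[.!?]+", reference):
        ref_words = words(sentence)
        if ref_words:
            hits[" ".join(ref_words)] = contains_phrase(hyp, ref_words)
    return hits


for number, run in enumerate(RUNS, start=1):
    print(f"Run {number}:")
    for sentence, found in sentence_hits(REFERENCE, run).items():
        print(f"  {'ok     ' if found else 'MISSING'} {sentence}")
```

For the two transcripts above, this flags all three sentences as present in the first run (the extra English text is not penalized) and only the first sentence as present in the second run.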

The microphone is fine: I tested seven other STT solutions with the same microphone.