Automatic Speech Recognition Pipeline with Whisper model

@Michael,

It was easier than I remembered (@user540 has already posted some tests above as well). This is what I did:

  • Installed Hailo Apps (the version before the current one)
  • Updated to HailoRT 5.2.0 and Model Zoo 5.2.0 (as described here)
  • Downloaded Whisper-Small.hef and changed the model in the config file (as described here)
  • Used a slightly modified version of hailo-apps/hailo_apps/python/gen_ai_apps/voice_assistant/voice_assistant.py that skips the LLM.

The changed part of voice_assistant.py (the transcript goes straight to TTS; the LLM is never called):

    def on_audio_ready(self, audio):
        self.abort_event.clear()

        # 1. Transcribe
        user_text = self.s2t.transcribe(audio, language="de")
        if not user_text:
            print("No speech detected.")
            return

        print(f"\nYou: {user_text}")

        # 2. Output directly via TTS (no LLM)
        if self.tts:
            self.tts.clear_interruption()
            self.tts.queue_text(user_text.strip())

        # 3. Handshake: wait until TTS is finished, then listen again
        if self.interaction:
            try:
                self.interaction.restart_after_tts()
            except Exception:
                pass

These are the test results (menu output such as “Press SPACE to start/stop recording.” omitted):

python -m hailo_apps.python.gen_ai_apps.voice_assistant.voice_assistant
2026-04-23 13:21:07.779126749 [W:onnxruntime:Default, device_discovery.cc:325 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:92 ReadFileContents Failed to open file: "/sys/class/drm/card1/device/vendor"
Initializing AI components... (This might take a moment)
INFO | common.core | Using default model: Whisper-Small
INFO | common.core | Found HEF in resources: /usr/local/hailo/resources/models/hailo10h/Whisper-Small.hef
INFO | common.core | Found HEF in resources: /usr/local/hailo/resources/models/hailo10h/Qwen2.5-1.5B-Instruct.hef
✅ AI components ready!

==================================================
      Voice Assistant
==================================================
...

INFO | voice_processing.audio_player | Audio output stream started.
🔴 Recording started. Press SPACE to stop.
Processing... Please wait.

You:  Mal sehen, was jetzt passiert. Ich bin sehr gespannt. This is a test. Dies ist ein Test Understood.

Press SPACE to start recording.
...
Processing... Please wait.

You:  Mal sehen, was jetzt passiert. I am very curious. This is a test

In both runs, I spoke the same German text:
“Mal sehen, was jetzt passiert. Ich bin sehr gespannt. Dies ist ein Test.”
(Translation: “Let’s see what happens. I am very curious. This is a test.”)

  • “Dies ist ein Test” always becomes “This is a test”. In one case, it was even detected twice, once in German and once in English, with an added “Understood” (which was not spoken).
  • “Ich bin sehr gespannt” was transcribed correctly once and translated into English once, although transcribe() was called with language="de".

Earlier tests with other texts gave similar results: a mixture of German and English, and sometimes truncated words (for example, “curiou” instead of “curious”).
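To put a number on how far the two outputs are from what was spoken, a word error rate (WER) can be computed against the reference text. The following is only a minimal sketch and not part of hailo-apps: the helper names (tokenize, edit_distance, word_error_rate) are my own, and the normalization (lowercasing, dropping punctuation) is an assumption that other WER tools may handle differently.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing in the output
                            curr[j - 1] + 1,      # extra word in the output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Edit distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)

# Reference text and the two transcripts from the test runs above.
REFERENCE = "Mal sehen, was jetzt passiert. Ich bin sehr gespannt. Dies ist ein Test."
RUNS = [
    "Mal sehen, was jetzt passiert. Ich bin sehr gespannt. "
    "This is a test. Dies ist ein Test Understood.",
    "Mal sehen, was jetzt passiert. I am very curious. This is a test",
]

for n, run in enumerate(RUNS, start=1):
    print(f"Run {n}: WER = {word_error_rate(REFERENCE, run):.2f}")
```

Because a translated word counts as a substitution, a run that silently switches to English scores badly here even if the meaning is preserved, which is the behavior I want to measure.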

The microphone is not the problem: I tested seven other STT solutions with the same microphone. :wink: