Has anyone successfully converted any text-to-speech (TTS) models to run on the Hailo-8 / Hailo-8L?

Hello all! I’m wondering if anyone has had any success converting and running text-to-speech (TTS) models on the Hailo-8 / Hailo-8L, such as FastSpeech2, VITS, VITS2, Piper VITS, or StyleTTS2?

These TTS models tend to be in the 15–100 MB range in FP32, depending on the specific model configuration, and have somewhere between 5M and 100M parameters.

They also tend to primarily use convolutional layers, just like many of the CNNs that have been successfully converted to run on Hailo modules, along with some self-attention layers.

Before I take a stab at converting a VITS or VITS2 TTS model to .hef and running it on Hailo, I was wondering whether anyone else has already attempted or succeeded at doing so, and if they can share their experience, or even their model files.

I’ve yet to see a TTS model quantized to INT8, but I’ve had success converting Piper VITS ONNX models from FP32 to FP16 without any reduction in performance. I’m curious to try some INT8 quantization-aware fine-tuning of a VITS model and converting it to .hef, but it would be great if anyone already has experience with this that they can share.
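For reference, here is roughly how I did the FP32 to FP16 conversion, a minimal sketch assuming the onnxconverter-common package; the file paths are placeholders for your own Piper VITS export:

```python
# Minimal sketch: convert an FP32 Piper VITS ONNX export to FP16.
# Assumes the onnxconverter-common package; paths below are placeholders.
import onnx
from onnxconverter_common import float16

# Load the FP32 export of the model
model_fp32 = onnx.load("piper_vits_fp32.onnx")

# Convert weights and activations to FP16, keeping the model's
# input/output tensors in FP32 so the calling code doesn't change
model_fp16 = float16.convert_float_to_float16(model_fp32, keep_io_types=True)

onnx.save(model_fp16, "piper_vits_fp16.onnx")
```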

Thank you!

Hi @salizaidiai,
Hailo currently doesn’t support any TTS models.

Regards,

Thanks Omer! If you have any insights to share, I’m curious to understand why TTS models such as VITS / VITS2 cannot be converted to run on Hailo.

The model sizes and parameter counts of such models are in the ballpark of the CV models supported on Hailo, so if I had to guess, there’s something unique to Conditional Variational Autoencoder models (in comparison to Vision Transformers, vision CNNs, and Whisper models) that is not compatible with the Hailo-8’s functionality?

Are there certain kinds of layers or operations in TTS models like VITS/VITS2 that the Hailo is particularly ill-suited for? Or does the limitation come down to some other factor, like SRAM or PCIe bandwidth?

Hi @salizaidiai,
My guess is that there are unsupported operations there. Hailo was initially focused on computer vision, so the support currently implemented covers the neural operations found in CV models.
Since TTS models have a different architecture compared to standard CNN models, it’s unlikely that the Hailo SW currently supports these kinds of models. We are progressing towards supporting more networks that perform other tasks, but it’s a work in progress.

You can try to export the models you mentioned to ONNX/TFLite and check if and where the SW fails (most likely it would fail in the Parser step).
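For example, a rough sketch of running just the Parser step through the Dataflow Compiler's Python API is shown below; the model path, network name, and target architecture are placeholders, and the exact arguments may differ slightly between DFC versions:

```python
# Rough sketch: run only the Parser (translation) step of the Hailo Dataflow
# Compiler on an exported TTS ONNX file to see which layer stops translation.
# The file path and network name are placeholders for your own export.
from hailo_sdk_client import ClientRunner

runner = ClientRunner(hw_arch="hailo8")  # or "hailo8l"

try:
    runner.translate_onnx_model(
        "vits_model.onnx",   # placeholder path to the exported ONNX model
        "vits_model",        # arbitrary network name
    )
    print("Parser step succeeded")
except Exception as e:
    # The raised error typically names the first unsupported operation
    print(f"Parser failed: {e}")
```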

Regards,