Sorry for the possibly basic question. I understand that running LLMs typically requires significant GPU memory.
However, I’m curious if the Hailo-8L can assist in running local quantized PyTorch models in any capacity?
Our technology requires a model to be converted using the Hailo Dataflow Compiler. The workflow expects the original model in floating-point format; the Hailo Dataflow Compiler quantizes the network during the optimization step. A model that was already quantized in another framework cannot be converted, because translating from one quantization scheme to another would degrade accuracy.
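The accuracy point can be illustrated with a small, self-contained sketch (plain NumPy, not the Hailo toolchain; the scales are made-up numbers): values that were already snapped to one 8-bit grid and are then mapped onto a second, differently scaled grid pick up rounding error twice.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 1, 10_000).astype(np.float32)  # stand-in for float weights

def quantize(x, scale):
    # Symmetric 8-bit quantization: round onto the grid, then dequantize.
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

scale_a = np.abs(w).max() / 127  # scheme A (e.g. another framework's grid)
scale_b = np.abs(w).max() / 100  # scheme B with a different, assumed scale

direct = quantize(w, scale_b)                       # float -> B
chained = quantize(quantize(w, scale_a), scale_b)   # float -> A -> B

err_direct = float(np.mean((w - direct) ** 2))
err_chained = float(np.mean((w - chained) ** 2))
print(f"direct MSE:  {err_direct:.6f}")
print(f"chained MSE: {err_chained:.6f}")
```

Quantizing from float directly gives the lower error; going through a second quantization scheme compounds it, which is why the compiler wants the original floating-point model.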
LLMs have many more parameters than conventional CNNs. The Hailo Dataflow Compiler can convert models that are larger than would fit into a single device; however, for an LLM that would require 100+ context switches, which means a lot of work and RAM on the host. We have therefore designed a new product, the Hailo-10H. It has a DDR interface and local DDR memory on the module to free the host from context-switch management.
Hailo-10H M.2 Generative AI Acceleration Module
Note: As of June 2024 the Hailo-10H is not yet available to a wider audience.
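A back-of-the-envelope calculation shows where the "100+" figure comes from. The per-context weight capacity below is an illustrative assumption, not a published Hailo spec:

```python
# Why an 8B-parameter LLM forces many context switches.
# All figures are illustrative assumptions, not Hailo specifications.
PARAMS = 8_000_000_000               # LLAMA3 8B parameter count
BYTES_PER_PARAM = 1                  # int8 weights
ON_CHIP_WEIGHT_BYTES = 64 * 2**20    # assumed weight capacity per context

total_weight_bytes = PARAMS * BYTES_PER_PARAM
contexts = -(-total_weight_bytes // ON_CHIP_WEIGHT_BYTES)  # ceiling division
print(f"~{total_weight_bytes / 2**30:.1f} GiB of weights "
      f"-> about {contexts} contexts")
```

With those assumptions the weights alone split into roughly 120 contexts, and without local DDR the host would have to stream every one of them in for each token.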
That sounds great! In that case, can I convert LLAMA3 8B?
Will there still be a performance improvement, even with context switching, for a LLAMA3 model quantized by the Hailo Dataflow Compiler?
Hailo-10H looks very promising. I am looking forward to getting hands-on with the Hailo-10H.
Thanks!
Any possible price updates on this? PCIe 3.0 x4 is a much more common interface and obviously offers much more bandwidth.
I presume the onboard memory makes a massive difference compared to marshalling allocated system memory.
I guess its aim is 7B and 13B parameter models?
Welcome to the Hailo Community!
We leave prices to our sales team. You will need to wait for general availability.
The Hailo-8 does support PCIe Gen 3 with, depending on the module variant, 2 (A+E, B+M key) or 4 (M key) PCIe lanes. You can also connect just one lane, as done on the Raspberry Pi.
Yes, that is the idea.
Yes, the Hailo-10H will run large models in most cases. However, there could also be use cases where you load many small models onto a Hailo-10H and switch between them without the host having to reload them every time.
I have the same need. I interfaced the Hailo-8L with my Raspberry Pi 5, and I want to run a small LLM like Phi-3-mini (Q4).
Are you going to provide a version of the Hailo-10 that has a DIMM slot? It would be really great if you could run much larger models on these things, e.g. throw a 512 GB DIMM on there so you can run with large context and/or large models.
Welcome to the Hailo Community!
Great question. We definitely understand the appeal of being able to drop in a huge DIMM and run very large models!
That said, Hailo-10 is purpose-built for edge AI, where the main design goals are low power, low cost, small form factor, and high reliability. Because of that, the platform uses a fixed memory configuration instead of a socketed DIMM slot.
Hailo-10 supports up to 8 GB of DDR, and in most designs the memory is either integrated on the M.2 module itself or placed directly on the board next to the device. This keeps the system compact, power-efficient, and affordable.
Adding support for large DIMMs (e.g. 512 GB) would significantly increase size, power consumption, cost, and system complexity, and would shift the product away from its intended edge use case.
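A quick sizing sketch shows what a fixed 8 GB budget can hold. This counts weights only (ignoring KV cache, activations, and runtime overhead), and the bit-widths are assumptions for illustration:

```python
# Which model sizes fit, weights-only, in an assumed 8 GB DDR budget?
# Bit-widths are illustrative; KV cache and activations are ignored.
DDR_BYTES = 8 * 2**30

def weight_bytes(params, bits):
    # Storage needed for the weights at the given quantization bit-width.
    return params * bits // 8

models = [("7B", 7_000_000_000), ("8B", 8_000_000_000),
          ("13B", 13_000_000_000), ("70B", 70_000_000_000)]
for name, params in models:
    for bits in (4, 8):
        fits = weight_bytes(params, bits) <= DDR_BYTES
        print(f"{name} @ int{bits}: {'fits' if fits else 'does not fit'}")
```

Under these assumptions, 7B/8B models fit comfortably at 4-bit, 13B fits only at 4-bit, and 70B-class models are out of reach, which is consistent with the edge-focused positioning described above.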