Running local LLM using Hailo-8L

Sorry for this possibly basic question. I understand that running an LLM typically requires significant GPU memory.

However, I’m curious if the Hailo-8L can assist in running local quantized PyTorch models in any capacity?


Our technology requires a model to be converted using the Hailo Dataflow Compiler. The workflow requires the original model to be in floating-point format, and the Hailo Dataflow Compiler quantizes the network during the optimization step. A model that was already quantized in another framework cannot be converted, because converting from one quantization scheme to another would degrade accuracy.
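To illustrate why re-quantizing an already quantized model costs accuracy, here is a minimal sketch. It uses a toy symmetric uniform quantizer, which is an assumption for illustration only and not the scheme the Hailo Dataflow Compiler actually uses. The weights and the two scales (`SCALE_A`, `SCALE_B`) are made-up values.

```python
def quantize(values, scale, bits=8):
    """Toy symmetric uniform quantizer: snap each value to the nearest
    representable multiple of `scale` (illustrative, not Hailo's scheme)."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # integer code, clamped
        out.append(q * scale)                        # back to a float value
    return out


def mean_abs_error(reference, approx):
    """Average absolute difference between two equally long lists."""
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


# Made-up floating-point weights and two hypothetical quantization grids.
weights = [0.11, -0.42, 0.37, 0.80, -0.95]
SCALE_A = 0.10  # grid of "another framework" (assumption)
SCALE_B = 0.15  # grid of the target toolchain (assumption)

# Path 1: quantize the original floats directly onto grid B.
direct = quantize(weights, SCALE_B)

# Path 2: floats were first quantized onto grid A, then converted to grid B.
requantized = quantize(quantize(weights, SCALE_A), SCALE_B)

direct_err = mean_abs_error(weights, direct)
requant_err = mean_abs_error(weights, requantized)
print(f"direct float -> B : {direct_err:.4f}")
print(f"float -> A -> B   : {requant_err:.4f}")
```

Direct quantization always picks the grid point nearest to the original float, so the two-step path can never do better and usually does worse. That is the reason for starting from the floating-point model.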

LLMs have many more parameters than conventional CNNs. The Hailo Dataflow Compiler can convert models that are larger than would fit into a single device. However, for an LLM that would require 100+ context switches, which would demand a lot of work and RAM from the host. We have therefore designed a new product called the Hailo-10H. It has a DDR interface and local DDR memory on the module to free the host from managing the context switches.
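As a rough back-of-the-envelope sketch of why the context count grows so quickly for LLMs, the estimate below divides the weight footprint by a per-context capacity. The function name `estimate_contexts` and the 32 MiB capacity are hypothetical placeholders, not Hailo specifications, and a real compiler partitions by layers and also has to account for activations, so treat the result as a crude lower bound only.

```python
def estimate_contexts(num_params, bits_per_weight, context_capacity_bytes):
    """Lower-bound estimate of how many contexts are needed to hold the weights.

    All inputs are assumptions supplied by the caller; nothing here reflects
    actual device limits.
    """
    total_bits = num_params * bits_per_weight
    weight_bytes = -(-total_bits // 8)                   # ceiling division
    return -(-weight_bytes // context_capacity_bytes)    # ceiling division


# Hypothetical inputs: an 8B-parameter model with 4-bit weights and an
# assumed (not official) 32 MiB of weight storage per context.
HYPOTHETICAL_CAPACITY = 32 * 1024**2
contexts = estimate_contexts(8_000_000_000, 4, HYPOTHETICAL_CAPACITY)
print(f"Estimated contexts (weights only): {contexts}")
```

Every context the host has to load means moving its weights across the host interface again, which is the work and RAM the on-module DDR is meant to take off the host.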

Hailo-10H M.2 Generative AI Acceleration Module

Note: As of June 2024 the Hailo-10H is not yet available to a wider audience.


That sounds great! In that case, can I convert LLAMA3 8B?
Would a LLAMA3 model quantized by the Hailo Dataflow Compiler still show a performance improvement once context switching is taken into account?

Hailo-10H looks very promising. I am looking forward to getting hands-on with the Hailo-10H.

Thanks!

Are there any price updates on this? PCIe 3.0 x4 is a much more common interface and obviously offers much more bandwidth.
I presume the onboard memory makes a massive difference compared with marshalling allocated system memory.

I guess it is aimed at 7B and 13B parameter models?

Welcome to the Hailo Community!

We leave prices to our sales team. You will need to wait for general availability.

The Hailo-8 does support PCIe Gen 3 and, depending on the module variant, 2 lanes (key A+E, key B+M) or 4 lanes (key M). You can also connect a single lane, as is done on the Raspberry Pi.
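For a sense of scale, the theoretical PCIe Gen 3 link bandwidth for those lane counts can be worked out as below. This is a sketch based only on the Gen 3 signaling rate (8 GT/s per lane) and 128b/130b encoding; it ignores protocol overhead and says nothing about what a given host or module actually achieves.

```python
# PCIe Gen 3 signals at 8 GT/s per lane and uses 128b/130b line encoding.
GEN3_GT_PER_S = 8.0
ENCODING_EFFICIENCY = 128 / 130


def gen3_bandwidth_gb_per_s(lanes):
    """Theoretical one-direction link bandwidth in GB/s, before protocol overhead."""
    return lanes * GEN3_GT_PER_S * ENCODING_EFFICIENCY / 8  # 8 bits per byte


for lanes in (1, 2, 4):
    print(f"Gen 3 x{lanes}: {gen3_bandwidth_gb_per_s(lanes):.2f} GB/s")
```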

Yes, that is the idea.

Yes, the Hailo-10H will run large models in most cases. However, there could be use cases where you load many small models onto a Hailo-10H and switch between them without the host having to reload them every time.
