Sorry for the possibly basic question. I understand that running LLMs typically requires significant GPU memory.
However, I’m curious if the Hailo-8L can assist in running local quantized PyTorch models in any capacity?
Our technology requires a model to be converted using the Hailo Dataflow Compiler. The workflow expects the original model in floating-point format; the Hailo Dataflow Compiler quantizes the network during the optimization step. A model that was already quantized in another framework cannot be converted, because translating from one quantization scheme to another would degrade accuracy.
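The accuracy point can be illustrated with a small, self-contained sketch (plain NumPy, not the Hailo toolchain; the scales are made-up numbers): values that were already snapped to one 8-bit grid and are then mapped onto a second, differently scaled grid pick up rounding error twice.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 1, 10_000).astype(np.float32)  # stand-in for float weights

def quantize(x, scale):
    # Symmetric 8-bit quantization: round onto the grid, then dequantize.
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

scale_a = np.abs(w).max() / 127  # scheme A (e.g. another framework's grid)
scale_b = np.abs(w).max() / 100  # scheme B with a different, assumed scale

direct = quantize(w, scale_b)                       # float -> B
chained = quantize(quantize(w, scale_a), scale_b)   # float -> A -> B

err_direct = float(np.mean((w - direct) ** 2))
err_chained = float(np.mean((w - chained) ** 2))
print(f"direct MSE:  {err_direct:.6f}")
print(f"chained MSE: {err_chained:.6f}")
```

Quantizing from float directly gives the lower error; going through a second quantization scheme compounds it, which is why the compiler wants the original floating-point model.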
LLMs have many more parameters than conventional CNNs. The Hailo Dataflow Compiler can convert models that are larger than would fit into a single device; however, for an LLM that would require 100+ context switches, which means a lot of work and RAM on the host. We have therefore designed a new product, the Hailo-10H. It has a DDR interface and local DDR memory on the module to free the host from context-switch management.
Hailo-10H M.2 Generative AI Acceleration Module
Note: As of June 2024 the Hailo-10H is not yet available to a wider audience.
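A back-of-the-envelope calculation shows where the "100+" figure comes from. The per-context weight capacity below is an illustrative assumption, not a published Hailo spec:

```python
# Why an 8B-parameter LLM forces many context switches.
# All figures are illustrative assumptions, not Hailo specifications.
PARAMS = 8_000_000_000               # LLAMA3 8B parameter count
BYTES_PER_PARAM = 1                  # int8 weights
ON_CHIP_WEIGHT_BYTES = 64 * 2**20    # assumed weight capacity per context

total_weight_bytes = PARAMS * BYTES_PER_PARAM
contexts = -(-total_weight_bytes // ON_CHIP_WEIGHT_BYTES)  # ceiling division
print(f"~{total_weight_bytes / 2**30:.1f} GiB of weights "
      f"-> about {contexts} contexts")
```

With those assumptions the weights alone split into roughly 120 contexts, and without local DDR the host would have to stream every one of them in for each token.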
That sounds great! In that case, can I convert LLAMA3 8B?
Will there still be a performance improvement, even with context switching, for a LLAMA3 model quantized by the Hailo Dataflow Compiler?
Hailo-10H looks very promising. I am looking forward to getting hands-on with the Hailo-10H.
Thanks!
Any possible price updates on this? PCIe 3.0 x4 is a much more common interface and obviously offers much more bandwidth.
I presume the onboard memory makes a massive difference compared to marshalling allocated system memory.
I guess its aim is 7B and 13B parameter models?
Welcome to the Hailo Community!
We leave prices to our sales team. You will need to wait for general availability.
The Hailo-8 does support PCIe Gen 3 with, depending on the module variant, 2 (A+E, B+M key) or 4 (M key) PCIe lanes. You can also connect just one lane, as done on the Raspberry Pi.
Yes, that is the idea.
Yes, the Hailo-10H will run large models in most cases. However, there could also be use cases where you load many small models onto a Hailo-10H and switch between them without the host having to reload them every time.
I have the same need. I interfaced the Hailo-8L with my Raspberry Pi 5, and I want to run a small LLM like Phi-3-mini (Q4).
Are you going to provide a version of the Hailo-10 that has a DIMM slot? It would be really great if you could run much larger models on these things, e.g. throw a 512 GB DIMM on there so you can run with large context and/or large models.
Welcome to the Hailo Community!
Great question. We definitely understand the appeal of being able to drop in a huge DIMM and run very large models!
That said, Hailo-10 is purpose-built for edge AI, where the main design goals are low power, low cost, small form factor, and high reliability. Because of that, the platform uses a fixed memory configuration instead of a socketed DIMM slot.
Hailo-10 supports up to 8 GB of DDR, and in most designs the memory is either integrated on the M.2 module itself or placed directly on the board next to the device. This keeps the system compact, power-efficient, and affordable.
Adding support for large DIMMs (e.g. 512 GB) would significantly increase size, power consumption, cost, and system complexity, and would shift the product away from its intended edge use case.
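A quick sizing sketch shows what a fixed 8 GB budget can hold. This counts weights only (ignoring KV cache, activations, and runtime overhead), and the bit-widths are assumptions for illustration:

```python
# Which model sizes fit, weights-only, in an assumed 8 GB DDR budget?
# Bit-widths are illustrative; KV cache and activations are ignored.
DDR_BYTES = 8 * 2**30

def weight_bytes(params, bits):
    # Storage needed for the weights at the given quantization bit-width.
    return params * bits // 8

models = [("7B", 7_000_000_000), ("8B", 8_000_000_000),
          ("13B", 13_000_000_000), ("70B", 70_000_000_000)]
for name, params in models:
    for bits in (4, 8):
        fits = weight_bytes(params, bits) <= DDR_BYTES
        print(f"{name} @ int{bits}: {'fits' if fits else 'does not fit'}")
```

Under these assumptions, 7B/8B models fit comfortably at 4-bit, 13B fits only at 4-bit, and 70B-class models are out of reach, which is consistent with the edge-focused positioning described above.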