I was wondering if the team has any plans or ways to run the new Gemma 4 E4B 4-bit quant on the Hailo-10H? I wish this could be run on the Pi using the AI Hat+ 2.
I agree with you. I’ve tried compiling it for the Hailo-10H myself but haven’t been able to. I hope there’s official support soon.
Get your gloves on and grab a shovel. There’s a lot of work to be done to get that model on the Hailo-10H. I started in their developer zone late last week, and two things are already clear:
1. I need more time to understand how Hailo works and how models are “optimized” for the hardware.
2. I may need to be smarter than I actually am to remove blocker #1.
I’ve been playing with the Llama3.2:3b model, and until we understand what specifically happened to that model to make it almost completely brain-dead after conversion to the HEF format, converting the Gemma 4 E4B/E2B models won’t matter.
Also, read Hailo’s press releases. Their business model is to get these chips embedded in other hardware; the consumer gadget “add-on” market is really small. Look at how long it took Raspberry Pi to become a household name. I’m a fan of keeping Hailo in business and letting them do what they need to do to keep the lights on. This means the DIY sector will need to help each other to make meaningful progress.
This would be amazing. Gemma 4 E4B would be perfect for the Hailo-10. Would really love to put that chip to work a bit more for openClaw purposes.
I’m interested in people using the Hailo-10 for openclaw. This application implies it is connected to the internet so it can “do stuff”, which implies it’s plugged in, which in turn implies the power difference (15 W vs 5 W, about 10 W) doesn’t really matter. What is your application that highlights the ability of the Hailo? A 16 GB Pi 5 can do what the Hailo can do, at a higher power draw, with the added benefit that you can use regular Ollama models instead of waiting for Hailo’s format.
Hello everyone,
I’m playing around a bit with Gemma (e2b). The problem with almost every model on the Hailo-10 is memory bandwidth: it’s just above 17 GB/s (same as the Raspberry Pi 5). With a 2.5 GB model, that means a theoretical maximum of around 17.5 / 2.5 = 7 TPS. Too few for productive use.
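To make that ceiling concrete, here is the back-of-envelope arithmetic as a tiny script (the numbers are just the rough figures quoted above, not measurements):

```python
# Bandwidth-bound decoding: every generated token must stream all model
# weights from DRAM once, so tokens/s <= bandwidth / model size.
BANDWIDTH_GB_S = 17.5  # rough Hailo-10 figure quoted above (~Pi 5 level)
MODEL_SIZE_GB = 2.5    # e.g. a quantized Gemma e2b

print(f"theoretical max: {BANDWIDTH_GB_S / MODEL_SIZE_GB:.0f} TPS")  # 7 TPS
```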
But the Hailo provides raw computing power; the NPU spends 99.8% of its time just waiting for the weights. This is where speculative tree decoding comes into play. Counting raw NPU compute alone, verifying 128 (or 256) tokens in parallel shouldn’t be a problem.
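For anyone unfamiliar, here is a minimal sketch of the verification step for a plain linear draft (the tree case is the same idea plus an attention mask). Greedy acceptance only, no sampling; the function name and shapes are my own:

```python
import numpy as np

def accepted_prefix(target_argmax: np.ndarray, draft_tokens: np.ndarray) -> int:
    """One batched NPU pass scores all draft positions at once (the weights
    stream from DRAM only once for the whole batch). Accept the longest
    prefix where the target model's greedy pick agrees with the draft."""
    mismatches = np.nonzero(target_argmax != draft_tokens)[0]
    return len(draft_tokens) if mismatches.size == 0 else int(mismatches[0])

# Each pass yields accepted_prefix(...) + 1 tokens: the accepted draft
# tokens, plus the target model's own token at the first mismatch. That
# is how one weight-streaming pass produces several tokens instead of one.
```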
Points to consider:
- shared KV cache (otherwise there would be a huge issue, at least for SRAM, on top of the added memory bandwidth cost)
- with 128 (or 256) predictions, an n-gram model is the only feasible draft predictor I can think of
- tree depth/width should adapt to the job (chat [language] vs. coding/agent work), so the attention matrix for the whole tree should be an input to the network to keep it flexible. Injecting weight matrices via the low-level API would be an alternative option, but I don’t want to mess around with that.
- the n-gram predictor should provide the tree attention matrix plus all 128 draft tokens (see the sketch after this list)
- the n-gram LUT should be calculated on the Pi’s CPU (as I understand it, the NPU is not flexible enough to do that quickly)
- especially for Gemma, the embedding tables (including the inner ones) should be provided by the CPU, i.e. already-embedded vectors for the whole text input. This saves some memory for the number crunching.
- as I understand it, Hailo uses two models for its LLMs: one (prefill) optimized for batch processing and one for token-by-token inference. Since every inference is a batch in my scheme, I don’t need a separate “inference model”; I use the same model for prefill and for batched verification. That halves the memory space requirements compared to the current Hailo LLM models.
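Here is the sketch referenced above: a CPU-side n-gram LUT plus a token-tree drafter, which together produce exactly what the NPU would need per step, the draft tokens and the tree attention mask. All names and the tree layout are my own invention; a real version would cap the LUT size and update it incrementally:

```python
from collections import defaultdict

def build_ngram_lut(tokens, n=3, top_k=2):
    """Map each (n-1)-token context to its most frequent successors."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        ctx, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        counts[ctx][nxt] += 1
    return {ctx: sorted(succ, key=succ.get, reverse=True)[:top_k]
            for ctx, succ in counts.items()}

def draft_tree(lut, context, depth=3, branch=2):
    """Chain LUT lookups into a token tree. Returns the flat token list
    and each node's parent index (-1 = attaches to the real prefix)."""
    tokens, parents = [], []
    frontier = [(-1, tuple(context))]
    for _ in range(depth):
        nxt = []
        for parent, ctx in frontier:
            for tok in lut.get(ctx, [])[:branch]:
                tokens.append(tok)
                parents.append(parent)
                nxt.append((len(tokens) - 1, ctx[1:] + (tok,)))
        frontier = nxt
    return tokens, parents

def tree_mask(parents):
    """Tree attention mask: node i may attend to node j iff j == i or j
    is an ancestor of i (every node also attends to the real prefix)."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask
```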
For coding and agents this should give some speedup…
Currently I’m messing around with ONNX GraphSurgeon, the Hailo DFC, and some alternative LLM models (Qwen). But the struggle with the whole setup is real, so maybe the next DFC iterations will provide more helpful tools and guides for LLM compilation. It would also be nice to know the actual hardware constraints (SRAM size, available atomic operations, int4/int8 capabilities, …) to judge whether this is implementable.
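In case it helps anyone in the same trenches, this is roughly how I start poking at a model with ONNX GraphSurgeon before handing it to the DFC. The filenames are placeholders and the actual surgery is model-specific:

```python
from collections import Counter

import onnx
import onnx_graphsurgeon as gs  # pip install onnx-graphsurgeon

# Load the graph and tally the op types, to spot layers the Hailo
# compiler is likely to reject before wasting a long DFC run.
graph = gs.import_onnx(onnx.load("model.onnx"))
print(Counter(node.op for node in graph.nodes).most_common())

# After cutting/replacing unsupported nodes by hand, drop anything
# left dangling and re-export the patched model for the DFC.
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_patched.onnx")
```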
What are your thoughts about this setup?
Can’t wait to be able to play with Gemma 4 on my AI Hat+ 2 (Hailo-10).
As far as I know, people are already trying to use the Hailo compiler (DFC) to convert gemma4:e4b to the .hef format, but they are encountering conversion errors because Gemma 4’s architecture is very different from version 3’s. It seems the NPU doesn’t readily understand these new layers without a major update to Hailo’s software…