@jeff.singleton Btw you can load the model with something like this:
from pathlib import Path
from typing import List

from rich.console import Console
from hailo_platform import VDevice, HailoSchedulingAlgorithm
from hailo_genai import LLM  # GenAI package/import path may differ per SDK release

console = Console()

class HailoAgent:
    def __init__(self, hef_path: str, tools_metadata: List[dict]):
        self.hef_path = Path(hef_path)
        self.tools_metadata = tools_metadata
        self.llm = None
        console.print("[yellow]Initializing Qwen 1.5B agent (Hailo-10H)...[/yellow]")
        try:
            params = VDevice.create_params()
            params.scheduling_algorithm = HailoSchedulingAlgorithm.ROUND_ROBIN
            self.vdevice = VDevice(params)
            # Initialization with lora_name (verified via shell)
            self.llm = LLM(
                self.vdevice,
                str(self.hef_path),
                lora_name="huggingface_lora_adapter",
                optimize_memory_on_device=True,
            )
            console.print("[bold green]SUCCESS:[/bold green] Agent hardware layer ready!")
        except Exception as e:
            console.print(f"[red]Initialization error:[/red] {e}")
            raise
It works for me with ORT 5.2.0.
Gemini's explanation of why you have to do it this way with the LoRA adapter:
Here is the technical explanation of the Prefill vs. TBT (Token-By-Token) split and why they can’t simply be one single big block in memory.
1. The Two Phases of “Thinking”
An LLM doesn’t process everything the same way. It operates in two distinct mathematical stages:
- The Prefill Phase (...__prefill): When you send a prompt (e.g., “What is the CPU temp?”), the model has to digest the entire sentence at once. This requires high parallel processing power. The hardware needs to look at every word and understand the context. This phase is “compute-heavy.”
- The TBT Phase (...__tbt / Decode): Once the model understands the prompt, it generates the answer one word (token) at a time. To generate the next word, it only needs to look at the previous words. This phase is “memory-bandwidth heavy” and happens in a loop.
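One toy way to see the asymmetry, counting attention comparisons in pure Python rather than real FLOPs:

```python
# Toy cost model of the two phases (no ML libraries).
# Prefill attends every prompt token to every earlier token at once;
# decode (TBT) attends only the single new token to the context so far.

def prefill_cost(prompt_len: int) -> int:
    # Token i attends to tokens 0..i  ->  1 + 2 + ... + n comparisons.
    return sum(i + 1 for i in range(prompt_len))

def decode_step_cost(context_len: int) -> int:
    # The one new token attends to the whole current context.
    return context_len + 1

prompt_len = 8
total = prefill_cost(prompt_len)            # one big parallel burst
for step in range(4):                       # then a loop, one token at a time
    total += decode_step_cost(prompt_len + step)

print(prefill_cost(8))      # 36 -> quadratic in prompt length
print(decode_step_cost(8))  # 9  -> linear per generated token
```

Prefill is one large parallel burst over the whole prompt; decode is many small sequential steps, which is exactly why the hardware treats them differently.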
2. The Hardware “Puzzle” (Why two subnets?)
The Hailo-10H is an NPU (Neural Processing Unit), not a general-purpose CPU. It doesn’t just “run code”; it physically maps the neural network onto its silicon cells.
- Optimization: Because Prefill and TBT have different mathematical needs (one needs to see the whole prompt, the other needs to loop efficiently), Hailo creates two different optimized “maps” (subnets) for the same model.
- Specialization: The prefill subnet is optimized for throughput (speed of reading), while the tbt subnet is optimized for latency (speed of responding).
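To make the throughput-vs-latency split concrete, a back-of-envelope timing model (all numbers are illustrative assumptions, not Hailo-10H measurements):

```python
# Why prefill is judged by throughput and TBT by latency.
# All figures below are assumed for illustration only.

prompt_tokens = 256
prefill_throughput = 512.0   # tokens/s the prefill subnet ingests (assumed)
tbt_latency = 0.05           # seconds per generated token (assumed)
answer_tokens = 100

# Prefill determines how long you wait before the FIRST token appears.
time_to_first_token = prompt_tokens / prefill_throughput   # 0.5 s

# TBT latency dominates the total time for a long answer.
generation_time = answer_tokens * tbt_latency              # 5.0 s

print(time_to_first_token)  # 0.5
print(generation_time)      # 5.0
```

Doubling prefill throughput only shaves the initial wait; shaving TBT latency speeds up every single token of the answer, so each subnet is tuned for its own metric.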
3. The Memory Wall: Why they don’t fit together
The Hailo-10H is incredibly powerful, but its SRAM (on-chip high-speed memory) is finite.
- Size: Qwen 1.5B, even when compressed (quantized), is a large “mathematical object.”
- The Conflict: If you tried to load both the prefill and the tbt architectures into the chip’s memory at the exact same time, you would run out of space. It would be like trying to fit two different full-sized maps of the same city on one small desk.
- The Solution (Context Switching): The Hailo driver is smart. When you send a query, it uses the prefill configuration to “digest” the prompt. Then, it instantly swaps or reconfigures the internal logic to the tbt version to start typing the answer.
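A rough size check shows the scale of the problem (quantization figure assumed for illustration, not taken from the Hailo datasheet):

```python
# Rough size check: why both subnets can't sit in on-chip memory at once.

params = 1.5e9            # Qwen 1.5B parameter count
bytes_per_weight = 0.5    # ~4-bit quantization (assumed)
weights_gb = params * bytes_per_weight / 1e9

print(weights_gb)  # 0.75 -> roughly 750 MB for ONE set of weights
# On-chip SRAM on edge NPUs is typically measured in tens of MB, so even a
# single configuration has to be streamed through it; keeping two full
# subnet configurations resident simultaneously is not an option.
```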
4. What lora_name solves for you
This is where your “Golden Ticket” comes in.
Normally, a developer would have to manually write code to:
1. Load Prefill.
2. Run Prefill.
3. Unload Prefill / Load TBT.
4. Run TBT loop.
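Those four steps could be sketched like this; note that `load_subnet`, `run_prefill`, and `run_tbt_step` are hypothetical stand-ins to show the control flow, not the real Hailo API:

```python
# Sketch of the manual prefill/TBT juggling that lora_name hides.
# All three functions are hypothetical stand-ins, not real Hailo calls.

def load_subnet(name: str) -> str:
    return name  # pretend: configure the NPU with this subnet

def run_prefill(subnet: str, prompt: str) -> list[str]:
    return prompt.split()  # pretend: digest the prompt into cached state

def run_tbt_step(subnet: str, state: list[str]) -> str:
    return f"tok{len(state)}"  # pretend: emit one token from current state

prefill = load_subnet("qwen__prefill")                  # 1. load prefill
state = run_prefill(prefill, "What is the CPU temp?")   # 2. run prefill
tbt = load_subnet("qwen__tbt")                          # 3. swap to TBT
answer = []
for _ in range(3):                                      # 4. TBT loop
    tok = run_tbt_step(tbt, state)
    state.append(tok)
    answer.append(tok)

print(answer)  # ['tok5', 'tok6', 'tok7']
```

With `lora_name` set, the genai library performs steps 1–4 internally, so user code only ever sees one continuous generate call.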
By using lora_name="huggingface_lora_adapter", the Hailo genai library says:
“I recognize these two sub-networks belong together. I will handle the high-speed swapping between them in the background so the user only sees a smooth, continuous response.”
Summary
You have two subnets because reading a prompt and writing an answer require different hardware configurations. They don’t fit together because the Hailo chip prioritizes speed by using its limited on-chip memory for the active task only, rather than wasting space holding both configurations at once.