@jeff.singleton Btw you can load the model with something like this:
from pathlib import Path
from typing import List

from rich.console import Console
from hailo_platform import VDevice, HailoSchedulingAlgorithm
from hailo_genai import LLM  # GenAI package/import path may differ per SDK release

console = Console()

class HailoAgent:
    def __init__(self, hef_path: str, tools_metadata: List[dict]):
        self.hef_path = Path(hef_path)
        self.tools_metadata = tools_metadata
        self.llm = None
        console.print("[yellow]Initializing Qwen 1.5B agent (Hailo-10H)...[/yellow]")
        try:
            params = VDevice.create_params()
            params.scheduling_algorithm = HailoSchedulingAlgorithm.ROUND_ROBIN
            self.vdevice = VDevice(params)
            # Initialization with lora_name (verified via shell)
            self.llm = LLM(
                self.vdevice,
                str(self.hef_path),
                lora_name="huggingface_lora_adapter",
                optimize_memory_on_device=True,
            )
            console.print("[bold green]SUCCESS:[/bold green] Agent hardware layer ready!")
        except Exception as e:
            console.print(f"[red]Initialization error:[/red] {e}")
            raise
It works for me with ORT 5.2.0.
Gemini's explanation of why you have to do it this way with the LoRA adapter:
Here is the technical explanation of the Prefill vs. TBT (Token-By-Token) split and why they can’t simply be one single big block in memory.
1. The Two Phases of “Thinking”
An LLM doesn’t process everything the same way. It operates in two distinct mathematical stages:
- The Prefill Phase (...__prefill): When you send a prompt (e.g., “What is the CPU temp?”), the model has to digest the entire sentence at once. This requires high parallel processing power. The hardware needs to look at every word and understand the context. This phase is “compute-heavy.”
- The TBT Phase (...__tbt / Decode): Once the model understands the prompt, it generates the answer one word (token) at a time. To generate the next word, it only needs to look at the previous words. This phase is “memory-bandwidth heavy” and happens in a loop.
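One toy way to see the asymmetry, counting attention comparisons in pure Python rather than real FLOPs:

```python
# Toy cost model of the two phases (no ML libraries).
# Prefill attends every prompt token to every earlier token at once;
# decode (TBT) attends only the single new token to the context so far.

def prefill_cost(prompt_len: int) -> int:
    # Token i attends to tokens 0..i  ->  1 + 2 + ... + n comparisons.
    return sum(i + 1 for i in range(prompt_len))

def decode_step_cost(context_len: int) -> int:
    # The one new token attends to the whole current context.
    return context_len + 1

prompt_len = 8
total = prefill_cost(prompt_len)            # one big parallel burst
for step in range(4):                       # then a loop, one token at a time
    total += decode_step_cost(prompt_len + step)

print(prefill_cost(8))      # 36 -> quadratic in prompt length
print(decode_step_cost(8))  # 9  -> linear per generated token
```

Prefill is one large parallel burst over the whole prompt; decode is many small sequential steps, which is exactly why the hardware treats them differently.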
2. The Hardware “Puzzle” (Why two subnets?)
The Hailo-10H is an NPU (Neural Processing Unit), not a general-purpose CPU. It doesn’t just “run code”; it physically maps the neural network onto its silicon cells.
- Optimization: Because Prefill and TBT have different mathematical needs (one needs to see the whole prompt, the other needs to loop efficiently), Hailo creates two different optimized “maps” (subnets) for the same model.
- Specialization: The prefill subnet is optimized for throughput (speed of reading), while the tbt subnet is optimized for latency (speed of responding).
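To make the throughput-vs-latency split concrete, a back-of-envelope timing model (all numbers are illustrative assumptions, not Hailo-10H measurements):

```python
# Why prefill is judged by throughput and TBT by latency.
# All figures below are assumed for illustration only.

prompt_tokens = 256
prefill_throughput = 512.0   # tokens/s the prefill subnet ingests (assumed)
tbt_latency = 0.05           # seconds per generated token (assumed)
answer_tokens = 100

# Prefill determines how long you wait before the FIRST token appears.
time_to_first_token = prompt_tokens / prefill_throughput   # 0.5 s

# TBT latency dominates the total time for a long answer.
generation_time = answer_tokens * tbt_latency              # 5.0 s

print(time_to_first_token)  # 0.5
print(generation_time)      # 5.0
```

Doubling prefill throughput only shaves the initial wait; shaving TBT latency speeds up every single token of the answer, so each subnet is tuned for its own metric.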
3. The Memory Wall: Why they don’t fit together
The Hailo-10H is incredibly powerful, but its SRAM (on-chip high-speed memory) is finite.
- Size: Qwen 1.5B, even when compressed (quantized), is a large “mathematical object.”
- The Conflict: If you tried to load both the prefill and the tbt architectures into the chip’s memory at the exact same time, you would run out of space. It would be like trying to fit two different full-sized maps of the same city on one small desk.
- The Solution (Context Switching): The Hailo driver is smart. When you send a query, it uses the prefill configuration to “digest” the prompt. Then, it instantly swaps or reconfigures the internal logic to the tbt version to start typing the answer.
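A rough size check shows the scale of the problem (quantization figure assumed for illustration, not taken from the Hailo datasheet):

```python
# Rough size check: why both subnets can't sit in on-chip memory at once.

params = 1.5e9            # Qwen 1.5B parameter count
bytes_per_weight = 0.5    # ~4-bit quantization (assumed)
weights_gb = params * bytes_per_weight / 1e9

print(weights_gb)  # 0.75 -> roughly 750 MB for ONE set of weights
# On-chip SRAM on edge NPUs is typically measured in tens of MB, so even a
# single configuration has to be streamed through it; keeping two full
# subnet configurations resident simultaneously is not an option.
```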
4. What lora_name solves for you
This is where your “Golden Ticket” comes in.
Normally, a developer would have to manually write code to:
1. Load Prefill.
2. Run Prefill.
3. Unload Prefill / Load TBT.
4. Run TBT loop.
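Those four steps could be sketched like this; note that `load_subnet`, `run_prefill`, and `run_tbt_step` are hypothetical stand-ins to show the control flow, not the real Hailo API:

```python
# Sketch of the manual prefill/TBT juggling that lora_name hides.
# All three functions are hypothetical stand-ins, not real Hailo calls.

def load_subnet(name: str) -> str:
    return name  # pretend: configure the NPU with this subnet

def run_prefill(subnet: str, prompt: str) -> list[str]:
    return prompt.split()  # pretend: digest the prompt into cached state

def run_tbt_step(subnet: str, state: list[str]) -> str:
    return f"tok{len(state)}"  # pretend: emit one token from current state

prefill = load_subnet("qwen__prefill")                  # 1. load prefill
state = run_prefill(prefill, "What is the CPU temp?")   # 2. run prefill
tbt = load_subnet("qwen__tbt")                          # 3. swap to TBT
answer = []
for _ in range(3):                                      # 4. TBT loop
    tok = run_tbt_step(tbt, state)
    state.append(tok)
    answer.append(tok)

print(answer)  # ['tok5', 'tok6', 'tok7']
```

With `lora_name` set, the genai library performs steps 1–4 internally, so user code only ever sees one continuous generate call.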
By using lora_name="huggingface_lora_adapter", the Hailo genai library says:
“I recognize these two sub-networks belong together. I will handle the high-speed swapping between them in the background so the user only sees a smooth, continuous response.”
Summary
You have two subnets because reading a prompt and writing an answer require different hardware configurations. They don’t fit together because the Hailo chip prioritizes speed by using its limited on-chip memory for the active task only, rather than wasting space holding both configurations at once.