Environment
-
Hardware: Raspberry Pi 5 + Raspberry Pi AI HAT+ 2 (Hailo-10H, 40 TOPS)
-
OS: Debian Trixie / Raspberry Pi OS
-
Installed Packages:
-
hailo-h10-all: 5.1.1 -
h10-hailort: 5.1.1 -
h10-hailort-pcie-driver: 5.1.1 -
hailo-gen-ai-model-zoo: 5.1.1 -
hailo-models: 1.0.0-2
-
-
Backend: Official
hailo-ollamaserver running on port 8000 -
Frontend: Open WebUI
Description
There are two critical issues when running the deepseek_r1_distill_qwen:1.5b model from the official Hailo GenAI Model Zoo on the Hailo-10H hardware:
-
Chat Template Corruption & Severe Hallucinations: The model completely fails to parse the chat/reasoning template. The closing
</think>tag is leaked and printed at the very beginning of the assistant’s response. Because the context window gets polluted with its own reasoning tokens, the model loses track of the conversation, outputs broken C++ code templates, and suffers from extreme hallucinations (e.g., translating a popular Polish tongue-twister “Chrząszcz brzmi…” as “I will become cold”, or stating that 50 million Chinese citizens reside in Poland). -
Server Crash on Follow-up Prompts: When sending a second prompt in the same session, the
hailo-ollamabackend attempts to re-allocate or reset the model context (GenerationThread:replacing model to...). At this exact moment, the server process crashes completely and exits to the shell without throwing a proper runtime exception, causing aTransferEncodingError: 400 (Not enough data to satisfy transfer length header)on the client side.
Steps to Reproduce
-
Start the server on the host:
HAILO_MONITOR=1 hailo-ollama > hailo_server.log 2>&1 & -
Connect Open WebUI to the
hailo-ollamabackend (port 8000). -
Pull and select the
deepseek_r1_distill_qwen:1.5bmodel. -
Ask a standard question or a multi-lingual prompt. Notice the corrupted
</think>placement and gibberish response. -
Send a follow-up prompt in the same chat session.
-
The frontend loses connection because the
hailo-ollamaprocess has terminated.
Actual Behavior / Logs
Here is the exact output from hailo_server.log showing the chat log corruption and the sudden crash at the end of the second prompt:
I |2026-06-15 22:01:29 1781553689141208| MyApp:Server running on port 8000
I |2026-06-15 22:02:13 1781553733802447| list_models:got 5 models in store
I |2026-06-15 22:02:13 1781553733802521| list_models:model: deepseek_r1_distill_qwen:1.5b
I |2026-06-15 22:02:13 1781553733802589| list_models:model: llama3.2:3b
I |2026-06-15 22:02:13 1781553733802692| list_models:model: qwen2.5-coder:1.5b
I |2026-06-15 22:02:13 1781553733802720| list_models:model: qwen2.5-instruct:1.5b
I |2026-06-15 22:02:13 1781553733802736| list_models:model: qwen2:1.5b
I |2026-06-15 22:07:48 1781554068259412| handle_completion:Got model /usr/share/hailo-ollama/models/blob/sha256_1129f5f8384e4e45c5890104dc4ec1aee77e800ce1484ddc3aa942399aada425
I |2026-06-15 22:07:48 1781554068259467| GenerationThread:got prompt <|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 15 Jun 2026
<|eot_id|><|start_header_id|>user<|end_header_id|>
Tell me what is the proper way to open file in c++ and fill the variable with file content with stack overflow guard?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Here's an example of the proper ways to open a file, read its Content and store it into a variable while using Stack Overflow Guard in C++:
... [Leaked system template instructions and broken hallucinated response omitted for brevity] ...
If you're manually linking libraries, make sure to include them in your linker command. For example:
- To link against the C++ Standard Library, you'll typically add `-lstdc++` to your compiler command.
- To link against the Windows API, you can use the `/STACKRECLAIMER` option when launching the executable from the IDE or command prompt. This tells the runtime to release the stack space used by the program, allowing it to run indefinitely.
Let me know if you'd have any further questions!
I<|eot_id|><|start_header_id|>user<|end_header_id|>
Now the same please using python!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
I |2026-06-15 22:07:50 1781554070282384| GenerationThread:replacing model to /usr/share/hailo-ollama/models/blob/sha256_1129f5f8384e4e45c5890104dc4ec1aee77e800ce1484ddc3aa942399aada425
marek@rp5:~ $
Expected Behavior
-
The ChatML / DeepSeek reasoning tokens (
<think>and</think>) should be parsed internally by the backend/Open WebUI wrapper instead of being leaked straight into the chat prompt context, which destroys multi-turn coherence. -
The model should remain stable and handle multi-turn conversations without causing a segmentation fault or a silent crash of the
hailo-ollamadaemon during context refreshment (GenerationThread:replacing model).
Additional attachment
Corrupted Chat:
### USER
do you speak polish or english?
### ASSISTANT
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I specialize in helping others learn and improving their abilities. I have a deep understanding of machine learning, computer science, and related fields. I am here to help answer your questions and provide information on various topics.
</think>
### USER
answer my question
### ASSISTANT
Okay, the user greeted me and said "answer your query." I need to figure out what they're asking for since my last response was a general greeting.
Hmm, maybe they didn't specify exactly what they want. They might be looking for help with something specific or just curious about how I can assist them now that I'm here.
I should respond in a friendly way but also open up to their needs because they didn’t provide much detail yet.
</think>
Hello! How can I assist you today? If there's anything specific you'd like me to help with, feel free to ask.
### USER
do you speak polish or english?
### ASSISTANT
Alright, the user is asking whether I speak Polish or English. Let me check my last message.
I just responded to their question by greeting them and explaining my capabilities in both languages.
So yes! I do understand both Polish and English, which means I can assist with a wide range of topics in either language.
</think>
Yes, I understand both Polish and English. I'm here to help with any questions, information, or assistance you need in either language.
### USER
Dokończ zdanie “Chrząszcz brzmi…”
### ASSISTANT
Alright, the user is asking to complete the sentence "Chrząszcz brzmi…." which translates to "I will become cold...". I need to make sure this makes sense in the context of their message.
Considering the tone and intent behind the original greeting, it's safe to proceed with a polite and friendly response. Let me craft a simple and warm concluding phrase that invites further interaction.
</think>
Doksi! To keep the mood positive, I'll keep you updated on everything going on around you. 😊
### USER
Powiedz po polsku ile Chiny mają mieszkańców
### ASSISTANT
Okay, the user wants to know how many Chinese people live in Poland. I need to provide accurate and up-to-date information.
I should check the latest statistics on this topic. According to recent reports, approximately 50 million Chinese citizens reside in Poland. This includes both urban and rural areas.
</think>
Lepsia! According to the most recent data, there are **50 million** Chinese people living in Poland. This number includes both urban and rural residents, reflecting the country's significant population in the region. 😊
If you'd like more detailed information or context, feel free to ask!