I encountered a critical issue while running a stress test on the Hailo-8L processor with a Raspberry Pi 5 using the 57.hef AI model. The error message I received is as follows:
[HailoRT] [critical] Got health monitor CPU ECC fatal event. memory_bitmap=4096
Test Details:
Hardware:
Hailo-8L
Raspberry Pi 5
AI Model: 57.hef (default model provided by Hailo)
Command Used:
hailortcli run /var/hailo_integration_tool/thermal/HAILO8L/57.hef -t 172800 --measure-temp
Questions:
What does the memory_bitmap=4096 indicate in the context of the ECC fatal event?
Is this issue indicative of a hardware fault, or could it be related to software/firmware?
Are there recommended steps to debug or resolve this issue?
Has anyone else encountered similar errors during stress tests, and if so, how were they mitigated?
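For reference while waiting on an answer, here is a minimal sketch of how the reported value can be broken down into individual bits. It assumes memory_bitmap is a plain bit mask, as the name suggests; what each bit position maps to inside the device is not documented in this thread. The helper names (set_bits, decode_ecc_line) are made up for illustration and are not part of HailoRT.

```python
import re
from typing import List, Optional

# Matches the critical line quoted in this post and captures the decimal bitmap.
ECC_LINE = re.compile(r"CPU ECC fatal event\. memory_bitmap=(\d+)")


def set_bits(bitmap: int) -> List[int]:
    """Return the 0-based positions of all bits set in `bitmap`."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def decode_ecc_line(line: str) -> Optional[List[int]]:
    """Extract memory_bitmap from a log line and list its set bits.

    Returns None if the line is not an ECC fatal event message.
    Assumption: the value is a bit mask; the meaning of each bit is unknown here.
    """
    match = ECC_LINE.search(line)
    if match is None:
        return None
    return set_bits(int(match.group(1)))


line = "[HailoRT] [critical] Got health monitor CPU ECC fatal event. memory_bitmap=4096"
print(decode_ecc_line(line))  # [12], because 4096 == 1 << 12
```

So under that assumption the event flags exactly one entry, bit 12. Which memory that corresponds to is something only Hailo can confirm.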
I would greatly appreciate any guidance or insights from the community regarding this issue. If further details are needed, I am happy to provide them.
I’ve also experienced this problem when the system overheats to temperatures around 103-104 degrees Celsius. I’m using the Hailo-8L with a Raspberry Pi 5. Has anyone found a solution to this issue? @Omer, can you suggest a solution for this?
Do you have a fan running? What is the CPU temperature when this occurs? For the record, I’m running object detection continuously for months now (surveillance) on a Rpi5 + M2 Hailo8L and I’ve never seen this. But my CPU load is quite low, since all the object detection is being done by the Hailo. My CPU temp is around 50 degrees.
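If it helps to put a number on the Pi side, this is a rough sketch for reading the Raspberry Pi CPU temperature from Python. It assumes the vcgencmd tool that ships with Raspberry Pi OS is on the PATH and prints its usual temp=50.0'C format. The function names are hypothetical, and this reads the Pi's SoC only, not the Hailo.

```python
import re
import subprocess
from typing import Optional


def parse_vcgencmd_temp(output: str) -> Optional[float]:
    """Parse output such as "temp=50.0'C" and return degrees Celsius."""
    match = re.search(r"temp=([0-9.]+)'C", output)
    return float(match.group(1)) if match else None


def read_cpu_temp() -> Optional[float]:
    """Run `vcgencmd measure_temp`; return None if the tool is missing or fails."""
    try:
        result = subprocess.run(
            ["vcgencmd", "measure_temp"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_vcgencmd_temp(result.stdout)


if __name__ == "__main__":
    print(read_cpu_temp())
```

Logging this alongside the --measure-temp output from hailortcli would show whether the Pi itself is anywhere near its limits when the error appears.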
Hi @rogojin, we are running a stress test on the Hailo using:
hailortcli run /var/hailo_integration_tool/thermal/HAILO8L/57.hef -t 172800 --measure-temp
This makes the Hailo reach 104 degrees Celsius in under 5 minutes. The Raspberry Pi 5 had its official active cooler installed, but neither a heatsink nor a fan was installed on the Hailo, as our intention was to stress the Hailo at high temperatures and analyse its throttling mechanism.
Throttling for the orange zone started at 103.7 degrees Celsius, but unfortunately we are also getting this message:
[HailoRT] [critical] Got health monitor CPU ECC fatal event. memory_bitmap=4096
It would be really helpful if someone can explain what this issue is all about and what caused it.
Hi @prasaanthg,
I’m now running on the 26TOPS HAT, for over 35 minutes, and while I can definitely feel that the device is hot, I’m not getting a failure.
The ambient temperature in my room is about 21C.
I’ve checked the internal thermal report and all checks passed.
Are you using the embedded HAT or one of the early ones, with the M.2 plugged in?
The ECC fatal event occurs every time I run the stress test using the following command:
hailortcli run /var/hailo_integration_tool/thermal/HAILO8L/57.hef -t 172800 --measure-temp
Are you testing the hardware using the same command?
The reason I’m running 57.hef with hailortcli run instead of the integration tool is that the stress test in the integration tool fails as soon as the orange zone is reached, so I can’t stress the throttling any further.
Can you please try to replicate it or at least give an update on what the warning is all about and what causes it?
@nadav, can you also say what maximum temperature you are getting?
Did it start to throttle in your case?
What method are you using to stress the Hailo?
Are you using any heatsinks for the Hailo?
Thank you Prasaanth,
One thing that is hard for me to see: have you placed the thermal pad on the bottom part to increase the contact surface between the Hailo and the HAT?
In general, I would also say that the “57” is an artificial network. As such, it is only designed to heat up the device and doesn’t perform an actual task. It is harsher than an actual load that would run on the Pi.
You are completely right, we’re working hard to complement our SW with good docs. We have added descriptions for errors, but this was somehow not one of them.