[HailoRT] Got CPU ECC Fatal Event During Stress Test on Hailo-8L with Raspberry Pi 5

Hello Hailo Community,

I encountered a critical issue while running a stress test on the Hailo-8L processor with a Raspberry Pi 5 using the 57.hef AI model. The error message I received is as follows:

[HailoRT] [critical] Got health monitor CPU ECC fatal event. memory_bitmap=4096

Test Details:

  • Hardware:
    • Hailo-8L
    • Raspberry Pi 5
  • AI Model: 57.hef (default model provided by Hailo)
  • Command Used:
    hailortcli run /var/hailo_integration_tool/thermal/HAILO8L/57.hef -t 172800 --measure-temp
    

Questions:

  1. What does the memory_bitmap=4096 indicate in the context of the ECC fatal event?
  2. Is this issue indicative of a hardware fault, or could it be related to software/firmware?
  3. Are there recommended steps to debug or resolve this issue?
  4. Has anyone else encountered similar errors during stress tests, and if so, how were they mitigated?

I would greatly appreciate any guidance or insights from the community regarding this issue. If further details are needed, I am happy to provide them.

Thank you for your support!


@omria, any update on this issue?

@nina-vilela @Nadav @Omer, can anyone please give an update on this? It has been a week.

I’ve also experienced this problem when the system overheats to around 103-104 degrees Celsius. I’m using the Hailo-8L with a Raspberry Pi 5. Has anyone found a solution to this issue? @Omer, can you suggest a solution?

Hi @prasaanthg, @manikantaj, we are checking on this.


Do you have a fan running? What is the CPU temperature when this occurs? For the record, I’m running object detection continuously for months now (surveillance) on a Rpi5 + M2 Hailo8L and I’ve never seen this. But my CPU load is quite low, since all the object detection is being done by the Hailo. My CPU temp is around 50 degrees.

Hi @rogojin, we are running a stress test on the Hailo using:

hailortcli run /var/hailo_integration_tool/thermal/HAILO8L/57.hef -t 172800 --measure-temp

This makes the Hailo reach 104 degrees Celsius in under 5 minutes. The Raspberry Pi 5 had its official active cooler installed, but neither a heatsink nor a fan was installed on the Hailo, as our intention was to stress the Hailo at high temperatures and analyse its throttling mechanism.
Throttling for the orange zone started at 103.7 degrees Celsius, but unfortunately we are also getting this message:

[HailoRT] [critical] Got health monitor CPU ECC fatal event. memory_bitmap=4096

It would be really helpful if someone can explain what this issue is all about and what caused it.
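
For anyone collecting these events from saved stress-test output, here is a minimal sketch that pulls the `memory_bitmap` values out of captured text. It assumes the output was saved as plain text and that the message is worded exactly as quoted above; the function name `find_ecc_events` is made up for illustration and is not part of HailoRT.

```python
import re

# Pattern copied from the message quoted in this thread. Other HailoRT
# versions may word the message differently (assumption).
ECC_PATTERN = re.compile(
    r"Got health monitor CPU ECC fatal event\. memory_bitmap=(\d+)"
)


def find_ecc_events(log_text):
    """Return the memory_bitmap value of every ECC fatal event in log_text."""
    return [int(value) for value in ECC_PATTERN.findall(log_text)]


sample = (
    "[HailoRT] [critical] Got health monitor CPU ECC fatal event. "
    "memory_bitmap=4096\n"
)
print(find_ecc_events(sample))
```

Counting the returned values per run gives a quick way to compare how often the event fires across different cooling setups.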


@Nadav any updates on this?

Hi @prasaanthg,
I’m now running on the 26TOPS HAT, for over 35 minutes, and while I can definitely feel that the device is hot, I’m not getting a failure.

The ambient temperature in my room is about 21C.

I’ve checked the internal thermal report and all checks passed.

Are you using the embedded HAT or one of the early ones, with the M.2 plugged in?

Thanks for the update, @Nadav. I’m using the Raspberry Pi’s official M.2 AI HAT that has the Hailo-8L.

The ECC fatal event occurs every time I run the stress test using the following command:

hailortcli run /var/hailo_integration_tool/thermal/HAILO8L/57.hef -t 172800 --measure-temp

Are you testing the hardware using the same command?
The reason I’m running 57.hef with hailortcli run instead of the integration tool is that the integration tool’s stress test fails as soon as the orange zone is reached, so I can’t stress the device any further into throttling.

Can you please try to replicate it or at least give an update on what the warning is all about and what causes it?

@nadav, can you also say what maximum temperature you are getting?
Did it start to throttle in your case?
What method are you using to stress the Hailo?
Are you using any heatsinks on the Hailo?

Average is 96C, no throttle, used this command:
hailortcli run 57.hef -t 9000 --measure-temp

No heatsink applied

@Nadav, are you using the same Raspberry Pi AI Kit we’re using?

I believe I am. Can you send me a picture of your setup?


@Nadav, please refer to the attached pictures, which show the setup we’re using as well as the error we’re facing:



Thank you, Prasaanth.
One thing that is hard for me to see: have you placed the thermal pad on the bottom part to increase the contact surface between the Hailo and the HAT?

In general, I would also say that the “57” is an artificial network. As such, it is designed only to heat up the device and doesn’t perform an actual task. It is harsher than an actual load that would run on the Pi.

@Nadav, the thermal pad has been there since the time of purchase, and its picture is attached below for your reference.

I understand that 57.hef is just a model for thermal testing, but could you at least explain the error for our understanding:

[HailoRT] [critical] Got health monitor CPU ECC fatal event. memory_bitmap=4096

It seems like the error correction code check has failed, but can you explain it in detail, along with what the bitmap means?
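
As far as the number itself goes, 4096 is 2^12, so if `memory_bitmap` is an ordinary bitmask (an assumption based only on its name, not confirmed anywhere in this thread), exactly one bit is set: bit 12. A minimal sketch of the decoding follows; which memory block bit 12 refers to is not documented here, and `set_bits` is a made-up helper name.

```python
def set_bits(bitmap):
    """Return the zero-based positions of the bits set in bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


bitmap = 4096  # value reported in the ECC fatal event message
# Assumption: each set bit flags one memory block; the mapping is unknown.
print(hex(bitmap), set_bits(bitmap))  # 0x1000 [12]
```

If different runs report different values, the same helper shows whether the flagged bit is always the same one or varies between runs.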

@Nadav, it would be really helpful for everyone if you could provide documentation for the error log.

You are completely right, we’re working hard to complement our SW with good docs. We have added descriptions for errors, but this was somehow not one of them.