I am using Ubuntu 20.04 with the 5.15.0-67 kernel. I am using version 4.16.0 of the PCIe driver and HailoRT. After running an installer for our software, I am getting this error when I try to run an AI model:
[HailoRT][error] CHECK failed - MD5 validation of internal response failed.
Thanks for responding! I get this when I run that command this time:
Executing on device: 0000:03:00.0
Identifying board
Control Protocol Version: 2
Firmware version: 4.16.0 (release,app,extended context switch buffer)
Logger version: 0
Board Name: Hailo-8
Device architecture: HAILO8
Serial Number: HLLWM2B225100068
Part Number: HM218B1C2FAE
Product Name: HAILO-8 AI ACC M.2 M KEY MODULE EXT TEMP
The only consistent issue I am getting is a timeout in pyhailort, where it says ‘hailort has failed because a timeout has occurred’. Sometimes the AI is stable and runs, but other times I get a timeout and this happens.
A bit more information: when I try to install the dependencies for the 4.16.0 PCIe driver and HailoRT using apt, it reports dependency problems with libelf1 and libelf-dev. I have to downgrade both manually to avoid errors when using dpkg to install HailoRT, the PCIe driver and their dependencies. Is it worth upgrading the apt version of HailoRT, the PCIe driver and the Python runtime to avoid the need to do this?
The main issue that recurs is [HailoRT][error] HAILO_FW_CONTROL failed with errno:19. The error messages that follow tend to be [HailoRT][error] CHECK_SUCCESS failed with status=HAILO_FW_CONTROL_FAILURE(18) - failed to send fw control. Apologies for the spam; I've been debugging all morning.
First off, I’d recommend upgrading to a newer version of HailoRT - we’ve made significant improvements to the software that make it much more stable.
If upgrading isn’t an option for you right now, try completely removing all the current installations and packages, then reinstall everything without DKMS. This workaround often resolves the issue.
If you’re still running into problems after that, we’ll need to take a closer look at the dmesg logs and HailoRT logs when the issue occurs. You’d need to run it while monitoring the logs, or you can send me both the logs and the dmesg output related to Hailo so I can help troubleshoot further.
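To make the log request above easier to act on, here is a minimal sketch that pulls the Hailo-related lines out of a saved dmesg dump. It assumes the dump was written to a file first (for example with `sudo dmesg > dmesg.txt`); the function name `filter_hailo_lines` and the plain substring match on "hailo" are illustrative choices, not anything HailoRT provides.

```python
import sys


def filter_hailo_lines(log_text, keyword="hailo"):
    """Return the lines of log_text that mention keyword, ignoring case."""
    needle = keyword.lower()
    return [line for line in log_text.splitlines() if needle in line.lower()]


# Usage (assumed workflow): python3 filter_dmesg.py dmesg.txt
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as handle:
        for line in filter_hailo_lines(handle.read()):
            print(line)
```

A substring match will also catch unrelated lines that happen to contain the keyword, so the result is a starting point for reading the log rather than a definitive extract.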
I have done a fresh install of HailoRT 4.16.0 and the corresponding PCIe driver. We had no reason to believe 4.16.0 was unstable until recently. There have been no changes to our AI model and no changes to our codebase that runs the Hailo runtime since July 2024.
With 4.16.0, I get this error from HailoRT whilst running the application with Python 3.8:
One thing I have noticed is that setting pci=nomsi and pcie_aspm=off in the GRUB configuration (and then running sudo update-grub) appears to drastically increase stability, and these errors happen at a much lower frequency. pci=nomsi appears to be the most important factor in the improvement.
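Since GRUB changes only take effect after a reboot, it can help to confirm the parameters actually reached the running kernel. The sketch below reads /proc/cmdline and reports which of the requested parameters are absent. It uses the spellings from the kernel's parameter documentation (`pci=nomsi`, `pcie_aspm=off`); the function name `missing_kernel_params` is hypothetical, and the check is an exact token match, so a combined form such as `pci=nomsi,noaer` would be reported as missing.

```python
import os


def missing_kernel_params(cmdline, wanted=("pci=nomsi", "pcie_aspm=off")):
    """Return the wanted parameters that do not appear as tokens in cmdline."""
    active = set(cmdline.split())
    return [param for param in wanted if param not in active]


# /proc/cmdline holds the command line the running kernel was booted with.
if __name__ == "__main__" and os.path.exists("/proc/cmdline"):
    with open("/proc/cmdline") as handle:
        missing = missing_kernel_params(handle.read())
    if missing:
        print("Not active on this boot:", ", ".join(missing))
    else:
        print("All requested parameters are active.")
```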
I have done some more tests with a fresh install of 4.20.0 and I am getting the same behaviour in my application (AI inference crash), but the output from dmesg etc. is slightly different this time. The HailoRT exception in Python (3.8) is similar to before:
HAILO_FW_CONTROL failed with 19.
The specific HailoRT error is CHECK_SUCCESS failed with status=HAILO_DRIVER_OPERATION_FAILED(36)
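To see which failure dominates across a long run, the status codes can be tallied from a saved log. This is a rough sketch that assumes the log lines follow the `status=NAME(code)` shape quoted in this thread; the log file path is passed as an argument because its location is not stated here, and `count_statuses` is an illustrative name.

```python
import re
import sys
from collections import Counter

# Matches fragments such as: status=HAILO_FW_CONTROL_FAILURE(18)
STATUS_PATTERN = re.compile(r"status=(HAILO_\w+)\((\d+)\)")


def count_statuses(log_text):
    """Count each (status name, numeric code) pair found in log_text."""
    return Counter(
        (name, int(code)) for name, code in STATUS_PATTERN.findall(log_text)
    )


# Usage (assumed workflow): python3 count_statuses.py <path to saved log>
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as handle:
        for (name, code), total in count_statuses(handle.read()).most_common():
            print(f"{name}({code}): {total}")
```

Comparing the tallies with and without the kernel parameters set would give a more concrete measure of "more stable" than impressions from individual runs.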
dpkg -l | egrep 'hailort|hailofw|hailort-pcie-driver': major and minor versions are the same. Both hailort and hailort-pcie-driver are 4.20.0.
modinfo hailo_pci | egrep 'version|vermagic' returns:
version: 4.20.0
srcversion: (omitted; a value I can’t copy from the Linux machine, sorry)
vermagic: 5.15.0-67-generic SMP mod_unload modversions
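The vermagic line above is worth comparing against the running kernel, since a module built for a different kernel release is one thing a reinstall can leave behind. A minimal sketch, assuming the modinfo output has the `vermagic: <release> ...` shape shown here; both function names are hypothetical, and the sample text is copied from this post rather than read from the system.

```python
import platform
import re


def vermagic_kernel(modinfo_text):
    """Return the kernel release named on the vermagic line, or None."""
    match = re.search(r"^vermagic:\s+(\S+)", modinfo_text, re.MULTILINE)
    return match.group(1) if match else None


def built_for_running_kernel(modinfo_text, running_release=None):
    """True when the module's vermagic release equals the running kernel's."""
    if running_release is None:
        running_release = platform.release()  # same value as `uname -r`
    return vermagic_kernel(modinfo_text) == running_release


if __name__ == "__main__":
    sample = "version: 4.20.0\nvermagic: 5.15.0-67-generic SMP mod_unload modversions\n"
    print("module built for:", vermagic_kernel(sample))
    print("matches running kernel:", built_for_running_kernel(sample))
```

On the affected machine, the text from `modinfo hailo_pci` would replace the sample string.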
ls -l /dev/hailo* returns:
crw-rw-rw- 1 root root 509, 0 Aug 1 10:40 /dev/hailo
hailortcli fw-control identify returns:
Executing on device: 0000:03:00.0
Identifying board:
Control Protocol Version: 2
Firmware version: 4.20.0 (release, app, extended context switch buffer)
Logger Version: 0
Board Name: Hailo-8
Device Architecture: HAILO8
Serial Number: HLLWM2B2251000066
Part Number: HM218B1C2FAE
Product Name: HAILO-8 AI ACC M.2 M KEY MODULE EXT TEMP
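The driver and firmware versions reported above can also be compared in code rather than by eye. The sketch below extracts `X.Y.Z` from a labelled line and compares major and minor only, mirroring the "major and minor the same" check mentioned earlier; `extract_version` and `same_major_minor` are illustrative helpers, and the sample strings are taken from the output in this post.

```python
import re


def extract_version(text, label):
    """Return (major, minor, patch) from the first line starting 'label: X.Y.Z'."""
    pattern = re.compile(
        rf"^[ \t]*{re.escape(label)}:\s*(\d+)\.(\d+)\.(\d+)",
        re.MULTILINE | re.IGNORECASE,
    )
    match = pattern.search(text)
    return tuple(int(part) for part in match.groups()) if match else None


def same_major_minor(first, second):
    """True when both versions were found and agree on major and minor."""
    return first is not None and second is not None and first[:2] == second[:2]


if __name__ == "__main__":
    modinfo_text = "version: 4.20.0\nvermagic: 5.15.0-67-generic SMP mod_unload modversions\n"
    identify_text = (
        "Control Protocol Version: 2\n"
        "Firmware version: 4.20.0 (release, app, extended context switch buffer)\n"
    )
    driver = extract_version(modinfo_text, "version")
    firmware = extract_version(identify_text, "Firmware version")
    print("driver:", driver, "firmware:", firmware)
    print("major/minor match:", same_major_minor(driver, firmware))
```

Anchoring the label at the start of the line keeps `srcversion:` and `Control Protocol Version:` from being mistaken for the driver version.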
Setting pcie_aspm=off and pci=nomsi in GRUB does increase stability slightly, but not by enough. After some testing, I am still having issues with these settings, so they do not solve the problem.