HailoRT error: MD5 validation of control response failed

Hello,

I am using Ubuntu 20.04 with the 5.15.0-67 kernel. I am using version 4.16.0 of the PCIe driver and HailoRT. After running an installer for our software, I am getting this error when I try to run an AI model:

[HailoRT][error] CHECK failed - MD5 validation of internal response failed.

Any help would be appreciated! Thanks.

Hey @Rory_Cutler,

Welcome to the Hailo Community!

This sounds like a version mismatch with your driver. Can you try running hailortcli fw-control identify to check what’s going on?

That should help us figure out what’s happening.
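If it helps, the comparison can be done mechanically once the version strings are captured as text. This is only a rough sketch: it assumes the identify output has been saved to a string, and that the driver and library versions were obtained separately (for example from modinfo hailo_pci and dpkg -l). The helper names are hypothetical and are not part of HailoRT.

```python
import re

def firmware_version(identify_output):
    """Return 'X.Y.Z' from the 'Firmware version:' line, or None if absent."""
    match = re.search(r"Firmware version:\s*(\d+\.\d+\.\d+)",
                      identify_output, re.IGNORECASE)
    return match.group(1) if match else None

def versions_match(*versions):
    """True only when every version string is present and identical."""
    return None not in versions and len(set(versions)) == 1

# Sample text in the shape printed by `hailortcli fw-control identify`.
sample = """Control Protocol Version: 2
Firmware version: 4.16.0 (release,app,extended context switch buffer)
"""

fw = firmware_version(sample)
# The two literal versions stand in for the driver and library versions.
print(fw, versions_match(fw, "4.16.0", "4.16.0"))
```

If any of the three differs, reinstalling the matching set of packages is the first thing to try.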

Thanks for responding! I get this when I run that command this time:

Executing on device: 0000:03:00.0
Identifying board
Control Protocol Version: 2
Firmware version: 4.16.0 (release,app,extended context switch buffer)
Logger version: 0
Board Name: Hailo-8
Device architecture: HAILO8
Serial Number: HLLWM2B225100068
Part Number: HM218B1C2FAE
Product Name: HAILO-8 AI ACC M.2 M KEY MODULE EXT TEMP

The only consistent issue I am getting is a timeout in pyhailort, which reports ‘hailort has failed because a timeout has occurred’. Sometimes the AI model is stable and runs, but other times it times out with this error.

A bit more information: when I try to install the dependencies for the 4.16.0 PCIe driver and HailoRT using apt, it reports dependency problems with libelf1 and libelf-dev. I have to downgrade both manually to avoid errors when using dpkg to install HailoRT, the PCIe driver and their dependencies. Is it worth upgrading the apt version of HailoRT, the PCIe driver and the Python runtime to avoid the need to do this?

The main issue that recurs is [HailoRT][error] HAILO_FW_CONTROL failed with errno:19. The error messages that follow tend to be [HailoRT][error] CHECK_SUCCESS failed with status=HAILO_FW_CONTROL_FAILURE(18) - failed to send fw control. Apologies for the spam; I have been debugging all morning.

Hey @Rory_Cutler,

First off, I’d recommend upgrading to a newer version of HailoRT - we’ve made significant improvements to the software that make it much more stable.

If upgrading isn’t an option for you right now, try completely removing all the current installations and packages, then reinstall everything without DKMS. This workaround often resolves the issue.

If you’re still running into problems after that, we’ll need to take a closer look at the dmesg logs and HailoRT logs when the issue occurs. You’d need to run it while monitoring the logs, or you can send me both the logs and the dmesg output related to Hailo so I can help troubleshoot further.

Let me know how it goes!

Hello,

Thanks for responding!

I have done a fresh install of HailoRT 4.16.0 and the corresponding PCIe driver. We had no reason to believe 4.16.0 was unstable until recently. There have been no changes to our AI model and no changes to our codebase that runs the Hailo runtime since July 2024.

With 4.16.0, I get this error from HailoRT whilst running the application with Python 3.8:

Here is the output from dmesg | grep hailo, also with 4.16.0:

One thing I have noticed is that setting pci=nomsi and pcie_aspm=off in the GRUB configuration (and then running sudo update-grub) appears to drastically increase stability: these errors happen at a much lower frequency. pci=nomsi appears to be the most important factor.
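Since a misspelled kernel parameter is silently ignored, it may be worth confirming that both options actually reached the running kernel. The sketch below assumes the intended parameters are pci=nomsi and pcie_aspm=off and reads /proc/cmdline; has_param is a hypothetical helper, not a system tool.

```python
def has_param(cmdline, key, value):
    """True if `key=...` on the kernel command line includes `value`.

    Handles comma-separated option lists such as pci=noaer,nomsi.
    """
    for token in cmdline.split():
        name, _, raw = token.partition("=")
        if name == key and value in raw.split(","):
            return True
    return False

if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            active = handle.read()
    except OSError:
        active = ""  # /proc is unavailable (not Linux, or restricted)
    print("pci=nomsi active:    ", has_param(active, "pci", "nomsi"))
    print("pcie_aspm=off active:", has_param(active, "pcie_aspm", "off"))
```

Both lines should print True after update-grub and a reboot; a False usually points to a typo in /etc/default/grub or to a reboot that has not happened yet.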

I have done some more tests with a fresh install of 4.20.0 and I am getting the same behaviour in my application (AI inference crash), but the dmesg output and other responses are slightly different this time. The HailoRT exception in Python (3.8) is similar to before:

HAILO_FW_CONTROL failed with 19.
The specific HailoRT error is CHECK_SUCCESS failed with status=HAILO_DRIVER_OPERATION_FAILED(36)

  1. dpkg -l | egrep 'hailort|hailofw|hailort-pcie-driver': the major and minor versions are the same. Both hailort and hailort-pcie-driver are 4.20.0.

  2. sudo dkms status | grep hailo_pci returns 'hailo_pci, 4.20.0, 5.15.0-67-generic x86_64: installed'.

modinfo hailo_pci | egrep 'version|vermagic' returns
version: 4.20.0
srcversion: IGNORE THIS a number I can’t copy from linux (sorry)
vermagic: 5.15.0-67-generic SMP mod_unload modversions
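The vermagic line is the part worth checking against the running kernel, because a module built for a different kernel release will not load cleanly. A minimal sketch of that comparison follows; it assumes the modinfo output has been captured as text, and the helper names are made up for illustration.

```python
import platform

def modinfo_field(modinfo_output, field):
    """Return the value of an exact `field:` line from modinfo output, or None."""
    for line in modinfo_output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == field:
            return value.strip()
    return None

def built_for_kernel(modinfo_output, kernel_release):
    """True if the first vermagic token equals the given kernel release."""
    vermagic = modinfo_field(modinfo_output, "vermagic")
    return bool(vermagic) and vermagic.split()[0] == kernel_release

sample = """version: 4.20.0
vermagic: 5.15.0-67-generic SMP mod_unload modversions
"""

print(built_for_kernel(sample, "5.15.0-67-generic"))
# On the target machine, compare against the live kernel instead:
print("running kernel:", platform.release())
```

In this thread the two agree (5.15.0-67-generic), so a kernel/module mismatch looks unlikely to be the cause.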

ls -l /dev/hailo* returns:

crw-rw-rw- 1 root root 509, 0 Aug 1 10:40 /dev/hailo

  1. lspci -nn | grep -i hailo returns:
    03:00.0 Co-processor [0b40]: Hailo Technologies Ltd. Hailo-8 AI Processor [1e60:2864] (rev 01)

sudo dmesg -T | grep hailo returns:

[ 13.664054] hailo_pci: loading out-of-tree module taints kernel.
[ 13.664086] hailo_pci: module verification failed: signature and/or required key missing - tainting kernel
[ 13.667952] hailo: Init module. driver version 4.20.0
[ 13.668175] hailo 0000:03:00.0: Probing on: 1e60:2864...
[ 13.668177] hailo 0000:03:00.0: Probing: Allocate memory for device extension, 13184
[ 13.668187] hailo 0000:03:00.0: enabling device (0000 -> 0002)
[ 13.668504] hailo 0000:03:00.0: Probing: Device enabled
[ 13.668527] hailo 0000:03:00.0: Probing: mapped bar 0 - 0000000012b2a4ed 16384
[ 13.668534] hailo 0000:03:00.0: Probing: mapped bar 2 - 0000000052757565 4096
[ 13.668539] hailo 0000:03:00.0: Probing: mapped bar 4 - 000000002a23d819 16384
[ 13.668543] hailo 0000:03:00.0: Probing: Setting max_desc_page_size to 4096, (page_size=4096)
[ 13.668606] hailo 0000:03:00.0: Probing: Enabled 64 bit dma
[ 13.668608] hailo 0000:03:00.0: Probing: Using userspace allocated vdma buffers
[ 13.668611] hailo 0000:03:00.0: Disabling ASPM L0s
[ 13.668613] hailo 0000:03:00.0: Successfully disabled ASPM L0s
[ 13.669627] hailo 0000:03:00.0: Writing file hailo/hailo8_fw.bin
[ 13.703122] hailo 0000:03:00.0: File hailo/hailo8_fw.bin written successfully
[ 13.703125] hailo 0000:03:00.0: Writing file hailo/hailo8_board_cfg.bin
[ 13.703141] Failed to write file hailo/hailo8_board_cfg.bin
[ 13.703142] hailo 0000:03:00.0: File hailo/hailo8_board_cfg.bin written successfully
[ 13.703143] hailo 0000:03:00.0: Writing file hailo/hailo8_fw_cfg.bin
[ 13.703149] Failed to write file hailo/hailo8_fw_cfg.bin
[ 13.703149] hailo 0000:03:00.0: File hailo/hailo8_fw_cfg.bin written successfully
[ 13.830085] hailo 0000:03:00.0: NNC Firmware loaded successfully
[ 13.830089] hailo 0000:03:00.0: FW loaded, took 160 ms
[ 13.851394] hailo 0000:03:00.0: Probing: Added board 1e60-2864, /dev/hailo0
[ 81.207852] hailo 0000:03:00.0: Failed writing fw control to pcie
[ 81.257945] hailo 0000:03:00.0: Failed writing fw control to pcie
[ 81.292216] hailo 0000:03:00.0: hailo_nnc_driver_down, timeout waiting for shutdown response (timeout_ms=5)
[ 81.378612] hailo 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Transmitter ID)
[ 81.378615] hailo 0000:03:00.0: device [1e60:2864] error status/mask=000010c1/00006000
[ 81.378616] hailo 0000:03:00.0: [ 0] RxErr
[ 81.378617] hailo 0000:03:00.0: [ 6] BadTLP
[ 81.378618] hailo 0000:03:00.0: [ 7] BadDLLP
[ 81.378619] hailo 0000:03:00.0: [12] Timeout
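The status/mask pair in the PCIe Bus Error lines can be decoded by hand, which confirms that the four named flags are exactly the bits set in 000010c1. The sketch below uses the bit positions of the PCIe AER Correctable Error Status register; decode_correctable is a hypothetical helper. It only names the bits and says nothing about the root cause.

```python
# Bit positions in the PCIe AER Correctable Error Status/Mask registers.
CORRECTABLE_BITS = {
    0: "Receiver Error (RxErr)",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

def decode_correctable(status_hex, mask_hex="0"):
    """Return (reported, masked) error names for hex status and mask strings."""
    status, mask = int(status_hex, 16), int(mask_hex, 16)
    reported = [name for bit, name in CORRECTABLE_BITS.items()
                if (status >> bit) & 1]
    masked = [name for bit, name in CORRECTABLE_BITS.items()
              if (mask >> bit) & 1]
    return reported, masked

# Values taken from the dmesg line "error status/mask=000010c1/00006000".
reported, masked = decode_correctable("000010c1", "00006000")
print("reported:", reported)
print("masked:  ", masked)
```

All four reported bits are link-level (physical and data link layer) errors, and they appear right after the failed firmware control writes.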

hailortcli scan returns:
Hailo Devices:
[-] Device: 0000:03:00.0

hailortcli fw-control identify returns:
Executing on device: 0000:03:00.0
Identifying board:
Control Protocol Version: 2
Firmware version: 4.20.0 (release, app, extended context switch buffer)
Logger Version: 0
Board Name: Hailo-8
Device Architecture: HAILO8
Serial Number: HLLWM2B2251000066
Part Number: HM218B1C2FAE
Product Name: HAILO-8 AI ACC M.2 M KEY MODULE EXT TEMP

  1. Setting pcie_aspm=off and pci=nomsi in the GRUB configuration does increase stability slightly, but not by enough. After some testing, I am still having issues with these settings, so they do not solve the problem.