Workflow Issues with simple CNN model

Hi there,

I’ve been trying to get into the Hailo-8 workflow and started from a very simple CNN model:

the SuperResolution model (super-resolution-10) from the ONNX model zoo

Even though such things are not strictly part of my job description, I have always managed to get CNN models from any framework and of any model type running on different runtime engines. But with the Hailo tools I struggle and fail, and let’s just say: I’m not happy with the tools I got.

I’m using the DFC docker image (hailo_ai_sw_suite_2024-10). I managed to parse the model and save it as a HAR file without any complaints. In the optimize step, however, I receive odd TensorFlow messages:

I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at linux/Documentation/ABI/testing/sysfs-bus-pci at v6.0 · torvalds/linux · GitHub
I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype int32
[[{{node Placeholder/_0}}]]
I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype float and shape [1,224,224,1]
[[{{node Placeholder/_0}}]]

and I don’t know whether these are relevant or not. I do end up with a quantized model file, though, which I can compile and map to the Hailo-8 device.

When I try to use the model for inference on the physical device, I receive error messages:

[HailoRT] [error] CHECK failed - Memory size of vstream super-resolution-10/input_layer1 does not match the frame count! (Expected 1605632, got 7168)
[HailoRT] [error] CHECK_SUCCESS failed with status=HAILO_INVALID_ARGUMENT(2)

hailo_platform.pyhailort.pyhailort.HailoRTInvalidArgumentException: Exception encountered when calling layer 'hw_inference_model' (type HWInferenceModel).

Invalid argument. See hailort.log for more information

Call arguments received by layer 'hw_inference_model' (type HWInferenceModel):
• inputs=tf.Tensor(shape=(8, 224, 1), dtype=float32)

The input tensor shape was supposed to be (1, 224, 224, 1) or (1, 1, 224, 224) with dtype uint8. I have no idea where to look now, or at which stage I messed things up.

I’ve read through the Hailo Dataflow Compiler User Guide and the Tutorial notebooks stored on the docker image and pretty much did what was done there.

Any useful directions on how to debug the consistency of each step, and what’s important to check to make sure everything is on the right track?

Thanks!

Sorry you ran into issues compiling the model. From the error message, it is possible that there is an installation issue with the Hailo SW suite docker related to the versions of CUDA or cuDNN installed on your system.
Please make sure to apply the instructions to install nvidia-docker2 listed here: 2024-10 | Hailo

To narrow down whether the GPU is the root cause, you can disable it by running:

export CUDA_VISIBLE_DEVICES=''

and then run the following command line instructions in the shell:

hailo parser onnx super-resolution-10.onnx
hailo optimize super-resolution-10.har --use-random-calib-set
hailo compiler super-resolution-10_optimized.har

Let us know if these pieces of advice helped.

victorc,

thank you for the quick reply.

With your directions I end up with pretty much the same result. During inference, I receive the following output from my script:

from hailo_platform import HEF

model_name = 'super-resolution-10'
hef = HEF(f'{model_name}.hef')
# inspect the input vstream metadata stored in the compiled HEF
input_vstream_info = hef.get_input_vstream_infos()[0]
print("input_vstream_info.shape:", input_vstream_info.shape)
print("input_vstream_info.format.order:", input_vstream_info.format.order)
print("input_vstream_info.format.type:", input_vstream_info.format.type)
print("input_vstream_info.format.flags:", input_vstream_info.format.flags)

result:

input_vstream_info.shape: (224, 224, 1)
input_vstream_info.format.order: FormatOrder.NHWC
input_vstream_info.format.type: FormatType.UINT8
input_vstream_info.format.flags: FormatFlags.NONE

But later:

from hailo_sdk_client import ClientRunner, InferenceContext

runner = ClientRunner(hw_arch='hailo8', har=f'{model_name}_compiled.har')
print("Input data shape/dtype:", data_in.shape, data_in.dtype)
with runner.infer_context(InferenceContext.SDK_HAILO_HW) as hw_ctx:
    result = runner.infer(hw_ctx, data_in)

result:

Input data shape/dtype: (224, 224, 1) uint8
Inference: 0entries [00:00, ?entries/s][HailoRT] [error] CHECK failed - Memory size of vstream super-resolution-10/input_layer1 does not match the frame count! (Expected 1605632, got 7168)
[HailoRT] [error] CHECK_SUCCESS failed with status=HAILO_INVALID_ARGUMENT(2)
Traceback (most recent call last):

hailo_platform.pyhailort.pyhailort.HailoRTInvalidArgumentException: Exception encountered when calling layer 'hw_inference_model' (type HWInferenceModel).

Invalid argument. See hailort.log for more information

Call arguments received by layer 'hw_inference_model' (type HWInferenceModel):
• inputs=tf.Tensor(shape=(8, 224, 1), dtype=float32)

The logfile hailort.log contains the following error messages:

[2024-11-21 13:26:50.419] [97165] [HailoRT] [error] [inference_pipeline.cpp:274] [verify_memory_view_size] CHECK failed - Memory size of vstream super-resolution-10/input_layer1 does not match the frame count! (Expected 1605632, got 7168)
[2024-11-21 13:26:50.419] [97165] [HailoRT] [error] [inference_pipeline.cpp:201] [infer] CHECK_SUCCESS failed with status=HAILO_INVALID_ARGUMENT(2)

Any idea what this could mean?

/André

Hi André, since you are using the runner.infer_context() API, I think you need to pass float32 input, not uint8. Under the hood, runner.infer_context() will convert from float32 to uint8.
This API is for verification only. If you want to do real inference, we advise you to use the Python async API as demonstrated in the HRT_0_Async_Inference_Tutorial.ipynb tutorial, which you can open by running "hailo tutorial" in the Hailo virtual environment. With this API, you can set the type of the input and output buffers and the API will take them into account.
Please see more details at: HailoRT v4.19.0
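
For reference, here is a rough sketch of that async flow in Python, based on the tutorial mentioned above. The HEF path, buffer shapes, and variable names are placeholders, and the exact method names may differ slightly between HailoRT versions, so treat this as an outline rather than copy-paste code:

import numpy as np
from hailo_platform import VDevice, FormatType

with VDevice() as vdevice:
    infer_model = vdevice.create_infer_model('super-resolution-10.hef')  # placeholder path
    # request uint8 buffers on both ends so no float conversion happens on the host
    infer_model.input().set_format_type(FormatType.UINT8)
    infer_model.output().set_format_type(FormatType.UINT8)
    with infer_model.configure() as configured:
        bindings = configured.create_bindings()
        bindings.input().set_buffer(np.zeros(infer_model.input().shape, dtype=np.uint8))
        bindings.output().set_buffer(np.zeros(infer_model.output().shape, dtype=np.uint8))
        configured.run([bindings], 1000)  # timeout in milliseconds
        result = bindings.output().get_buffer()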

Hi Victor,

I managed to successfully run inference using InferVStreams(), but still had to use float32 dtype for it. When I try to use uint8 inference, as defined by input tensor dtype, I still receive this message:

Given input data dtype (uint8) is different than inferred dtype (float32). conversion for every frame will reduce performance

So what actually defines the expected data type, if not the type of the input tensor? And what do I need to tune (parsing? optimization? compilation?) to make sure it’s a pure uint8-in/uint8-out inference application? (float32 requires 4x more PCIe bandwidth!)

The other thing: runner.infer_context() may be a float32-only inference method, but it still has the problem of a wrong tensor dimension expectation:

Call arguments received by layer ‘hw_inference_model’ (type HWInferenceModel):
• inputs=tf.Tensor(shape=(8, 224, 1), dtype=float32)

There must be a software bug involved, because in both cases I use the same compiled model.

/André

In the meantime, I figured out that I can actually choose the data type for inference via the InferVStreams() input/output vstreams. Which I find pretty strange, since it happens after optimization and model compilation.

Shouldn’t the buffer be constrained to this type in some way? Or is it something that can be applied dynamically by your inference tools?

I haven’t done any benchmark tests yet; for now I’ll trust the tool that inference can indeed be switched to a uint8->uint8 mode. But I’m far from understanding what is actually going on there.

And on a side note: output type uint8 results in a different data scaling than float32. This should not happen at all. I dare not ask how I can correct this behaviour!? (The super-resolution result image in my case has a different data range: float32 after normalization ranges 0…255, uint8 ranges somewhere around 12…230.)

I’m closing this topic now as it is too confusing. I’ll start a new topic for the remaining issue.

/André

The Hailo device requires input data to be quantized before being sent to the device and output data to be de-quantized after being received from the device.

When a neural network is compiled by the Dataflow Compiler, each input and output layer is assigned a qp_zp (zero point) and qp_scale, both floating point values stored in the HEF file. These values are derived during the calibration process and depend on the dynamic range of the images present in the calibration dataset.

  • Input transformation: The input data is divided by the qp_scale, and the qp_zp is added to the result.

  • Output transformation: The qp_zp is subtracted from the output data, and the result is multiplied by the qp_scale.
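
In numpy terms, the two transformations look roughly like this (a minimal sketch for illustration, not HailoRT code):

import numpy as np

def quantize_input(x_float, qp_scale, qp_zp):
    # input transformation: divide by qp_scale, add the zero point, round into uint8
    return np.clip(np.round(x_float / qp_scale + qp_zp), 0, 255).astype(np.uint8)

def dequantize_output(y_uint8, qp_scale, qp_zp):
    # output transformation: subtract the zero point, multiply by qp_scale
    return (y_uint8.astype(np.float32) - qp_zp) * qp_scale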

The data quantization example provides code demonstrating how to set the quantization behavior using virtual streams.

The format_type argument is used to control the quantization and de-quantization behavior of input and output streams.

  • Input streams: Passing HAILO_FORMAT_TYPE_AUTO to hailo_make_input_vstream_params() means the input data is already quantized. HailoRT will not perform any additional quantization. Passing HAILO_FORMAT_TYPE_FLOAT32 will cause the input data to be quantized.

  • Output streams: Passing HAILO_FORMAT_TYPE_AUTO to hailo_make_output_vstream_params() means the output data should remain quantized. HailoRT will not de-quantize the output data. Passing HAILO_FORMAT_TYPE_FLOAT32 will cause the output data to be de-quantized.

Important note: In some cases, such as NMS output, HAILO_FORMAT_TYPE_AUTO for output streams actually represents HAILO_FORMAT_TYPE_FLOAT32, resulting in output de-quantization.

HailoRT’s inference tools handle data type conversions dynamically. When you choose a data type for InferVStreams() that differs from the device’s expected format, HailoRT automatically performs the necessary conversions.

  • Input vstreams: If you provide input data in float32 format but the device expects uint8, HailoRT quantizes the data before sending it to the device.

  • Output vstreams: Similarly, if the device outputs data in uint8 format, but you request float32 output, HailoRT de-quantizes the data before returning it to you.

This dynamic data type handling is controlled by the format_type parameter when creating vstream parameters:

  • HAILO_FORMAT_TYPE_AUTO: Tells HailoRT to automatically choose the format based on the device’s expectation. For input, this typically means the data is already quantized. For output, it usually means the data should remain quantized.

  • HAILO_FORMAT_TYPE_FLOAT32: Forces HailoRT to perform quantization (for input) or de-quantization (for output) even if the device’s default format is different.

This approach provides flexibility for users to work with data in their preferred format while ensuring compatibility with the Hailo device.
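
To illustrate, here is a rough sketch of a pure uint8-in/uint8-out InferVStreams pipeline with the Python API. The HEF path and the dummy frame are placeholders; check the HailoRT documentation for your version for the exact signatures:

import numpy as np
from hailo_platform import (HEF, VDevice, ConfigureParams, HailoStreamInterface,
                            InferVStreams, InputVStreamParams, OutputVStreamParams,
                            FormatType)

hef = HEF('super-resolution-10.hef')  # placeholder path
with VDevice() as vdevice:
    configure_params = ConfigureParams.create_from_hef(hef, interface=HailoStreamInterface.PCIe)
    network_group = vdevice.configure(hef, configure_params)[0]
    network_group_params = network_group.create_params()

    # ask for quantized uint8 buffers on both input and output, so no
    # float32 conversion is done on the host side
    input_params = InputVStreamParams.make(network_group, format_type=FormatType.UINT8)
    output_params = OutputVStreamParams.make(network_group, format_type=FormatType.UINT8)

    input_info = hef.get_input_vstream_infos()[0]
    frame = np.zeros((1, *input_info.shape), dtype=np.uint8)  # dummy input frame

    with InferVStreams(network_group, input_params, output_params) as pipeline:
        with network_group.activate(network_group_params):
            results = pipeline.infer({input_info.name: frame})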

Regarding your other question about output type uint8 resulting in a different data scaling than float32: this is because qp_scale and qp_zp need to be applied to your uint8 output to map it to the full floating point range of [0.0, 255.0].

You can use HailoRT APIs to obtain the quantization parameters (qp_scale and qp_zp) for output layers in a compiled Hailo model.

These parameters are stored in the HEF file, which contains the quantized model information. The hailo_quant_info_t structure holds them, along with limvals_min and limvals_max, which represent the minimum and maximum limit values.

Here are the APIs you can use to access these parameters:

  • Python API:

    • hailort.OutputStream.get_quant_infos(): This API retrieves the quantization information for the given output stream. It returns a vector of hailo_quant_info_t structures, each representing the quantization information for a specific feature in the output. If a single quantization parameter set is used for all features, the vector will contain only one element.
    • hailort.OutputVStream.get_quant_infos(): Similar to the previous API, but for output virtual streams.
  • C API:

    • hailo_get_output_stream_quant_infos(): This API retrieves the quantization information for a given output stream. It takes a pointer to a hailo_quant_info_t array, the maximum number of entries in the array, and the actual number of entries written.
    • hailo_get_output_vstream_quant_infos(): This API functions similarly to the previous API, but for output virtual streams.

These APIs allow you to obtain the quantization parameters for specific output streams or virtual streams. You can then access the qp_scale and qp_zp values from the returned hailo_quant_info_t structures to understand how the output data is scaled and quantized.
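
For example, a short sketch of reading these values from an output vstream and applying them on the host, continuing from the pipeline sketch above (out_vstream and raw_out are placeholders, and the exact attribute names may vary by HailoRT version):

# out_vstream: an output vstream from the pipeline; raw_out: its raw uint8 result
quant_info = out_vstream.get_quant_infos()[0]            # hailo_quant_info_t of this output
qp_scale, qp_zp = quant_info.qp_scale, quant_info.qp_zp
dequantized = (raw_out.astype(np.float32) - qp_zp) * qp_scale   # back to the float range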

You can also include many images in your calibration dataset that have black (value 0) pixels and bright (value 255) pixels to force the statistics to represent the full range.
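
A minimal sketch of such a calibration set on the DFC side, reusing the ClientRunner from the earlier steps (the frame count and the forced black/white frames are illustrative):

import numpy as np

# 64 random frames of shape (224, 224, 1) covering the full 0..255 range,
# plus one all-black and one all-white frame to anchor the statistics
calib_set = np.random.randint(0, 256, size=(64, 224, 224, 1)).astype(np.float32)
calib_set[0] = 0.0
calib_set[1] = 255.0

runner.optimize(calib_set)  # same ClientRunner instance used for parsing/optimization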