Custom Python post-processing function in GStreamer: performance issues

I’ve managed to train a custom yolov8s_pose model which detects the 4 corners of a bed. I’ve compiled it to a HEF and it’s running successfully on the Hailo-8 on a Raspberry Pi.

I’m running it through GStreamer and using a Python script for the post-processing.

I found this post and I managed to extract the keypoint data.

The thing is that this runs really slowly; we’re talking 10 FPS. I managed to get it to around 20 FPS with some improvements to the NMS function, but it’s still not fast enough.
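For reference, the usual way to speed up a NumPy NMS is to vectorize the IoU computation against all remaining boxes at once instead of looping over box pairs; a sketch of the idea (illustrative only, not my exact code, and the function name and arguments are made up):

import numpy as np

def nms(boxes, scores, iou_thresh=0.7, max_detections=1):
    # boxes: (N, 4) array of [xmin, ymin, xmax, ymax]; scores: (N,)
    order = scores.argsort()[::-1]  # best score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0 and len(keep) < max_detections:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # intersection of box i with every remaining box, all at once
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]
    return keep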

Is this just a limitation of Python, or is there a better way to do this?

Ultimately I’d like to detect the bed and the person with the same model, so I’ll need a way to do the post-processing with multiple classes, but I’ll figure that out later.

The run function looks like this:

import numpy as np
from gi.repository import Gst
from gsthailo import VideoFrame  # imports assumed by this excerpt


def run(video_frame: VideoFrame):
    class_num = 1
    regression_length = 15

    # Index the raw output tensors by shape so each one can be matched to
    # the right branch of the detection head.
    raw_detections_keys = [tensor.name() for tensor in video_frame.roi.get_tensors()]
    raw_detections = {
        tensor.name(): np.expand_dims(np.array(tensor), axis=0)
        for tensor in video_frame.roi.get_tensors()
    }
    layer_from_shape = {raw_detections[key].shape: key for key in raw_detections_keys}

    detection_output_channels = (regression_length + 1) * 4  # (regression length + 1) * num_coordinates
    keypoints = 12  # 3 * number of corners of the bed

    # One (boxes, scores, keypoints) triple per output grid: 20x20, 40x40, 80x80.
    endnodes = [
        raw_detections[layer_from_shape[1, 20, 20, detection_output_channels]],
        raw_detections[layer_from_shape[1, 20, 20, class_num]],
        raw_detections[layer_from_shape[1, 20, 20, keypoints]],
        raw_detections[layer_from_shape[1, 40, 40, detection_output_channels]],
        raw_detections[layer_from_shape[1, 40, 40, class_num]],
        raw_detections[layer_from_shape[1, 40, 40, keypoints]],
        raw_detections[layer_from_shape[1, 80, 80, detection_output_channels]],
        raw_detections[layer_from_shape[1, 80, 80, class_num]],
        raw_detections[layer_from_shape[1, 80, 80, keypoints]],
    ]

    predictions_dict = extract_pose_estimation_results(endnodes, 640, 640, class_num)

    return Gst.FlowReturn.OK

Everything else is basically as it was in here, except for the keypoint counts and dropping self, since it’s now a plain function rather than a method.

Hi @rosslote
In our basic pipeline example we run the post-process using C++ code.
You can check it out here: tappas/core/hailo/libs/postprocesses/pose_estimation/yolov8pose_postprocess.cpp at master · hailo-ai/tappas · GitHub
You can try switching to C++ if Python is not fast enough.
I would be happy to see your code added to our GitHub under “Community Projects”; it sounds like your work could be helpful to other people as well.

I’d love to contribute but I need to get it fully working first.

So I think I was being a bit daft. max_detections was set to 300, which was really upping the iterations. I only need to detect 1, so I’ve taken that down and I can get a clean 40+ FPS. I will take a look at converting it to C++ once I get it working.

I’ve just passed it to the overlay filter and it’s not showing anything, because my bbox values are all coming back as NaN, so I need to work out why that is.

My keypoints seem to be all over the place too, which I’m guessing is a problem with the translation from ONNX to HEF, as the model was working fine in the original ONNX format.

When optimizing/compiling using the DFC, what happens if I don’t add a --model-script? Do I have to include the same script in both the optimize and compile steps? Should I use the .alls scripts from the model zoo? I honestly don’t know what any of these do as I’m a complete noob here, so if you could point me in the direction of some documentation or give me a quick overview, that would be great.

One thing which threw me off was the quantization_param(output_layer3, precision_mode=a16_w16) lines. Where are these layers? I’ve checked in Netron and can’t find any mention of output_layer, so I don’t know what they mean.

I’ve found an issue and I think I’m lacking the correct knowledge to figure this out.

The _softmax in the post-processor is giving these warnings:

/home/ross.lote/Code/hailo-rpi5-examples/basic_pipelines/bed_postprocessor_util.py:241: RuntimeWarning: overflow encountered in exp
  return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=-1), axis=-1)
/home/ross.lote/Code/hailo-rpi5-examples/basic_pipelines/bed_postprocessor_util.py:241: RuntimeWarning: invalid value encountered in divide
  return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=-1), axis=-1)

The min/max of x is:
min: 18
max: 249

Should these be scaled somehow? What sort of values are expected in this function?
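As an aside, the overflow in exp itself can be avoided with the standard max-subtraction trick, since softmax is shift-invariant; a minimal sketch of the numerically stable form:

import numpy as np

def _softmax(x):
    # softmax(x) == softmax(x - c) for any constant c; subtracting the
    # per-row max keeps np.exp in range and avoids the overflow warning.
    x = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)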

I saw a possible explanation in this post.

But when I print out my raw_detections, it shows something like this:

[[153, 155, 139, ..., 128, 126, 126],
 [141, 147, 145, ..., 121, 119, 121],
 [132, 143, 142, ..., 118, 116, 119],
 ...,
 [134, 144, 143, ..., 120, 119, 121],
 [139, 147, 143, ..., 123, 121, 123],
 [144, 149, 143, ..., 124, 122, 123]]]], dtype=uint8), 'yolov8s_bed/conv44': array([[[[0],

Could this be the issue? Are these supposed to be floats from 0 to 1?

@rosslote
Yes, the output tensors need to be dequantized. Every output tensor has its own zero point and scale. You can use float_value = scale * (quant_value - zero_point)
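In NumPy terms that formula is simply (a sketch; the function name is illustrative):

import numpy as np

def dequantize(quant_values, scale, zero_point):
    # float_value = scale * (quant_value - zero_point), applied elementwise
    return scale * (np.asarray(quant_values, dtype=np.float32) - zero_point)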

Where can I get this info?

I tried this:

    raw_detections = {
        tensor.name(): np.expand_dims(
            np.array(tensor, copy=False).astype(np.float32) / 255.0, axis=0
        )
        for tensor in video_frame.roi.get_tensors()
    }

which got me past the errors, but I think I now have other issues.

If I run dir() on the tensors I get:

'data', 'features', 'fix_scale', 'get', 'get_full_percision', 'height', 'name', 'shape', 'size', 'vstream_info', 'width'

Are these useful?

Hi @rosslote
This is how you get the quantization parameters info

from hailo_platform import HEF

hef = HEF("your_hef_path")
output_vstream_info = hef.get_output_vstream_infos()

print("Outputs")
for output_info in output_vstream_info:
    print(output_info)
    print("Scale: {}".format(output_info.quant_info.qp_scale))
    print("Zero point: {}\n".format(output_info.quant_info.qp_zp))

I’m doing this inside a Python postprocess script, so I need a way to get that info from there.

My code currently looks like this:

def run(video_frame: VideoFrame):
    # Note: the post-processor is rebuilt on every frame here; hoisting it
    # out of run() would avoid repeated per-frame setup.
    post_processor = PoseEstPostProcessing(
        max_detections=1,
        score_threshold=0.001,
        nms_iou_thresh=0.7,
        regression_length=15,
        strides=[8, 16, 32]
    )

    class_num = 1
    raw_detections = {
        tensor.name(): np.expand_dims(
            np.array(tensor, copy=False).astype(np.float32) / 255.0, axis=0
        )
        for tensor in video_frame.roi.get_tensors()
    }

    result = post_processor.post_process(raw_detections, 640, 640, class_num)
    for bbox, score, keypts, joint_score in zip(
        result['bboxes'][0],
        result['scores'][0],
        result['keypoints'][0],
        result['joint_scores'][0],
    ):
        xmin, ymin, w, h = [float(x) for x in bbox]
        print("box", xmin, ymin, w, h)
        bbox = hailo.HailoBBox(xmin, ymin, w, h)
        detection = hailo.HailoDetection(bbox, "Bed", score[0])

        hailo_points = []
        print("kpts", keypts)
        for pt in keypts:
            hailo_points.append(hailo.HailoPoint(pt[0], pt[1], joint_score[0]))

        landmarks = hailo.HailoLandmarks('yolo', hailo_points, 0, JOINT_PAIRS)

        detection.add_object(landmarks)

        video_frame.roi.add_object(detection)

And the output of the prints is:

box 127.14583480358124 148.2589807510376 551.6020673513412 576.5947957038879
kpts [[346.10195923 367.56079102]
 [335.81176758 396.67449951]
 [367.43530273 366.80783081]
 [335.56079102 411.7333374 ]]

Which look correct(ish) if they are pixel values.

Alternatively this could all be complete garbage but I can’t know until I see it drawn on my video output.

Thanks for the help, I really appreciate it.

Could you point out here what might need to change?

Hi @rosslote
In your code you are sending float values to your postprocessor, but there is an assumption that the conversion from int to float is just dividing by 255, which is not always true. Since dir(tensor) showed vstream_info, you can print vstream_info for every tensor and share what you see. We can then write a simple function that dequantizes the output tensors based on this info.

Frustratingly:

try:
    info = tensor.vstream_info()
except Exception as e:
    print(e)

Unable to convert function return value to a Python type! The signature was
        (self: hailo.HailoTensor) -> hailo_vstream_info_t

Just for my sanity I ran your suggestion separately and got:

Outputs
VStreamInfo("yolov8s_bed/conv70")
Scale: 0.1039922684431076
Zero point: 168.0

VStreamInfo("yolov8s_bed/conv71")
Scale: 0.003921568859368563
Zero point: 0.0

VStreamInfo("yolov8s_bed/conv72")
Scale: 0.0006715772324241698
Zero point: 16007.0

VStreamInfo("yolov8s_bed/conv57")
Scale: 0.09461547434329987
Zero point: 151.0

VStreamInfo("yolov8s_bed/conv58")
Scale: 0.003921568859368563
Zero point: 0.0

VStreamInfo("yolov8s_bed/conv59")
Scale: 0.0004394233983475715
Zero point: 22882.0

VStreamInfo("yolov8s_bed/conv43")
Scale: 0.15522289276123047
Zero point: 133.0

VStreamInfo("yolov8s_bed/conv44")
Scale: 0.003921568859368563
Zero point: 0.0

VStreamInfo("yolov8s_bed/conv45")
Scale: 0.00035622910945676267
Zero point: 19781.0

@rosslote
Ok, this is good news. We just need to figure out how to get the same info from the output tensors. Let me see if I can find info on this; someone from the Hailo team would know the answer.

@giladn Any idea what this error could be?

@rosslote
Hey, I would like to share some thoughts; it seems like you have a similar issue to the one I had.
Low FPS: this can be improved using C++-based post-processing. (In Python I had multiple issues: I was only able to achieve 8-10 FPS max per camera with almost full CPU usage, and the second issue was that the program crashed very often because memory kept increasing.) Using C++-based post-processing increases FPS to 25-30 per camera (RPi 5), but I limited it to 15 to keep CPU usage down.

Regarding your dequantization issue:
I believe the default quantization is set to 8-bit for the bounding boxes and 16-bit for the keypoints, and I was not able to figure out a solution in Python. (But if you set the quantization to fully 8-bit it will work; the points will be correct.)
Reason: the Python APIs are just wrappers around C++ and they convert only to uint8 (as far as I remember); you can take a look for more detailed info.
Solution: convert your model to fully 8-bit.

In C++ post-processing you will have the same issue, but I was able to build a fully 16-bit postprocess by adjusting some of their conversion code from uint8 to uint16.
Please take a look at this thread: Hey I want to build my own custom postprocessing .so - #8 by saurabh

Compare your HEF output layers’ precision modes like I did in the shared post.

I’m not sure how to inspect the tensor types, but when I convert them to numpy arrays they are all uint8:

for tensor in video_frame.roi.get_tensors():
    print(tensor.name(), np.array(tensor).dtype)

yolov8s_bed/conv43 uint8
yolov8s_bed/conv44 uint8
yolov8s_bed/conv45 uint8
yolov8s_bed/conv57 uint8
yolov8s_bed/conv58 uint8
yolov8s_bed/conv59 uint8
yolov8s_bed/conv70 uint8
yolov8s_bed/conv71 uint8
yolov8s_bed/conv72 uint8

I managed to solve the FPS issue by reducing the max_detections.

There seems to be a lot of bad info out there at the moment; I’m not sure if it’s due to changing APIs or not. For example, one suggested snippet just gives this error:

a bytes-like object is required, not 'int'

It looks like data is just a number, not an array.

Also, another suggested approach does not actually dequantize my tensor when I call it.

@shashi Using the info you gave me, I implemented this:

tensor_info = {
    "yolov8s_bed/conv70": {"scale": 0.1039922684431076, "zp": 168.0},
    "yolov8s_bed/conv71": {"scale": 0.003921568859368563, "zp": 0.0},
    "yolov8s_bed/conv72": {"scale": 0.0006715772324241698, "zp": 16007.0},
    "yolov8s_bed/conv57": {"scale": 0.09461547434329987, "zp": 151.0},
    "yolov8s_bed/conv58": {"scale": 0.003921568859368563, "zp": 0.0},
    "yolov8s_bed/conv59": {"scale": 0.0004394233983475715, "zp": 22882.0},
    "yolov8s_bed/conv43": {"scale": 0.15522289276123047, "zp": 133.0},
    "yolov8s_bed/conv44": {"scale": 0.003921568859368563, "zp": 0.0},
    "yolov8s_bed/conv45": {"scale": 0.00035622910945676267, "zp": 19781.0},
}

def dequantize_tensor(tensor):
    info = tensor_info[tensor.name()]
    zero_point = info['zp']
    scale = info['scale']
    return scale * (np.array(tensor, copy=False) - zero_point)
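This slots in where the /255 division was in the dict comprehension:

raw_detections = {
    tensor.name(): np.expand_dims(dequantize_tensor(tensor), axis=0)
    for tensor in video_frame.roi.get_tensors()
}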

I hard coded the values for now until I figure out how to get this info dynamically.

Is this correct?
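For getting the values dynamically, something like this might work, reusing the hailo_platform snippet from earlier (the HEF path is an assumption; I’m not sure how the postprocess script would be told where it lives):

from hailo_platform import HEF

# Assumed path - the postprocess script would need to be handed this somehow.
hef = HEF("resources/yolov8s_bed.hef")
tensor_info = {
    info.name: {"scale": info.quant_info.qp_scale, "zp": info.quant_info.qp_zp}
    for info in hef.get_output_vstream_infos()
}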

The output of my bbox and keypoints now looks like this:

bbox 243.2337546426842 191.12009224489435 467.467703175785 471.105251257167
kpts [[-362.5803538  -333.33113414]
 [-365.2881532  -328.38832571]
 [-361.84967777 -333.46007697]
 [-365.33113414 -335.13633374]]

The bbox is correct as pixel values, but hailo overlay seems to want percentages, so if I do this:

        # the last two values are really xmax/ymax, hence the subtraction
        xmin, ymin, w, h = [float(x)/640 for x in bbox]
        bbox = hailo.HailoBBox(xmin, ymin, w-xmin, h-ymin)

it draws the box correctly.

The keypoints are all negative now though so I need to figure that out.

It would be nice if I could just get the values back from the postprocessor as percentage values, so I guess I should make some adjustments. The only part I can see which uses the image dimensions is here:

        for box_distribute, kpts, stride, _ in zip(raw_boxes, raw_kpts, strides, np.arange(3)):
            shape = [int(x / stride) for x in image_dims]
            grid_x = np.arange(shape[1]) + 0.5
            grid_y = np.arange(shape[0]) + 0.5
            grid_x, grid_y = np.meshgrid(grid_x, grid_y)
            ct_row = grid_y.flatten() * stride
            ct_col = grid_x.flatten() * stride

I don’t really understand what stride means yet though.
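For reference, the stride here is the downsampling factor from the 640x640 input image to each output grid, which is why multiplying a grid index by its stride (ct_row/ct_col above) yields pixel coordinates:

# The three strides correspond to the three output grids of the model:
for stride in (8, 16, 32):
    print(stride, 640 // stride)  # 8 -> 80, 16 -> 40, 32 -> 20 cells per side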

@rosslote I remember I also checked the type in the Python post-process and it was all 8-bit. That was not true when I checked with the HailoRT CLI.
I don’t remember the exact CLI command, but you might try hailortcli’s help; you should see something like “network info”.

Those three quantization_param(..., precision_mode=a16_w16) lines are for the keypoints output layers, which get set to 16-bit precision mode; I believe you have used this default configuration.

If you remove these lines you will get all 8-bit outputs.

You will have to customize the Python post-processing (provided by the Hailo team) according to your number of keypoints; you only need to replace values in a few places.

Got it:

hailortcli parse-hef resources/yolov8s_bed_no_optimize.hef 
Architecture HEF was compiled for: HAILO8
Network group name: yolov8s_bed, Multi Context - Number of contexts: 2
    Network name: yolov8s_bed/yolov8s_bed
        VStream infos:
            Input  yolov8s_bed/input_layer1 UINT8, NHWC(640x640x3)
            Output yolov8s_bed/conv70 UINT8, FCR(20x20x64)
            Output yolov8s_bed/conv71 UINT8, NHWC(20x20x1)
            Output yolov8s_bed/conv72 UINT16, NHWC(20x20x12)
            Output yolov8s_bed/conv57 UINT8, FCR(40x40x64)
            Output yolov8s_bed/conv58 UINT8, NHWC(40x40x1)
            Output yolov8s_bed/conv59 UINT16, FCR(40x40x12)
            Output yolov8s_bed/conv43 UINT8, FCR(80x80x64)
            Output yolov8s_bed/conv44 UINT8, NHWC(80x80x1)
            Output yolov8s_bed/conv45 UINT16, FCR(80x80x12)