How to Quickly Switch and Execute Two Single Context HEF Models

I am building an application using YOLO and ResNet50 HEF models. The HEF models are compiled separately using the DFC. My system has only a single physical Hailo-8 device.

Checking with hailortcli, both models are Single Context models, and each runs at over 100 FPS in the benchmark.

I wrote functions in C using HailoRT to run each HEF file (and call them from Python using cdll). I referred to this example:

When I run each model repeatedly, inference on the Hailo-8 becomes faster from the second execution onward. I think the initial slowness is due to the overhead of loading parameters and control operations, which is no longer needed from the second run, so subsequent executions are faster.

However, when I run the two models alternately (following the HailoRT example), this loading is required on every switch and performance is poor. Regarding the trade-off between latency and FPS, I prioritize FPS.

What is an efficient way to run an application using two models on a single device? In my use case, the two models do not share input images: the bounding boxes detected by YOLOv5 are cropped out of the frame before being fed to ResNet50.
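
For context, the intended data flow looks roughly like the sketch below; decode_boxes, resize_to, run_yolov5, and run_resnet50 are placeholder names for my actual post-processing and inference calls:

def detect_then_classify(frame):
    # Stage 1: detector on the full frame (640x640x3 uint8)
    preds = run_yolov5(frame)
    # Stage 2: classify each detected box on its own crop
    for (x0, y0, x1, y1) in decode_boxes(preds):            # placeholder decode/NMS
        crop = resize_to(frame[y0:y1, x0:x1], (224, 224))   # classifier input size
        yield (x0, y0, x1, y1), run_resnet50(crop)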

According to the DFC documentation, there appears to be a way to integrate two or more models into a single HEF file using the join function. How is performance affected when models are integrated with join_action=NONE? Also, is there any compile or runtime guide for join?

P.S. Terms like “Network Group” and “Context” are used in the DFC, but they are not clearly defined in the DFC documentation, so my understanding may be inaccurate.

Thanks in advance.

The link I referred to is this:


You only have to initialize once, when the program starts; don't re-initialize when switching models.

Please refer to the “multi_network_vstream_example”.

I initialize the device and the vstreams only once, when the program starts, in init().
I attach the C code and Python code.

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include <math.h>
#include "hailo/hailort.h"
#define HEF_COUNT (2)
#define MAX_HEF_PATH_LEN (255)
#define MAX_EDGE_LAYERS (32)
// #define SCHEDULER_TIMEOUT_MS (0)
// #define SCHEDULER_THRESHOLD (0)
static char HEF_FILES[HEF_COUNT][MAX_HEF_PATH_LEN] = {"yolov5s_personface.hef", "resnet50.hef"};

static hailo_vdevice vdevice = NULL;
static hailo_hef hef[HEF_COUNT] = {NULL};
static hailo_configure_params_t config_params = {0};
static hailo_configured_network_group network_groups[HEF_COUNT] = {NULL};
static size_t network_groups_size = 1;
static hailo_vstream_info_t output_vstreams_info[HEF_COUNT][MAX_EDGE_LAYERS] = {0};
static hailo_input_vstream input_vstreams[HEF_COUNT][MAX_EDGE_LAYERS] = {NULL};
static hailo_output_vstream output_vstreams[HEF_COUNT][MAX_EDGE_LAYERS] = {NULL};
static size_t num_input_vstreams[HEF_COUNT];
static size_t num_output_vstreams[HEF_COUNT];

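/* Read one output vstream into a stack buffer and dequantize it:
 * out[i] = qp_scale * (raw[i] - qp_zp), writing `size` floats through `out`. */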
#define READ_AND_DEQUANTIZE(hef_idx, out_idx, type, out, size)                                               \
    do                                                                                                       \
    {                                                                                                        \
        type buf[size];                                                                                      \
        hailo_status status = HAILO_UNINITIALIZED;                                                           \
        status = hailo_vstream_read_raw_buffer(output_vstreams[hef_idx][out_idx], buf, size * sizeof(type)); \
        assert(status == HAILO_SUCCESS);                                                                     \
        float scale = output_vstreams_info[hef_idx][out_idx].quant_info.qp_scale;                            \
        float zp = output_vstreams_info[hef_idx][out_idx].quant_info.qp_zp;                                  \
        for (int i = 0; i < size; i++)                                                                       \
            *out++ = scale * (buf[i] - zp);                                                                  \
    } while (0)

void infer_personface(
    unsigned char *input0,
    float *pred80,
    float *pred40,
    float *pred20)
{
    hailo_status status = HAILO_UNINITIALIZED;
    /* Feed Data */
    status = hailo_vstream_write_raw_buffer(input_vstreams[0][0], input0, 640 * 640 * 3);
    assert(status == HAILO_SUCCESS);
    status = hailo_flush_input_vstream(input_vstreams[0][0]);
    assert(status == HAILO_SUCCESS);

    READ_AND_DEQUANTIZE(0, 0, uint8_t, pred80, 80*80*21);
    READ_AND_DEQUANTIZE(0, 1, uint8_t, pred40, 40*40*21);
    READ_AND_DEQUANTIZE(0, 2, uint8_t, pred20, 20*20*21);
}

void infer_resnet50(
    unsigned char *input0,
    float *output0)
{
    hailo_status status = HAILO_UNINITIALIZED;
    /* Feed Data */
    status = hailo_vstream_write_raw_buffer(input_vstreams[1][0], input0, 224 * 224 * 3);
    assert(status == HAILO_SUCCESS);
    status = hailo_flush_input_vstream(input_vstreams[1][0]);
    assert(status == HAILO_SUCCESS);

    READ_AND_DEQUANTIZE(1, 0, uint8_t, output0, 1);
}

int init()
{
    hailo_status status = HAILO_UNINITIALIZED;
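    /* Create a virtual device with the HailoRT model scheduler (round-robin),
     * so both network groups can stay configured and switching is handled automatically. */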
    hailo_vdevice_params_t params = {0};
    params.scheduling_algorithm = HAILO_SCHEDULING_ALGORITHM_ROUND_ROBIN;
    params.device_count = 1;
    status = hailo_create_vdevice(&params, &vdevice);
    assert(status == HAILO_SUCCESS);

    for (size_t hef_index = 0; hef_index < HEF_COUNT; hef_index++)
    {
        status = hailo_create_hef_file(&hef[hef_index], HEF_FILES[hef_index]);
        assert(status == HAILO_SUCCESS);

        status = hailo_init_configure_params(hef[hef_index], HAILO_STREAM_INTERFACE_PCIE, &config_params);
        assert(status == HAILO_SUCCESS);
        status = hailo_configure_vdevice(vdevice, hef[hef_index], &config_params, &network_groups[hef_index], &network_groups_size);
        assert(status == HAILO_SUCCESS);
        // Set scheduler's timeout and threshold for the first network group, in order to give priority to the second network group
        /*if (0 == hef_index) {
            status =  hailo_set_scheduler_timeout(network_groups[hef_index], SCHEDULER_TIMEOUT_MS, NULL);
            status =  hailo_set_scheduler_threshold(network_groups[hef_index], SCHEDULER_THRESHOLD, NULL);
        }*/
        // Make sure it can hold amount of vstreams for hailo_make_input/output_vstream_params
        hailo_input_vstream_params_by_name_t input_vstream_params[MAX_EDGE_LAYERS];
        hailo_output_vstream_params_by_name_t output_vstream_params[MAX_EDGE_LAYERS];

        size_t input_vstream_size = MAX_EDGE_LAYERS;
        size_t output_vstream_size = MAX_EDGE_LAYERS;

        status = hailo_make_input_vstream_params(network_groups[hef_index], true, HAILO_FORMAT_TYPE_AUTO,
                                                 input_vstream_params, &input_vstream_size);
        assert(status == HAILO_SUCCESS);
        num_input_vstreams[hef_index] = input_vstream_size;

        status = hailo_make_output_vstream_params(network_groups[hef_index], true, HAILO_FORMAT_TYPE_AUTO,
                                                  output_vstream_params, &output_vstream_size);
        assert(status == HAILO_SUCCESS);
        num_output_vstreams[hef_index] = output_vstream_size;

        status = hailo_create_input_vstreams(network_groups[hef_index], input_vstream_params, input_vstream_size, input_vstreams[hef_index]);
        assert(status == HAILO_SUCCESS);
        status = hailo_create_output_vstreams(network_groups[hef_index], output_vstream_params, output_vstream_size, output_vstreams[hef_index]);
        assert(status == HAILO_SUCCESS);

        for (size_t i = 0; i < output_vstream_size; i++)
        {
            hailo_get_output_vstream_info(output_vstreams[hef_index][i], &output_vstreams_info[hef_index][i]);
        }
    }
    return status;
}

void destroy()
{
    for (size_t hef_index = 0; hef_index < HEF_COUNT; hef_index++)
    {
        (void)hailo_release_output_vstreams(output_vstreams[hef_index], num_output_vstreams[hef_index]);
        (void)hailo_release_input_vstreams(input_vstreams[hef_index], num_input_vstreams[hef_index]);
    }
    for (size_t hef_index = 0; hef_index < HEF_COUNT; hef_index++)
    {
        if (NULL != hef[hef_index])
        {
            (void)hailo_release_hef(hef[hef_index]);
        }
    }
    (void)hailo_release_vdevice(vdevice);
}
from ctypes import cdll, c_void_p
import numpy as np
import time

def run_yolov5(lib, input):
    t1 = time.time()
    out0 = np.zeros((80, 80, 3, 7), dtype=np.float32)
    out1 = np.zeros((40, 40, 3, 7), dtype=np.float32)
    out2 = np.zeros((20, 20, 3, 7), dtype=np.float32)
    # wrap the raw addresses in c_void_p so ctypes does not truncate them to C int
    lib.infer_personface(
            c_void_p(input.ctypes.data),
            c_void_p(out0.ctypes.data),
            c_void_p(out1.ctypes.data),
            c_void_p(out2.ctypes.data))
    t2 = time.time()
    return t2 - t1

def run_resnet50(lib, input):
    t1 = time.time()
    out = np.zeros((1), dtype=np.float32)
    lib.infer_resnet50(
            c_void_p(input.ctypes.data),
            c_void_p(out.ctypes.data))
    t2 = time.time()
    return t2 - t1

lib = cdll.LoadLibrary("./libhailomodels.so")
input1 = np.zeros((640, 640, 3), dtype=np.uint8)  # uint8: the C side reads H*W*3 bytes
input2 = np.zeros((224, 224, 3), dtype=np.uint8)
lib.init()

print("====only yolo=====")
for i in range(10):
    time1 = run_yolov5(lib, input1)
    print("yolo {:.2f}ms".format(time1*1000))


print("====only resnet50=====")
for i in range(10):
    time1 = run_resnet50(lib, input2)
    print("resnet50 {:.2f}ms".format(time1*1000))

print("=====yolo + resnet50=====")
for i in range(10):
    time1 = run_yolov5(lib, input1)
    time2 = run_resnet50(lib, input2)
    print("yolo{:.2f}ms resnet50 {:.2f}ms".format(time1*1000, time2*1000))
lib.destroy()

Additionally, I saw with the C code that a bigger size difference between the switched models makes switching slower, but as your attached results show, it does not slow down very much.

Thank you for your advice.

I am aware that on my device (Raspberry Pi CM4), execution may be slower than on systems with full PCIe bandwidth, because its PCIe bandwidth is limited.

With that in mind, have you tried joining two models into a single HEF file to achieve faster execution?

Sorry, I have no experience with joining two models into a single HEF file; my style has been one HEF per model.

I think there are two ways to get faster execution.

first, assign each HEF to its own thread.

second, use a multi-chip board and assign one chip to each HEF.

I tested both ways and got faster execution; a sketch of the first approach follows.
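
A minimal sketch of the first way, reusing run_yolov5/run_resnet50 and lib from the code above; yolo_frames and resnet_crops are placeholder input lists, and this assumes driving the two network groups from separate threads is safe under the HailoRT scheduler:

import threading

yolo_frames  = [np.zeros((640, 640, 3), dtype=np.uint8)] * 100  # placeholder inputs
resnet_crops = [np.zeros((224, 224, 3), dtype=np.uint8)] * 100  # placeholder inputs

def yolo_worker(lib, frames):
    for frame in frames:
        run_yolov5(lib, frame)      # keeps the yolov5s network group busy

def resnet_worker(lib, crops):
    for crop in crops:
        run_resnet50(lib, crop)     # keeps the resnet50 network group busy

t1 = threading.Thread(target=yolo_worker, args=(lib, yolo_frames))
t2 = threading.Thread(target=resnet_worker, args=(lib, resnet_crops))
t1.start(); t2.start()
t1.join(); t2.join()

Because ctypes releases the GIL while the C function runs, the two Python threads can overlap their inference calls.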

Why are you running YOLOv5 and then a ResNet after that? Are you aware that YOLOv5 (and other Object Detection models) are capable of finding the boxes AND classifying the thing inside the box? It sounds like a better approach would be to train your own YOLOv5 (or YOLOv8) model, so that it can handle whatever job you’re trying to do with the ResNet.

It’s actually a common approach to reinforce the classification of detected items. For example, detect a generic road sign (alongside other classes), and then pass the crop to a classifier to say exactly which road sign it is.
The benefits of this approach can be a simplified training phase for the detector, and the ability to use a smaller/lighter detector.


Object detection + classification is a very typical technique.

object detection = predict region & coarse category
classification = classify the detailed category

Thanks.

Sorry for the delay, but here’s an update.

If the models are small enough (at a minimum, each HEF needs to be compiled as a Single Context on its own), it is possible to join two models and run them in a Single Context.

I made some adjustments to the models to fit them into a Single Context:

  • Changed the input size of yolov5s to 640*384
  • Replaced the resnet50 model with resnet18

Joining the two models is part of the compilation flow and happens after quantization, so the join itself causes no accuracy degradation.

The two models can be joined and compiled with the following code. Compilation is a resource-allocation problem and takes a long time, so please be patient (in my case, it took about 2 hours).

from hailo_sdk_client import ClientRunner
from hailo_sdk_client import exposed_definitions, JoinAction
import re

def modify_script(script):
    pattern = r'performance_param\([^)]+\)'

    replacement = 'performance_param(compiler_optimization_level=max)'
    output_text = re.sub(pattern, replacement, script)
    output_text += "\nplatform_param(hints=[low_pcie_bandwidth])\n"
    return output_text

model_name_1 = 'yolov5s'  
model_name_2 = 'resnet18'  

# Create runner
runner1 = ClientRunner(hw_arch='hailo8', har="./yolov5s_quantized.har")
runner2 = ClientRunner(hw_arch='hailo8', har="./resnet18_quantized.har")

# print(runner2.model_script)
runner1.load_model_script(modify_script(runner1.model_script))
runner2.load_model_script(modify_script(runner2.model_script))
# Join the runners; JoinAction.NONE leaves the two networks unconnected, and the
# scope names become prefixes on each network's layer names (see parse-hef below)
runner1.join(runner2, 
             scope1_name={'yolov5s':'yolov5s'}, 
             scope2_name={'resnet18':'resnet18'}, 
             join_action=JoinAction.NONE)

hef = runner1.compile()
file_name = "joined.hef"
with open(file_name, 'wb') as f:
    f.write(hef)
print("Done!")

The partial modification of the model script (alls) in modify_script() is just an attempt to optimize the model, so it is not strictly necessary.
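
For illustration, this is what the substitution does to a made-up one-line script (the performance_param(fps=250) input is hypothetical, not from my actual model script):

example = "performance_param(fps=250)"
print(modify_script(example))
# performance_param(compiler_optimization_level=max)
# platform_param(hints=[low_pcie_bandwidth])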

$ hailortcli parse-hef joined.hef
Architecture HEF was compiled for: HAILO8
Network group name: joined_yolov5s_resnet18, Single Context
    Network name: joined_yolov5s_resnet18/resnet18
        VStream infos:
            Input  resnet18/input_layer1 UINT8, NHWC(224x224x3)
            Output resnet18/fc1 UINT8, NC(2)
    Network name: joined_yolov5s_resnet18/yolov5s
        VStream infos:
            Input  yolov5s/input_layer1 UINT8, NHWC(640x384x3)
            Output yolov5s/conv47 UINT8, FCR(80x48x24)
            Output yolov5s/conv54 UINT8, FCR(40x24x24)
            Output yolov5s/conv60 UINT8, FCR(20x12x24)
$ hailortcli benchmark joined.hef --no-power true
Starting Measurements...
Measuring FPS in hw_only mode
Network joined_yolov5s_resnet18/resnet18: 100% | 4795 | FPS: 319.64 | ETA: 00:00:00
Network joined_yolov5s_resnet18/yolov5s: 100% | 4600 | FPS: 306.64 | ETA: 00:00:00
Measuring FPS in streaming mode
Network joined_yolov5s_resnet18/resnet18: 100% | 4795 | FPS: 319.63 | ETA: 00:00:00
Network joined_yolov5s_resnet18/yolov5s: 100% | 4602 | FPS: 306.76 | ETA: 00:00:00
Measuring HW Latency
Network joined_yolov5s_resnet18/resnet18: 100% | 2405 | HW Latency: 5.73 ms | ETA: 00:00:00
Network joined_yolov5s_resnet18/yolov5s: 100% | 1888 | HW Latency: 6.59 ms | ETA: 00:00:00

=======
Summary
=======
FPS     (hw_only)                 = 313.151
        (streaming)               = 313.177
Latency (hw)                      = 6.15653 ms

After verifying the compiled model, I confirmed that it is a Single Context model containing both models. When testing with the benchmark command, I observed that 300 FPS was achieved!!

Additionally, when I ran the models alternately using the Python code, I confirmed that inference stayed fast even when alternating between the models, since no context switching is required. (Only the very first inference took time, due to weight copying.)
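
On the application side, only the input buffers of the test driver change for the joined model; the shapes below follow the parse-hef output above (and the C library is assumed to be rebuilt against the joined model's scoped vstream names):

input1 = np.zeros((640, 384, 3), dtype=np.uint8)   # yolov5s/input_layer1, NHWC 640x384x3
input2 = np.zeros((224, 224, 3), dtype=np.uint8)   # resnet18/input_layer1, NHWC 224x224x3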

(.venv) pi@raspberrypi:~/newmodel/app $ python3 test.py
====only yolo=====
yolo 41.81ms
yolo 8.76ms
yolo 7.83ms
yolo 7.85ms
yolo 7.94ms
yolo 8.13ms
yolo 7.83ms
yolo 7.81ms
yolo 7.84ms
yolo 7.88ms
====only resnet18=====
resnet18 6.43ms
resnet18 6.29ms
resnet18 6.30ms
resnet18 6.32ms
resnet18 6.30ms
resnet18 6.28ms
resnet18 6.42ms
resnet18 6.31ms
resnet18 6.33ms
resnet18 6.31ms
=====yolo + resnet18=====
yolo 7.76ms resnet18 6.40ms
yolo 7.76ms resnet18 6.26ms
yolo 7.84ms resnet18 6.31ms
yolo 7.94ms resnet18 6.23ms
yolo 8.75ms resnet18 6.23ms
yolo 7.74ms resnet18 6.30ms
yolo 7.95ms resnet18 6.28ms
yolo 7.91ms resnet18 6.23ms
yolo 7.73ms resnet18 6.15ms
yolo 7.75ms resnet18 6.23ms