Performance degradation in native model while compression=-100

Luuk_Romeijn · August 13, 2025, 9:21pm

I’m using the hailomz compile command of the Hailo model zoo to convert my Yolov8x model to a .hef file that I can run on my Hailo8 chip.

While the compiled model makes predictions similar to those of the original Ultralytics/PyTorch model, there is some degradation that I want to eliminate as much as possible.

To find the step at which most of the degradation happens, I’ve emulated the model at different stages using InferenceContext.SDK_FP_OPTIMIZED (FP optimized) and InferenceContext.SDK_QUANTIZED (quantized) for different optimization levels, and compared against the .predict method of the original YOLO model loaded from the PyTorch / ONNX file.

Also, I ran hailomz compile with optimization level -100 to obtain a .har file that should not have undergone any changes w.r.t. the original .onnx input, and extract post-processed output using InferenceContext.SDK_FP_OPTIMIZED (which is unoptimized since optimization level = -100).

The result is shown below. Quite notably:

The performance degrades significantly after only parsing.
The performance hardly changes when increasing the optimization level.

To further investigate, I’ve also looked at the InferenceContext.SDK_NATIVE output, comparing it to the output of the original PyTorch model by adding forward hooks to the corresponding end nodes. Strangely enough, the output values of these layers do not match at all, even though their shapes match.

Any thoughts on this? Specifically:

How come the performance degrades after parsing alone?
How come that, after parsing, the network seems to have substantially changed and does not return the same values anymore?
Any steps I need to double-check or verify?

nina-vilela · August 14, 2025, 11:02am

Hi @Luuk_Romeijn,

Could you please share the code that you’ve used for comparing ONNX vs Hailo Native?

Luuk_Romeijn · August 14, 2025, 3:25pm

On my GPU server, I run:

import numpy as np
import ultralytics
from hailo_sdk_client import ClientRunner, InferenceContext

def process_raw_hailo_results(raw_results):
    results = []
    for result in raw_results:
        class1_result = result[0].transpose()
        class2_result = result[1].transpose()
        # Filter out the empty predictions
        class1_result = class1_result[class1_result.sum(axis=1) != 0]
        class2_result = class2_result[class2_result.sum(axis=1) != 0]
        results.append([class1_result, class2_result])
    return results

# Inference for ONNX
model = ultralytics.models.yolo.model.YOLO(onnx_filepath, task='detect')
onnx_results = [model.predict(img)[0].boxes.data for img in imgs_list]

# Inference for Un-optimized HAR
runner = ClientRunner(har=f'{model_dir}/opt_100.har')
with runner.infer_context(InferenceContext.SDK_FP_OPTIMIZED) as ctx:
    raw_results = runner.infer(ctx, imgs_array)
hailo_results = process_raw_hailo_results(raw_results)

Then locally I parse the results as follows (included here for completeness). Basically, I scale the box coordinates to the 2144-pixel range and make sure the axis order of the Hailo output matches that of Ultralytics.

def get_ultra_preds(predictions, resize=False):
    for pred in predictions:
        if resize:
            pred[:,:4] = pred[:,:4] / 448 * 2144 # Ensure output in 2144 range
    predictions = [pred.to('cpu') for pred in predictions]
    return predictions

def get_hailo_preds(predictions):
    hailo_output = []
    for img in predictions:
        class1_detections = img[0]
        class1_detections = np.column_stack((class1_detections, np.zeros(shape=(class1_detections.shape[0]),dtype=np.float32)))
        class2_detections = img[1]
        class2_detections = np.column_stack((class2_detections, np.ones(shape=(class2_detections.shape[0]),dtype=np.float32)))
        img_detections = np.concatenate((class1_detections, class2_detections))
        img_detections[:,:4] = img_detections[:,:4]*2144
        img_detections = img_detections[:,[1,0,3,2,4,5]]
        img_detections = torch.tensor(img_detections)
        hailo_output.append(img_detections)
    return hailo_output

This gives me an output that I can input to Ultralytics built-in evaluation method:

from ultralytics.utils.metrics import ConfusionMatrix

def evaluate(predictions, references, filepaths):
    cm = ConfusionMatrix(['class1Box', 'class2Box'])
    for i in range(len(filepaths)):
        preds, refs = predictions[i], references[i]
        preds, refs = get_dict(preds), get_dict(refs)
        cm.process_batch(preds, refs, conf=0.4, iou_thres=0.8)
    return cm

ultra_preds = get_ultra_preds(onnx_results)
hailo_preds = get_hailo_preds(hailo_results)
# Evaluate each prediction set against the annotations
for preds in (ultra_preds, hailo_preds):
    results = evaluate(preds, annotation, image_paths)
    results = calculate_metrics(results)[['Accuracy', 'Precision', 'Recall']]
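
The column reordering in get_hailo_preds can be sanity-checked on a single hand-made box. This is only a minimal sketch: it assumes the Hailo NMS rows are [y_min, x_min, y_max, x_max, score] normalized to [0, 1], which is what the [1,0,3,2,4,5] permutation above implies. The helper name hailo_to_ultra and the constant IMG_SIZE are hypothetical and not part of either library.

```python
import numpy as np

IMG_SIZE = 2144  # evaluation resolution used in this thread


def hailo_to_ultra(dets: np.ndarray, class_id: int) -> np.ndarray:
    """Map rows of [y1, x1, y2, x2, score] in [0, 1] (assumed layout)
    to rows of [x1, y1, x2, y2, score, cls] in pixels."""
    cls = np.full((dets.shape[0], 1), class_id, dtype=np.float32)
    out = np.hstack((dets.astype(np.float32), cls))
    out[:, :4] *= IMG_SIZE             # normalized -> pixel coordinates
    return out[:, [1, 0, 3, 2, 4, 5]]  # swap y/x pairs, keep score and class


# One box covering the lower-right part of the image, score 0.9.
box = np.array([[0.25, 0.5, 0.75, 1.0, 0.9]], dtype=np.float32)
converted = hailo_to_ultra(box, class_id=1)
print(converted)
```

If the printed x and y values come out swapped relative to the Ultralytics boxes for the same object, the assumed input layout is wrong and the permutation needs to change.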

Luuk_Romeijn · August 14, 2025, 4:25pm

I just realized that what I shared doesn’t actually answer your question.

Here’s what I use to compare the original PyTorch model to SDK_NATIVE:

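layer_outputs = {}  # filled by the forward hooks, keyed by ONNX end-node name

def get_layer_output(name):
    def hook(model, input, output):
        output = output.detach().cpu().numpy()
        # PyTorch activations are NCHW; permute (not reshape) to Hailo's NHWC
        output = output.transpose(0, 2, 3, 1)
        layer_outputs[name] = output
    return hook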

# PyTorch inference
model = ultralytics.models.yolo.model.YOLO(pt_filepath).to('cpu')
model.model.model[22].cv2[0][2].register_forward_hook(get_layer_output('/model.22/cv2.0/cv2.0.2/Conv'))
model.model.model[22].cv3[0][2].register_forward_hook(get_layer_output('/model.22/cv3.0/cv3.0.2/Conv'))
model.model.model[22].cv2[1][2].register_forward_hook(get_layer_output('/model.22/cv2.1/cv2.1.2/Conv'))
model.model.model[22].cv3[1][2].register_forward_hook(get_layer_output('/model.22/cv3.1/cv3.1.2/Conv'))
model.model.model[22].cv2[2][2].register_forward_hook(get_layer_output('/model.22/cv2.2/cv2.2.2/Conv'))
model.model.model[22].cv3[2][2].register_forward_hook(get_layer_output('/model.22/cv3.2/cv3.2.2/Conv'))
results = [result.boxes.data for result in model.predict(imgs_list)]

# Parsed har inference
runner = ClientRunner(har=f'{model_dir}/opt100.har')
with runner.infer_context(InferenceContext.SDK_NATIVE) as ctx:
    raw_results = runner.infer(ctx, imgs_array)
for hailo_result, output_layer in zip(raw_results, output_layers):
    ultra_result = layer_outputs[output_layer]
    print(ultra_result.mean(), ultra_result.std())
    print(hailo_result.mean(), hailo_result.std())
    print(f"MAE({output_layer}):", np.abs(hailo_result - ultra_result).mean())

The correct end node configuration I get from this line of hailomz compile:

2025-08-13 13:18:28,579 - INFO - parser.py:379 - End nodes mapped from original model: '/model.22/cv2.0/cv2.0.2/Conv', '/model.22/cv3.0/cv3.0.2/Conv', '/model.22/cv2.1/cv2.1.2/Conv', '/model.22/cv3.1/cv3.1.2/Conv', '/model.22/cv2.2/cv2.2.2/Conv', '/model.22/cv3.2/cv3.2.2/Conv'.

The output is as follows:

1.0022682 2.7083056
0.99228346 13.710399
MAE(/model.22/cv2.0/cv2.0.2/Conv): 9.450299
-18.617985 6.3078165
-28.371376 24.733833
MAE(/model.22/cv3.0/cv3.0.2/Conv): 20.06694
1.0001894 1.3281859
0.99943984 3.1865597
MAE(/model.22/cv2.1/cv2.1.2/Conv): 2.5961354
-14.350415 2.342508
-40.893635 26.539469
MAE(/model.22/cv3.1/cv3.1.2/Conv): 28.249865
1.0000402 0.8874368
0.997231 2.039753
MAE(/model.22/cv2.2/cv2.2.2/Conv): 1.6544698
-11.736422 0.96825224
-14.992005 3.7051718
MAE(/model.22/cv3.2/cv3.2.2/Conv): 3.9806633
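
One thing worth checking in a comparison like this is the layout conversion in the forward hook. PyTorch activations are NCHW and the Hailo emulator returns NHWC, and going from one to the other requires permuting axes. A plain reshape to the target shape gives the same shape, mean and std, but puts values at the wrong positions, which inflates any element-wise error such as the MAE. A minimal NumPy sketch on a toy tensor (not the actual model outputs):

```python
import numpy as np

# Toy NCHW activation: batch 1, 2 channels, height 2, width 3.
nchw = np.arange(12, dtype=np.float32).reshape(1, 2, 2, 3)

# Permuting axes keeps each value attached to its (channel, row, column) index.
nhwc_transposed = nchw.transpose(0, 2, 3, 1)

# Reshaping only reinterprets the flat buffer, so values land at the wrong indices.
nhwc_reshaped = nchw.reshape(1, 2, 3, 2)

same_shape = nhwc_transposed.shape == nhwc_reshaped.shape
same_values = np.array_equal(nhwc_transposed, nhwc_reshaped)
mae = np.abs(nhwc_transposed - nhwc_reshaped).mean()

print("same shape:", same_shape)
print("same values:", same_values)
print("MAE between the two layouts:", mae)
```

Because the summary statistics are identical in both cases, matching means alone do not confirm that the two tensors are aligned.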

nina-vilela · August 17, 2025, 7:57am

What’s the difference between imgs_array and imgs_list?

Also, I’m sending you a PM.

Luuk_Romeijn · August 18, 2025, 8:01am

They’re basically the same, except Ultralytics takes lists of arrays as input whereas for Hailo I need to stack the arrays into a single large one:

imgs_list = [cv2.resize(cv2.imread(val_dir / img_path), (448, 448)) for img_path in sorted(os.listdir(val_dir)) if img_path.endswith('jpg')]

imgs_array = np.stack(imgs_list)

nina-vilela · August 18, 2025, 1:55pm

Just to confirm, PyTorch uses NCHW but Hailo takes NHWC. Are you doing that input conversion?
And I do not see normalization in your pipeline. Is that because the training did not use it?

shashi · August 18, 2025, 3:05pm

Hi @Luuk_Romeijn

OpenCV reads images in BGR format, and Ultralytics does the conversion from BGR to RGB inside its predict call. You should send RGB data into the predict call.

@nina-vilela I think NCHW vs NHWC should not be an issue here, as predict would have failed if a tensor of the wrong shape had been sent. Also, cv2’s imread returns HWC arrays, so the stacked batch is already NHWC.

nina-vilela · August 18, 2025, 3:21pm

@shashi Good catch on the BGR.

I double-checked, and Ultralytics takes care of converting to channel-first during preprocessing, so indeed no issue there.

Luuk_Romeijn · August 19, 2025, 10:04am

Thanks! Turns out the issue was indeed that cv2.imread returns BGR while the Hailo model expects RGB. Interestingly, Ultralytics takes BGR input and converts it to RGB under the hood before sending it through the model.
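
For anyone hitting the same thing, the fix amounts to reversing the channel axis of the batch before it goes to the Hailo runner, while leaving the list passed to Ultralytics in BGR. A minimal sketch with NumPy only; the helper name bgr_to_rgb is hypothetical, and the tiny array stands in for np.stack(imgs_list):

```python
import numpy as np


def bgr_to_rgb(images: np.ndarray) -> np.ndarray:
    """Reverse the last (channel) axis of an HWC or NHWC image array."""
    return np.ascontiguousarray(images[..., ::-1])


# Stand-in batch: one 1x2 image with a pure-blue and a pure-red pixel, in BGR order.
imgs_array_bgr = np.array([[[[255, 0, 0], [0, 0, 255]]]], dtype=np.uint8)
imgs_array_rgb = bgr_to_rgb(imgs_array_bgr)

print(imgs_array_rgb[0, 0, 0])  # blue pixel, now in RGB order
print(imgs_array_rgb[0, 0, 1])  # red pixel, now in RGB order
```

With OpenCV available, cv2.cvtColor(img, cv2.COLOR_BGR2RGB) does the same conversion per image. Only the array given to runner.infer should be converted; the images passed to model.predict stay in BGR, since Ultralytics converts them itself.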

Here’s the new plot:

I’m happy to see that the precision stays the same. Based on inspecting some images, I think the slight recall drop is caused by tiny differences in the NMS implementation between Hailo and Ultralytics, even though I made sure to set the values for NMS_IOU and NMS_CONFIDENCE the same.