Custom YOLO arch (YOLOv5-P2, for small objects) - optimized model getting 0 mAP

I’ve trained a YOLOv5-P2 model on VisDrone and get decent results (mAP@50 = 0.6).

But I’m unable to port this over to Hailo.

The ONNX output looks like this:

netron screenshot

I use that to derive this parsing code:

runner = ClientRunner(hw_arch=chosen_hw_arch)
hn, npz = runner.translate_onnx_model(
    onnx_path,
    onnx_model_name,
    start_node_names=["/model.0/conv/Conv"],
    end_node_names=["/model.31/m.3/Conv", 
                   "/model.31/m.2/Conv",
                   "/model.31/m.1/Conv",
                   "/model.31/m.0/Conv"],
    net_input_shapes={"/model.0/conv/Conv": [1, 3, 640, 640]},
)

This succeeds, and when I check runner.get_hn() I see these output layers:

output layers in parsed model
('yolov5np2_visdrone/output_layer1',
              OrderedDict([('type', 'output_layer'),
                           ('input', ['yolov5np2_visdrone/conv89']),
                           ('output', []),
                           ('input_shapes', [[-1, 160, 160, 18]]),
                           ('output_shapes', [[-1, 160, 160, 18]]),
                           ('original_names', ['out']),
                           ('compilation_params', {}),
                           ('quantization_params', {}),
                           ('transposed', False),
                           ('engine', 'nn_core'),
                           ('io_type', 'standard')])),
             ('yolov5np2_visdrone/output_layer2',
              OrderedDict([('type', 'output_layer'),
                           ('input', ['yolov5np2_visdrone/conv99']),
                           ('output', []),
                           ('input_shapes', [[-1, 80, 80, 18]]),
                           ('output_shapes', [[-1, 80, 80, 18]]),
                           ('original_names', ['out']),
                           ('compilation_params', {}),
                           ('quantization_params', {}),
                           ('transposed', False),
                           ('engine', 'nn_core'),
                           ('io_type', 'standard')])),
             ('yolov5np2_visdrone/output_layer3',
              OrderedDict([('type', 'output_layer'),
                           ('input', ['yolov5np2_visdrone/conv111']),
                           ('output', []),
                           ('input_shapes', [[-1, 40, 40, 18]]),
                           ('output_shapes', [[-1, 40, 40, 18]]),
                           ('original_names', ['out']),
                           ('compilation_params', {}),
                           ('quantization_params', {}),
                           ('transposed', False),
                           ('engine', 'nn_core'),
                           ('io_type', 'standard')])),
             ('yolov5np2_visdrone/output_layer4',
              OrderedDict([('type', 'output_layer'),
                           ('input', ['yolov5np2_visdrone/conv121']),
                           ('output', []),
                           ('input_shapes', [[-1, 20, 20, 18]]),
                           ('output_shapes', [[-1, 20, 20, 18]]),
                           ('original_names', ['out']),
                           ('compilation_params', {}),
                           ('quantization_params', {}),
                           ('transposed', False),
                           ('engine', 'nn_core'),
                           ('io_type', 'standard')]))])
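As a sanity check on these shapes: the 18 output channels match the YOLOv5 head layout of num_anchors × (4 box coords + 1 objectness + num_classes) with 3 anchors and a single class. A quick sketch (the grid sizes are taken from the HN dump above):

```python
# Check the parsed output shapes against the YOLOv5 head layout:
# channels = num_anchors * (4 box coords + 1 objectness + num_classes)
num_anchors = 3
num_classes = 1  # single-class VisDrone model

expected_channels = num_anchors * (5 + num_classes)
print(expected_channels)  # 18, matching the [-1, H, W, 18] shapes above

# Grid sizes reported by the parser for the four output layers (640x640 input)
grids = {"conv89": 160, "conv99": 80, "conv111": 40, "conv121": 20}
for layer, g in grids.items():
    print(f"{layer}: {g}x{g} grid -> {g * g * num_anchors} candidate boxes")
```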

================================================================================

My next step was creating an NMS config, which I did by first extracting the anchors from the ONNX model. With 4 output nodes I assumed I’d need 4 decoders. This is how my config file looks:

nms config file
{
    "nms_scores_th": 0.001,
    "nms_iou_th": 0.6,
    "image_dims": [
        640,
        640
    ],
    "max_proposals_per_class": 100,
    "background_removal": false,
    "classes": 1,
    "bbox_decoders": [
        {
            "name": "bbox_decoder89",
            "w": [
                2.01172,
                2.68945,
                4.41016
            ],
            "h": [
                3.97266,
                5.95312,
                5.50000
            ],
            "stride": 8,
            "encoded_layer": "conv89"
        },
        {
            "name": "bbox_decoder99",
            "w": [
                3.53125,
                5.31641,
                5.08594
            ],
            "h": [
                8.80469,
                8.64062,
                12.39062
            ],
            "stride": 16,
            "encoded_layer": "conv99"
        },
        {
            "name": "bbox_decoder111",
            "w": [
                8.03906,
                6.73438,
                9.27344
            ],
            "h": [
                10.12500,
                15.82812,
                17.54688
            ],
            "stride": 32,
            "encoded_layer": "conv111"
        },
        {
            "name": "bbox_decoder121",
            "w": [
                11.79688,
                15.61719,
                34.09375
            ],
            "h": [
                21.31250,
                29.06250,
                36.06250
            ],
            "stride": 64,
            "encoded_layer": "conv121"
        }
    ]
}
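For context on what these anchor/stride values feed into: my understanding is that the yolov5 postprocess decodes each cell like the standard Ultralytics v6 head, with the anchors here assumed to be in pixels (as in the exported anchor_grid), judging by their magnitudes. A minimal NumPy sketch of that decode, not Hailo’s exact implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_cell(t, cx, cy, anchor_w, anchor_h, stride):
    """Standard YOLOv5 (v6) decode for one anchor at grid cell (cx, cy).

    t is the raw [tx, ty, tw, th] for that anchor; anchor_w/anchor_h are
    assumed to be in pixels, matching the w/h lists in the config above."""
    tx, ty, tw, th = sigmoid(np.asarray(t, dtype=float))
    bx = (2.0 * tx - 0.5 + cx) * stride   # box center x, in pixels
    by = (2.0 * ty - 0.5 + cy) * stride   # box center y, in pixels
    bw = (2.0 * tw) ** 2 * anchor_w       # box width, in pixels
    bh = (2.0 * th) ** 2 * anchor_h       # box height, in pixels
    return bx, by, bw, bh

# Raw zeros at cell (10, 10) with the first bbox_decoder89 anchor:
print(decode_cell([0, 0, 0, 0], 10, 10, 2.01172, 3.97266, 8))
# -> (84.0, 84.0, 2.01172, 3.97266)
```

If the anchors were instead in grid units (the checkpoint stores anchors divided by the stride), bw/bh would also need a multiply by the stride, so getting the units consistent between the config and the decoder matters.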

Finally, the alls script:

alls script
alls = """
normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(sigmoid)
model_optimization_config(calibration, batch_size=8, calibset_size=64)
post_quantization_optimization(finetune, policy=enabled, learning_rate=0.00001, epochs=4, batch_size=8, dataset_size=1024)
nms_postprocess("/local/shared_with_docker/visdrone/postprocess_config/yolov5np2v6_nms_config_custom.json", yolov5, engine=cpu)
performance_param(compiler_optimization_level=max)
allocator_param(width_splitter_defuse=disabled)

"""
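For reference, my understanding is that the normalization command computes (x - mean) / std on-chip, so means of 0 and stds of 255 map 8-bit RGB inputs into [0, 1], matching the Ultralytics preprocessing. A quick sketch of that arithmetic:

```python
import numpy as np

# normalization([0, 0, 0], [255, 255, 255]) should correspond to
# (x - mean) / std, i.e. scaling 8-bit RGB into [0, 1]
# (my understanding of the model-script command).
mean = np.array([0.0, 0.0, 0.0])
std = np.array([255.0, 255.0, 255.0])

pixel = np.array([0.0, 128.0, 255.0])
normalized = (pixel - mean) / std
print(normalized)  # 0.0, ~0.502, 1.0
```

This means the calibration/inference images should be fed in as raw 0-255 values, since the scaling happens inside the model.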

However, when I run inference, the total detection counts look plausible, but the detections themselves are all wrong and I get mAP@50 = 0 :confused:

For example, in this image:

bad detections

Any ideas where I’m going wrong here?