MobileNetV3-SSDLite incorrect inference result

We converted the torchvision MobileNetV3-SSDLite model, pretrained on the COCO dataset, to HEF.
However, the detection results on real images are poor.
NMS configuration and detection results are attached.

{
    "nms_scores_th": 0.001,
    "nms_iou_th": 0.55,
    "max_proposals_per_class": 20,
    "image_dims": [
        320,
        320
    ],
    "centers_scale_factor": 10,
    "bbox_dimensions_scale_factor": 5,
    "classes": 91,
    "background_removal": true,
    "background_removal_index": 0,
    "bbox_decoders": [
        {
            "name": "bbox_decoder26",
            "h": [
                0.2,
                0.264575131106459,
                0.1414213562373095,
                0.28284271247461906,
                0.11547005383792516,
                0.34641016151377546
            ],
            "w": [
                0.2,
                0.264575131106459,
                0.28284271247461906,
                0.1414213562373095,
                0.34641016151377546,
                0.11547005383792516
            ],
            "reg_layer": "conv27",
            "cls_layer": "conv26"
        },
        {
            "name": "bbox_decoder35",
            "h": [
                0.35,
                0.4183300132670378,
                0.2474873734152916,
                0.4949747468305833,
                0.20207259421636903,
                0.606217782649107
            ],
            "w": [
                0.35,
                0.4183300132670378,
                0.4949747468305833,
                0.2474873734152916,
                0.606217782649107,
                0.20207259421636903
            ],
            "reg_layer": "conv36",
            "cls_layer": "conv35"
        },
        {
            "name": "bbox_decoder39",
            "h": [
                0.5,
                0.570087712549569,
                0.35355339059327373,
                0.7071067811865476,
                0.2886751345948129,
                0.8660254037844386
            ],
            "w": [
                0.5,
                0.570087712549569,
                0.7071067811865476,
                0.35355339059327373,
                0.8660254037844386,
                0.2886751345948129
            ],
            "reg_layer": "conv40",
            "cls_layer": "conv39"
        },
        {
            "name": "bbox_decoder43",
            "h": [
                0.65,
                0.7211102550927979,
                0.4596194077712559,
                0.9192388155425119,
                0.37527767497325676,
                1.12583302491977
            ],
            "w": [
                0.65,
                0.7211102550927979,
                0.9192388155425119,
                0.4596194077712559,
                1.12583302491977,
                0.37527767497325676
            ],
            "reg_layer": "conv44",
            "cls_layer": "conv43"
        },
        {
            "name": "bbox_decoder47",
            "h": [
                0.8,
                0.8717797887081347,
                0.565685424949238,
                1.1313708498984762,
                0.46188021535170065,
                1.3856406460551018
            ],
            "w": [
                0.8,
                0.8717797887081347,
                1.1313708498984762,
                0.565685424949238,
                1.3856406460551018,
                0.46188021535170065
            ],
            "reg_layer": "conv48",
            "cls_layer": "conv47"
        },
        {
            "name": "bbox_decoder50",
            "h": [
                0.95,
                0.9746794344808963,
                0.67175144212722,
                1.3435028842544403,
                0.5484827557301445,
                1.6454482671904334
            ],
            "w": [
                0.95,
                0.9746794344808963,
                1.3435028842544403,
                0.67175144212722,
                1.6454482671904334,
                0.5484827557301445
            ],
            "reg_layer": "conv51",
            "cls_layer": "conv50"
        }
    ]
}

The incorrect detections are reported with confidence scores of 0.99 or higher.

Your help would be appreciated.

Hey @Youngwook_Kwon ,

  • From the config, these issues might be breaking detection (and how to fix them):

1. Output scaling mismatch
Your current config has:

  • centers_scale_factor: 10
  • bbox_dimensions_scale_factor: 5

These correspond to torchvision's SSD box-coder weights (10, 10, 5, 5), so they are only correct if the exported model emits raw, undecoded regression outputs. If the box decoding was folded into the model during export, the outputs are already normalized and applying the factors again will corrupt every box.

  • In that case, set both scaling factors to 1.
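
For intuition, here is a minimal sketch (illustrative names, not the actual Hailo postprocess implementation) of where the two factors enter SSD box decoding: they divide the raw regression outputs before the anchor is applied.

```python
import math

# Minimal SSD box decode; illustrative only. The two scale factors divide
# the raw regression outputs before the anchor is applied, mirroring
# torchvision's box-coder weights (10, 10, 5, 5).
def decode_box(reg, anchor, centers_scale=10.0, dims_scale=5.0):
    dx, dy, dw, dh = reg
    acx, acy, aw, ah = anchor          # anchor center x/y, width, height
    cx = dx / centers_scale * aw + acx
    cy = dy / centers_scale * ah + acy
    w = math.exp(dw / dims_scale) * aw
    h = math.exp(dh / dims_scale) * ah
    return cx, cy, w, h

# A zero regression output must reproduce the anchor exactly:
print(decode_box((0, 0, 0, 0), (0.5, 0.5, 0.2, 0.1)))  # (0.5, 0.5, 0.2, 0.1)
```

If the factors are wrong by 10x/5x in either direction, every predicted center and size is off, which matches the "boxes everywhere with high scores" symptom.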

2. Anchor box / aspect ratio mismatch
You set the anchor sizes manually. We recommend matching your decoder anchors to the original torchvision configuration.
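
Incidentally, the h/w lists in your original config already follow torchvision's DefaultBoxGenerator pattern (scales 0.2 → 0.95, aspect ratios 2 and 3). A small sketch (the function name is mine) that reproduces the first decoder's lists:

```python
import math

# Reproduce the per-level anchor h/w lists in the DefaultBoxGenerator style:
# base scale, geometric mean with the next scale, then one pair of boxes
# per aspect ratio (ar and 1/ar).
def default_box_hw(scale, next_scale, aspect_ratios=(2, 3)):
    h = [scale, math.sqrt(scale * next_scale)]
    w = [scale, math.sqrt(scale * next_scale)]
    for ar in aspect_ratios:
        r = math.sqrt(ar)
        w += [scale * r, scale / r]
        h += [scale / r, scale * r]
    return h, w

h, w = default_box_hw(0.2, 0.35)
# h starts with 0.2, 0.2645..., 0.1414..., 0.2828... (bbox_decoder26's values)
```

So the anchors in the attached config look consistent with torchvision; verify the ordering of the six boxes matches how the regression layer was trained.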


3. Background removal mismatch
You have:

  • "background_removal": true
  • "background_removal_index": 0

This is only correct if the background class is really index 0; double-check your model config just in case.


4. NMS score threshold too low
Setting nms_scores_th=0.001 allows way too many low-confidence boxes to pass into NMS.

  • Increase it to 0.05 or 0.1.
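
A toy illustration of the difference (the scores below are made up):

```python
# Made-up candidate scores to illustrate the pre-NMS filtering step.
scores = [0.002, 0.01, 0.04, 0.07, 0.93]

kept_low  = [s for s in scores if s >= 0.001]  # all 5 survive and flood NMS
kept_high = [s for s in scores if s >= 0.05]   # only [0.07, 0.93] survive
```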

5. Input normalization mismatch
Make sure you normalize images exactly like torchvision expects:

from torchvision import transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

Also, make sure inputs are RGB, not BGR.


Here’s a config you could try

{
    "nms_scores_th": 0.05,
    "nms_iou_th": 0.5,
    "max_proposals_per_class": 100,
    "image_dims": [320, 320],
    "centers_scale_factor": 1,
    "bbox_dimensions_scale_factor": 1,
    "classes": 91,
    "background_removal": true,
    "background_removal_index": 0,
    "bbox_decoders": [
        {
            "name": "bbox_decoder26",
            "h": [0.1, 0.2, 0.2, 0.1, 0.15, 0.15],
            "w": [0.1, 0.2, 0.2, 0.1, 0.15, 0.15],
            "reg_layer": "conv27",
            "cls_layer": "conv26"
        },
        {
            "name": "bbox_decoder35",
            "h": [0.2, 0.3, 0.3, 0.2, 0.25, 0.25],
            "w": [0.2, 0.3, 0.3, 0.2, 0.25, 0.25],
            "reg_layer": "conv36",
            "cls_layer": "conv35"
        },
        {
            "name": "bbox_decoder39",
            "h": [0.35, 0.45, 0.45, 0.35, 0.4, 0.4],
            "w": [0.35, 0.45, 0.45, 0.35, 0.4, 0.4],
            "reg_layer": "conv40",
            "cls_layer": "conv39"
        },
        {
            "name": "bbox_decoder43",
            "h": [0.5, 0.6, 0.6, 0.5, 0.55, 0.55],
            "w": [0.5, 0.6, 0.6, 0.5, 0.55, 0.55],
            "reg_layer": "conv44",
            "cls_layer": "conv43"
        },
        {
            "name": "bbox_decoder47",
            "h": [0.65, 0.75, 0.75, 0.65, 0.7, 0.7],
            "w": [0.65, 0.75, 0.75, 0.65, 0.7, 0.7],
            "reg_layer": "conv48",
            "cls_layer": "conv47"
        },
        {
            "name": "bbox_decoder50",
            "h": [0.8, 0.9, 0.9, 0.8, 0.85, 0.85],
            "w": [0.8, 0.9, 0.9, 0.8, 0.85, 0.85],
            "reg_layer": "conv51",
            "cls_layer": "conv50"
        }
    ]
}

Make sure:

  • Input normalization uses [0.485, 0.456, 0.406] mean and [0.229, 0.224, 0.225] std.
  • Image input is RGB.

If normalization or color is wrong, your detections will still be bad even if anchors and scaling are correct.