MobileNetV3-SSDLite incorrect inference result

We converted the torchvision MobileNetV3-SSDLite model, pretrained on the COCO dataset, to HEF.
However, the detection results on real images are poor.
NMS configuration and detection results are attached.

{
    "nms_scores_th": 0.001,
    "nms_iou_th": 0.55,
    "max_proposals_per_class": 20,
    "image_dims": [
        320,
        320
    ],
    "centers_scale_factor": 10,
    "bbox_dimensions_scale_factor": 5,
    "classes": 91,
    "background_removal": true,
    "background_removal_index": 0,
    "bbox_decoders": [
        {
            "name": "bbox_decoder26",
            "h": [
                0.2,
                0.264575131106459,
                0.1414213562373095,
                0.28284271247461906,
                0.11547005383792516,
                0.34641016151377546
            ],
            "w": [
                0.2,
                0.264575131106459,
                0.28284271247461906,
                0.1414213562373095,
                0.34641016151377546,
                0.11547005383792516
            ],
            "reg_layer": "conv27",
            "cls_layer": "conv26"
        },
        {
            "name": "bbox_decoder35",
            "h": [
                0.35,
                0.4183300132670378,
                0.2474873734152916,
                0.4949747468305833,
                0.20207259421636903,
                0.606217782649107
            ],
            "w": [
                0.35,
                0.4183300132670378,
                0.4949747468305833,
                0.2474873734152916,
                0.606217782649107,
                0.20207259421636903
            ],
            "reg_layer": "conv36",
            "cls_layer": "conv35"
        },
        {
            "name": "bbox_decoder39",
            "h": [
                0.5,
                0.570087712549569,
                0.35355339059327373,
                0.7071067811865476,
                0.2886751345948129,
                0.8660254037844386
            ],
            "w": [
                0.5,
                0.570087712549569,
                0.7071067811865476,
                0.35355339059327373,
                0.8660254037844386,
                0.2886751345948129
            ],
            "reg_layer": "conv40",
            "cls_layer": "conv39"
        },
        {
            "name": "bbox_decoder43",
            "h": [
                0.65,
                0.7211102550927979,
                0.4596194077712559,
                0.9192388155425119,
                0.37527767497325676,
                1.12583302491977
            ],
            "w": [
                0.65,
                0.7211102550927979,
                0.9192388155425119,
                0.4596194077712559,
                1.12583302491977,
                0.37527767497325676
            ],
            "reg_layer": "conv44",
            "cls_layer": "conv43"
        },
        {
            "name": "bbox_decoder47",
            "h": [
                0.8,
                0.8717797887081347,
                0.565685424949238,
                1.1313708498984762,
                0.46188021535170065,
                1.3856406460551018
            ],
            "w": [
                0.8,
                0.8717797887081347,
                1.1313708498984762,
                0.565685424949238,
                1.3856406460551018,
                0.46188021535170065
            ],
            "reg_layer": "conv48",
            "cls_layer": "conv47"
        },
        {
            "name": "bbox_decoder50",
            "h": [
                0.95,
                0.9746794344808963,
                0.67175144212722,
                1.3435028842544403,
                0.5484827557301445,
                1.6454482671904334
            ],
            "w": [
                0.95,
                0.9746794344808963,
                1.3435028842544403,
                0.67175144212722,
                1.6454482671904334,
                0.5484827557301445
            ],
            "reg_layer": "conv51",
            "cls_layer": "conv50"
        }
    ]
}

The incorrect detections are reported with confidence scores of 0.99 or higher.

Your help would be appreciated.

Hey @Youngwook_Kwon ,

  • From the config, these issues might be breaking detection (and how to fix them):

1. Output scaling mismatch
Your current config has:

  • centers_scale_factor: 10
  • bbox_dimensions_scale_factor: 5

These correspond to torchvision's SSD box-coder weights (10, 10, 5, 5), so they are only correct if the exported model emits raw, undecoded regression outputs. If the box decoding was folded into the model during export, the outputs are already normalized and applying the factors again will corrupt every box.

  • In that case, set both scaling factors to 1.
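
For intuition, here is a minimal sketch (illustrative names, not the actual Hailo postprocess implementation) of where the two factors enter SSD box decoding: they divide the raw regression outputs before the anchor is applied.

```python
import math

# Minimal SSD box decode; illustrative only. The two scale factors divide
# the raw regression outputs before the anchor is applied, mirroring
# torchvision's box-coder weights (10, 10, 5, 5).
def decode_box(reg, anchor, centers_scale=10.0, dims_scale=5.0):
    dx, dy, dw, dh = reg
    acx, acy, aw, ah = anchor          # anchor center x/y, width, height
    cx = dx / centers_scale * aw + acx
    cy = dy / centers_scale * ah + acy
    w = math.exp(dw / dims_scale) * aw
    h = math.exp(dh / dims_scale) * ah
    return cx, cy, w, h

# A zero regression output must reproduce the anchor exactly:
print(decode_box((0, 0, 0, 0), (0.5, 0.5, 0.2, 0.1)))  # (0.5, 0.5, 0.2, 0.1)
```

If the factors are wrong by 10x/5x in either direction, every predicted center and size is off, which matches the "boxes everywhere with high scores" symptom.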

2. Anchor box / aspect ratio mismatch
You set the anchor sizes manually. We recommend matching your decoder anchors to the original torchvision configuration.
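
Incidentally, the h/w lists in your original config already follow torchvision's DefaultBoxGenerator pattern (scales 0.2 → 0.95, aspect ratios 2 and 3). A small sketch (the function name is mine) that reproduces the first decoder's lists:

```python
import math

# Reproduce the per-level anchor h/w lists in the DefaultBoxGenerator style:
# base scale, geometric mean with the next scale, then one pair of boxes
# per aspect ratio (ar and 1/ar).
def default_box_hw(scale, next_scale, aspect_ratios=(2, 3)):
    h = [scale, math.sqrt(scale * next_scale)]
    w = [scale, math.sqrt(scale * next_scale)]
    for ar in aspect_ratios:
        r = math.sqrt(ar)
        w += [scale * r, scale / r]
        h += [scale / r, scale * r]
    return h, w

h, w = default_box_hw(0.2, 0.35)
# h starts with 0.2, 0.2645..., 0.1414..., 0.2828... (bbox_decoder26's values)
```

So the anchors in the attached config look consistent with torchvision; verify the ordering of the six boxes matches how the regression layer was trained.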


3. Background removal mismatch
You have:

  • "background_removal": true
  • "background_removal_index": 0

This is only correct if the background class is really index 0; double-check your model config just in case.


4. NMS score threshold too low
Setting nms_scores_th=0.001 allows way too many low-confidence boxes to pass into NMS.

  • Increase it to 0.05 or 0.1.
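
A toy illustration of the difference (the scores below are made up):

```python
# Made-up candidate scores to illustrate the pre-NMS filtering step.
scores = [0.002, 0.01, 0.04, 0.07, 0.93]

kept_low  = [s for s in scores if s >= 0.001]  # all 5 survive and flood NMS
kept_high = [s for s in scores if s >= 0.05]   # only [0.07, 0.93] survive
```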

5. Input normalization mismatch
Make sure you normalize images exactly like torchvision expects:

from torchvision import transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

Also, make sure inputs are RGB, not BGR.


Here’s a config you could try

{
    "nms_scores_th": 0.05,
    "nms_iou_th": 0.5,
    "max_proposals_per_class": 100,
    "image_dims": [320, 320],
    "centers_scale_factor": 1,
    "bbox_dimensions_scale_factor": 1,
    "classes": 91,
    "background_removal": true,
    "background_removal_index": 0,
    "bbox_decoders": [
        {
            "name": "bbox_decoder26",
            "h": [0.1, 0.2, 0.2, 0.1, 0.15, 0.15],
            "w": [0.1, 0.2, 0.2, 0.1, 0.15, 0.15],
            "reg_layer": "conv27",
            "cls_layer": "conv26"
        },
        {
            "name": "bbox_decoder35",
            "h": [0.2, 0.3, 0.3, 0.2, 0.25, 0.25],
            "w": [0.2, 0.3, 0.3, 0.2, 0.25, 0.25],
            "reg_layer": "conv36",
            "cls_layer": "conv35"
        },
        {
            "name": "bbox_decoder39",
            "h": [0.35, 0.45, 0.45, 0.35, 0.4, 0.4],
            "w": [0.35, 0.45, 0.45, 0.35, 0.4, 0.4],
            "reg_layer": "conv40",
            "cls_layer": "conv39"
        },
        {
            "name": "bbox_decoder43",
            "h": [0.5, 0.6, 0.6, 0.5, 0.55, 0.55],
            "w": [0.5, 0.6, 0.6, 0.5, 0.55, 0.55],
            "reg_layer": "conv44",
            "cls_layer": "conv43"
        },
        {
            "name": "bbox_decoder47",
            "h": [0.65, 0.75, 0.75, 0.65, 0.7, 0.7],
            "w": [0.65, 0.75, 0.75, 0.65, 0.7, 0.7],
            "reg_layer": "conv48",
            "cls_layer": "conv47"
        },
        {
            "name": "bbox_decoder50",
            "h": [0.8, 0.9, 0.9, 0.8, 0.85, 0.85],
            "w": [0.8, 0.9, 0.9, 0.8, 0.85, 0.85],
            "reg_layer": "conv51",
            "cls_layer": "conv50"
        }
    ]
}

Make sure:

  • Input normalization uses [0.485, 0.456, 0.406] mean and [0.229, 0.224, 0.225] std.
  • Image input is RGB.

If normalization or color is wrong, your detections will still be bad even if anchors and scaling are correct.