ONNX parser maps 5D input [1,6,3,256,704] to [256,704,3] in HAR conversion

Hello,

I am trying to parse my ONNX model with Hailo parser, but I encountered an issue with input shape mapping.
My ONNX model has the following graph input:

image: [1,6,3,256,704]

This represents:

  • N = 1 (batch)

  • V = 6 (camera views)

  • C = 3 (channels)

  • H = 256

  • W = 704

However, when I run the parser, the HAR mapping changes this input to [256,704,3].
It seems that the parser ignores the leading dimensions (1,6,3) and only keeps the last three dimensions as if it were an NHWC input.

Here is the command and log output I used:

hailo parser onnx fastbev_pre_trt.onnx --har-path ./har_combine/bev_feature.har
--start-node-names Reshape_0 --tensor-shapes Reshape_0=[6,3,256,704]

[info] Translation started on ONNX model fastbev_pre_trt
[info] Restored ONNX model fastbev_pre_trt (completion time: 00:00:00.20)
[info] Extracted ONNXRuntime meta-data for Hailo model (completion time: 00:00:00.68)
[info] Start nodes mapped from original model: 'image': 'fastbev_pre_trt/input_layer1'.
[info] End nodes mapped from original model: 'Conv_137'.
[info] Translation completed on ONNX model fastbev_pre_trt (completion time: 00:00:01.63)

hailo_model_optimization.acceleras.utils.acceleras_exceptions.BadInputsShape:
Data shape (6, 256, 704, 3) for layer fastbev_pre_trt/input_layer1
doesn’t match network’s input shape (256, 704, 3)

So although my ONNX clearly specifies [1,6,3,256,704], the parser maps the input to [256,704,3] in the HAR.
This causes shape mismatch when I try to run optimization with calibration data in [N,6,3,256,704] format.

My questions:

  1. Does the current Hailo parser support 5D inputs (e.g., [1,6,3,256,704]) as ONNX graph inputs?

  2. If not, what is the recommended way to preserve the 6-view dimension without unfolding it into batch (since batching increases latency significantly in my application)?

  3. Is there a known workaround (e.g., channel merging, custom reshape at the graph input) to force the parser to keep [1,6,3,256,704]?

Thank you.

Hey @Donghyeok_Min,

I see you’ve hit a wall with the Hailo parser - unfortunately it doesn’t handle 5D ONNX inputs like [N, V, C, H, W] right now. The parser is built for 4D image tensors ([N, C, H, W] or [N, H, W, C]) and just drops any extra dimensions it finds, which explains why your [1,6,3,256,704] input got squeezed down to [256,704,3].

Here’s what I can tell you:

1. Can the parser handle 5D inputs? Nope, not currently. The DFC is limited to 4D input tensors (batch plus image dimensions). Any additional axes like “views” or sequence dimensions get lost at the graph input level.

2. How can you keep that 6-view dimension without batching? Since true 5D support isn’t there, you’ll need to fold that view dimension into one of the existing axes before parsing.

3. Any workarounds available? I’d suggest a couple of approaches:

  • Channel merging: Reshape your [1,6,3,256,704] to [1,18,256,704] by treating those 6 views as additional channels. This should keep your latency pretty close to the original.
  • Spatial merging (though less common): You could reshape into a larger spatial grid and split it back inside the model later.

Either way, you’ll need to add a Reshape or Transpose operation before the ONNX graph input so the parser only sees a standard 4D tensor.
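As a concrete sketch of the channel-merging option on the calibration-data side (assuming your calibration tensors are NumPy arrays in [N, V, C, H, W] order — variable names here are just illustrative):

```python
import numpy as np

# Hypothetical calibration batch in [N, V, C, H, W] order
calib = np.random.rand(1, 6, 3, 256, 704).astype(np.float32)

# Fold the 6 views into the channel axis: [1, 6, 3, 256, 704] -> [1, 18, 256, 704]
folded = calib.reshape(calib.shape[0], -1, calib.shape[3], calib.shape[4])
print(folded.shape)  # (1, 18, 256, 704)

# The view axis is recoverable later with the inverse reshape
restored = folded.reshape(1, 6, 3, 256, 704)
assert np.array_equal(restored, calib)
```

Since reshape here only merges adjacent axes (V and C), it's a pure view change with no data movement, so the per-pixel layout your convolutions see is unchanged apart from the channel count.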

Hope this helps clarify things!

Thanks for the clarification regarding 5D inputs. That makes sense.

So, I tried 4D input on my model.

image: [6, 3, 256, 704]

This should be a standard NCHW format (N=6, C=3, H=256, W=704).
But when I parse the model with Hailo parser, it automatically remaps the input to [256,704,3] in the HAR, as if it were treating the input as NHWC with N=1. This causes the same mismatch error as before:

BadInputsShape: Data shape (6,256,704,3) for layer fastbev_pre_trt/input_layer1
doesn't match network's input shape (256,704,3)

So my questions are:

  1. If the ONNX input is strictly 4D [6,3,256,704], why does the parser still squeeze it into [256,704,3]?

  2. Does the parser only support 4D with N=1 (batch=1) and drop the batch dimension entirely?

  3. If so, is there a way to force the parser to respect N>1 for batch (e.g. N=6 in my case), instead of collapsing it?

  4. Or is the only viable solution again to fold the 6 into the channel dimension and make it [1,18,256,704] before parsing?

This behavior seems to happen even when the model is already in valid 4D NCHW format, so I’d like to understand if this is an intentional limitation of the parser or a bug.

Thanks again for your help @omria !

It looks like this is probably an inherent limitation.

Take a look at this video classification model from the official Model Zoo. You would expect the input to be B, T, C, H, W (T being the temporal dimension), but if you look at the model’s actual input, the temporal dimension is folded into the channel dimension (16 frames × 3 channels = 48).