Translating LLM (Llama 3 8B) Fails

Hi, I've managed to convert the Llama 3 8B model into ONNX format.
But trying to convert the model into Hailo's format returns the following error:
UnsupportedModelError: Unexpected zero dimension in shape [-1, 0] at input layer layer (translated from input_ids)

I have been debugging this day and night for three days. It would be really great if someone could help me, thanks.

Here’s the terminal output:

[info] Translation started on ONNX model model
[warning] Large model detected. The graph may contain either a large number of operators, or weight variables with a very large capacity.
[warning] Translation time may be a bit long, and some features may be disabled (e.g. model augmentation, retry simplified model, onnx runtime hailo model extraction, etc.).
[info] Restored ONNX model model (completion time: 00:00:16.69)
[warning] ONNX shape inference failed: Unsupported dynamic shape([0, 0]) found on input node input_ids. Please use net_input_shapes, see documentation for additional info.

UnsupportedModelError Traceback (most recent call last)
Cell In[2], line 1
----> 1 client_runner.translate_onnx_model("./M1-8B-v0.1-ONNX/model.onnx")

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_common/states/states.py:16, in allowed_states..wrap..wrapped_func(self, *args, **kwargs)
12 if self._state not in states:
13 raise InvalidStateException(
14 f"The execution of {func.__name__} is not available under the state: {self._state.value}",
15 )
---> 16 return func(self, *args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_client/runner/client_runner.py:1158, in ClientRunner.translate_onnx_model(self, model, net_name, start_node_names, end_node_names, net_input_shapes, augmented_path, disable_shape_inference, disable_rt_metadata_extraction, net_input_format, **kwargs)
1115 """
1116 DFC API for parsing an ONNX model. This creates a runner with loaded HN (model) and
1117 parameters.
(...)
1155
1156 """
1157 parser = Parser()
-> 1158 parser.translate_onnx_model(
1159 model=model,
1160 net_name=net_name,
1161 start_node_names=start_node_names,
1162 end_node_names=end_node_names,
1163 net_input_shapes=net_input_shapes,
1164 augmented_path=augmented_path,
1165 disable_shape_inference=disable_shape_inference,
1166 disable_rt_metadata_extraction=disable_rt_metadata_extraction,
1167 net_input_format=net_input_format,
1168 **kwargs,
1169 )
1171 return self._finalize_parsing(parser.return_data)

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_client/sdk_backend/parser/parser.py:232, in Parser.translate_onnx_model(self, model, net_name, start_node_names, end_node_names, net_input_shapes, augmented_path, disable_shape_inference, disable_rt_metadata_extraction, net_input_format, **kwargs)
230 except Exception as e:
231 if large_model_detected or long_model_detected:
--> 232 raise e from None
234 try:
235 simplified_model, is_valid = onnxsim.simplify(onnx_model, skip_fuse_bn=True)

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_client/sdk_backend/parser/parser.py:220, in Parser.translate_onnx_model(self, model, net_name, start_node_names, end_node_names, net_input_shapes, augmented_path, disable_shape_inference, disable_rt_metadata_extraction, net_input_format, **kwargs)
217 onnx.save_model(onnx_model, augmented_path)
219 try:
--> 220 parsing_results = self._parse_onnx_model_to_hn(
221 onnx_model=onnx_model,
222 net_name=valid_net_name,
223 start_node_names=start_node_names,
224 end_node_names=end_node_names,
225 net_input_shapes=net_input_shapes,
226 disable_shape_inference=disable_shape_inference,
227 net_input_format=net_input_format,
228 )
230 except Exception as e:
231 if large_model_detected or long_model_detected:

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_client/sdk_backend/parser/parser.py:300, in Parser._parse_onnx_model_to_hn(self, onnx_model, net_name, start_node_names, end_node_names, net_input_shapes, disable_shape_inference, net_input_format, **kwargs)
297 except Exception as e:
298 self._logger.warning(f"ONNX shape inference failed: {e!s}")
--> 300 return self.parse_model_to_hn(
301 onnx_model,
302 None,
303 net_name,
304 start_node_names,
305 end_node_names,
306 nn_framework=NNFramework.ONNX,
307 output_shapes=output_shapes,
308 net_input_format=net_input_format,
309 **kwargs,
310 )

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_client/sdk_backend/parser/parser.py:351, in Parser.parse_model_to_hn(self, model, values, net_name, start_node_names, end_node_names, nn_framework, output_shapes, net_input_format, rename_layers_by_blocks)
348 else:
349 raise BackendRuntimeException(f"Unsupported NN framework {nn_framework}")
--> 351 fuser = HailoNNFuser(converter.convert_model(), net_name, converter.end_node_names)
352 hailo_nn = fuser.convert_model()
353 hailo_nn.validate_stage(HnStage.HN)

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_client/model_translator/translator.py:79, in HailoNNConverter.convert_model(self)
77 self._validate_model_params()
78 self._validate_bn_ops_in_training()
---> 79 self._create_layers()
80 self._add_layers_connections()
81 self._layers_graph.set_names_and_indices()

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_client/model_translator/edge_nn_translator.py:32, in EdgeNNConverter._create_layers(self)
30 def _create_layers(self):
31 self._visited_states = {}
---> 32 self._add_input_layers()
33 self._update_vertices_info()
34 self._add_direct_layers()

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_client/model_translator/edge_nn_translator.py:70, in EdgeNNConverter._add_input_layers(self)
65 if rank not in [2, 3, 4]:
66 raise UnsupportedModelError(
67 f"Input layer {vertex.name} has an input tensor with {rank} dimensions, which is not supported "
68 "by the Dataflow Compiler. Only 2-4 dimensional tensors are allowed",
69 )
---> 70 layer = InputLayer.create(vertex.name, input_shapes)
71 self._add_layer(layer, has_edge=False)
72 self._vertices_to_layers[vertex] = layer

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_common/hailo_nn/hn_layers/io_layers.py:110, in InputLayer.create(cls, original_name, output_shapes)
108 layer = cls()
109 layer.add_original_name(original_name)
--> 110 layer.output_shapes = output_shapes
111 for shape in layer.output_shapes:
112 shape[0] = -1

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_common/hailo_nn/hn_layers/layer.py:487, in Layer.output_shapes(self, output_shapes)
485 self._output_shapes = []
486 for shape in output_shapes:
--> 487 self._append_output_shape(shape)

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_common/hailo_nn/hn_layers/layer.py:461, in Layer._append_output_shape(self, output_shape)
460 def _append_output_shape(self, output_shape):
--> 461 self._check_valid_shape(output_shape)
462 self.append_output_shape(output_shape)

File /opt/conda/lib/python3.10/site-packages/hailo_sdk_common/hailo_nn/hn_layers/layer.py:454, in Layer._check_valid_shape(self, shape)
449 raise UnsupportedModelError(
450 f"Unexpected dimension in shape {shape} at {self.full_name_msg}. "
451 f"Dimension must be of type 'int' or 'long'",
452 )
453 if any(dim == 0 for dim in shape):
--> 454 raise UnsupportedModelError(f"Unexpected zero dimension in shape {shape} at {self.full_name_msg}")

UnsupportedModelError: Unexpected zero dimension in shape [-1, 0] at input layer layer (translated from input_ids)

Hi @orionriker,
Thank you for using the Hailo community forum, and we really appreciate your diligent effort to make this work. At the moment, LLMs are not supported by the Hailo-8 product line.

Understood, but neural networks are neural networks. There must be a way to work around this, right?

Technically you are right, but with LLMs the challenge is different. With traditional CNNs there is much more compute relative to weights; with LLMs it is all about the ability to bring the weights in from memory. See the comparison below:

[image: compute vs. weights comparison, CNNs vs. LLMs]

So, while it is technically possible, we don't think that the performance would be attractive enough. This is the reason that we are working on alternative HW.
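
To put rough numbers on it: in memory-bound decoding, every generated token has to stream essentially all of the weights from memory, so memory bandwidth, not TOPS, sets the ceiling on tokens per second. A back-of-the-envelope sketch in Python (the bandwidth figure below is just an assumed example, not the spec of any product):

# Llama 3 8B at fp16: roughly all 16 GB of weights are read per generated token
weights_gb = 8e9 * 2 / 1e9   # 8B parameters x 2 bytes (fp16) ~= 16 GB
bandwidth_gb_s = 8.0         # assumed example memory bandwidth, in GB/s
print(f"decode ceiling: {bandwidth_gb_s / weights_gb:.2f} tokens/s")  # ~0.5 tokens/s

Adding more compute does not move that bound; only more bandwidth, or smaller (e.g. quantized) weights, does.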

Hmm, that is true; bringing that many weights in from memory is definitely a big task. Now I understand what you mean. But the thing is, most people using edge devices like the Raspberry Pi are already tinkering like me to get AI/LLM models running faster, and by adding the Hailo-8 you get a lot of TOPS. For me any performance boost is really great, even if it's 10-15 tokens per second.


Also, is it possible for you to explain the error that I am facing? I know it has something to do with the dynamic shapes inside LLMs, but I couldn't really grasp the error that much.

You’re right, and we’re also completely aware of this. We are working on a solution that would be attractive for makers.

Beyond the sheer size of the model, the basic op, MHSA (multi-head self-attention), has some unique features compared to CNNs. It might be that the attention mechanism looks different from what the SW knows how to parse, but in order to be certain, we would need to look at the ONNX.
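
Regarding the shape error itself: the exported graph leaves input_ids with symbolic batch/sequence dimensions, ONNX shape inference cannot resolve them (that is the "Unsupported dynamic shape([0, 0])" warning in your log), so both dimensions reach the parser as 0. The input layer then forces the batch dimension to -1 (the shape[0] = -1 line from io_layers.py in your traceback), leaving [-1, 0], and the remaining zero raises the UnsupportedModelError. You can confirm the symbolic dimensions yourself with the onnx package; a minimal sketch, assuming the model path from your notebook:

import onnx

model = onnx.load("./M1-8B-v0.1-ONNX/model.onnx")
for inp in model.graph.input:
    # symbolic dims carry a dim_param string (e.g. "batch_size"); fixed dims carry dim_value
    dims = [d.dim_param if d.dim_param else d.dim_value
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)
# expected output is something like: input_ids ['batch_size', 'sequence_length']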

It's good to hear that! Thanks. In the meantime, let me share what I did to convert the LLM to ONNX so that you can take a look at the ONNX model:

I first researched converting LLMs to ONNX and found some really great info.
Please note that to convert the model I had to use a server with a 24 GB VRAM GPU from vast.ai.
For some reason the export could not run on my GPU with 16 GB VRAM, even though that GPU runs the model just fine in under 12 GB of VRAM.

Then I installed Optimum, which allows you to convert PyTorch models (including LLMs) to ONNX, with the following command:

pip install optimum[exporters]

Cloned the model (which I custom-trained based on Llama 3 8B Instruct to handle function-calling tasks) from Hugging Face using git with git-lfs:

Install git-lfs:

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git-lfs install

Clone the model:

git clone https://huggingface.co/orionriker/M1-8B-v0.1-BF16

And then finally I ran the following command to convert it to ONNX:

!optimum-cli export onnx --task text-generation-with-past --model "./M1-8B-v0.1-BF16" --cache_dir "./optimum-cache" --trust-remote-code --no-constant-folding --opset 17 --fp16 --optimize "O3" --device "cuda" --batch_size 8 --sequence_length 128 "./M1-8B-v0.1-ONNX"
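
As a side note, the warning near the top of the log says the dynamic shape on input_ids should be handled with net_input_shapes. Going by the translate_onnx_model signature in the traceback, I assume pinning the dynamic dimensions would look something like this (untested, and the shape values are just my guess):

client_runner.translate_onnx_model(
    "./M1-8B-v0.1-ONNX/model.onnx",
    net_input_shapes={"input_ids": [1, 128]},  # pin the dynamic batch/sequence dims
)

Though even if that gets past the input layer, I suppose the attention ops might still fail further into parsing.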