Previously, I tried to convert a simple model to HEF. Now I have added two more linear layers; here are the changes from the previous code:
batch_size = 1
input_len = 1024
vocab_len = 256 # UTF-8 characters
embedding_len = 256
hidden_size = 256
torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(vocab_len, embedding_len, bias=False),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(input_len * embedding_len, hidden_size, bias=False),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size, bias=False),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size, bias=False),
    nn.ReLU(),
    nn.Linear(hidden_size, vocab_len, bias=False),
)
...
hn, npz = runner.translate_onnx_model(
    "model.onnx",
    "network",
    start_node_names=["/0/MatMul"],
    end_node_names=["/9/MatMul"],
    net_input_shapes={"/0/MatMul": [batch_size, input_len, vocab_len]},
)
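For reference, I picked the start/end node names by dumping the node list of the exported ONNX model (a quick sketch, assuming the onnx package is installed; not part of the conversion itself):
import onnx
# Print every node so the first and last MatMul can be found by hand.
m = onnx.load("model.onnx")
for node in m.graph.node:
    print(node.op_type, node.name)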
The previous model was just 4 MiB on disk and converted fine. Now the ONNX file is 270 MiB. Unfortunately, I am getting an error right after the quantization step when compiling the model to HEF format:
...
Calibration: 100%|██████████| 64/64 [00:10<00:00, 6.29entries/s]
[info] Statistics Collector is done (completion time is 00:00:11.52)
[info] Starting Fix zp_comp Encoding
[info] Fix zp_comp Encoding is done (completion time is 00:00:00.00)
[info] Matmul Equalization skipped
[info] No shifts available for layer network/conv1/conv_op, using max shift instead. delta=0.8993
[info] No shifts available for layer network/fc1/conv_op, using max shift instead. delta=2.8396
[info] No shifts available for layer network/fc1_d3/conv_op, using max shift instead. delta=1.3134
[info] No shifts available for layer network/fc1_d3/conv_op, using max shift instead. delta=0.6567
[info] No shifts available for layer network/fc1_d2/conv_op, using max shift instead. delta=0.6788
[info] No shifts available for layer network/fc1_d3/conv_op, using max shift instead. delta=0.3173
[info] No shifts available for layer network/fc1_d2/conv_op, using max shift instead. delta=0.3394
[info] No shifts available for layer network/fc1_d1/conv_op, using max shift instead. delta=0.2193
[info] No shifts available for layer network/fc1_d3/conv_op, using max shift instead. delta=0.2077
[info] No shifts available for layer network/fc1_d2/conv_op, using max shift instead. delta=0.2297
[info] No shifts available for layer network/fc1_d1/conv_op, using max shift instead. delta=0.1096
[info] No shifts available for layer network/fc1/conv_op, using max shift instead. delta=0.3141
[info] No shifts available for layer network/conv1/conv_op, using max shift instead. delta=0.8993
[info] No shifts available for layer network/fc1/conv_op, using max shift instead. delta=0.3141
[info] No shifts available for layer network/conv1/conv_op, using max shift instead. delta=0.8993
[info] No shifts available for layer network/fc1_d1/conv_op, using max shift instead. delta=0.1096
[info] No shifts available for layer network/fc1/conv_op, using max shift instead. delta=0.3141
[info] No shifts available for layer network/conv1/conv_op, using max shift instead. delta=0.8993
[info] No shifts available for layer network/fc1_d2/conv_op, using max shift instead. delta=0.2297
[info] No shifts available for layer network/fc1_d1/conv_op, using max shift instead. delta=0.1096
[info] No shifts available for layer network/fc1/conv_op, using max shift instead. delta=0.3141
[info] No shifts available for layer network/conv1/conv_op, using max shift instead. delta=0.8993
[info] No shifts available for layer network/fc1_d1/conv_op, using max shift instead. delta=0.1096
[info] No shifts available for layer network/fc1_d2/conv_op, using max shift instead. delta=0.2297
[info] No shifts available for layer network/fc1_d3/conv_op, using max shift instead. delta=0.2077
[info] Finetune encoding skipped
[info] Bias Correction skipped
[info] Adaround skipped
[info] Quantization-Aware Fine-Tuning skipped
[info] Layer Noise Analysis skipped
[info] Model Optimization is done
[info] To achieve optimal performance, set the compiler_optimization_level to "max" by adding performance_param(compiler_optimization_level=max) to the model script. Note that this may increase compilation time.
[info] Loading network parameters
[info] Starting Hailo allocation and compilation flow
[error] Mapping Failed (allocation time: 0s)
[error] Failed to produce compiled graph
Can't find mutual format for fc1_d3 -> ew_add1_ew_add_n_fc1
[error] BackendAllocatorException: Compilation failed: Can't find mutual format for fc1_d3 -> ew_add1_ew_add_n_fc1
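As an aside, I have not tried the performance_param hint from the log above. If I read the docs correctly, it would be loaded as a model script before optimize(), roughly like below, though I do not expect it to affect the mapping failure:
# Untested sketch: apply the log's optimization hint via a model script.
runner.load_model_script("performance_param(compiler_optimization_level=max)\n")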
The architecture has barely changed, yet the Hailo Dataflow Compiler is already struggling to convert the weights to the HEF file that is required to run on the Hailo-8 AI accelerator.
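For scale, a back-of-the-envelope count (mine, not from the compiler) shows that the Flatten -> Linear layer dominates the model, which also explains the jump from 4 MiB to ~270 MiB on disk with float32 weights:
# The (input_len * embedding_len) x hidden_size matrix alone holds ~67M weights.
big_layer = input_len * embedding_len * hidden_size  # 1024 * 256 * 256 = 67,108,864
rest = vocab_len * embedding_len + 2 * hidden_size * hidden_size + hidden_size * vocab_len
print((big_layer + rest) * 4 / 1e6, "MB as float32")  # ~269.5 MB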
Aside from that error, if I set the vocab size to a value like vocab_len = 151936, I get a different error than before, and it happens while the weights are being quantized:
...
[info] Starting Model Optimization
[warning] Reducing optimization level to 0 (the accuracy won't be optimized and compression won't be used) because there's no available GPU
[warning] Running model optimization with zero level of optimization is not recommended for production use and might lead to suboptimal accuracy results
[info] Model received quantization params from the hn
Traceback (most recent call last):
  File "/home/i/p/hailo-convert/main_converter.py", line 97, in <module>
    runner.optimize(None)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
    return func(self, *args, **kwargs)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_sdk_client/runner/client_runner.py", line 2093, in optimize
    self._optimize(calib_data, data_type=data_type, work_dir=work_dir)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
    return func(self, *args, **kwargs)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_sdk_client/runner/client_runner.py", line 1935, in _optimize
    self._sdk_backend.full_quantization(calib_data, data_type=data_type, work_dir=work_dir)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_sdk_client/sdk_backend/sdk_backend.py", line 1045, in full_quantization
    self._full_acceleras_run(self.calibration_data, data_type)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_sdk_client/sdk_backend/sdk_backend.py", line 1229, in _full_acceleras_run
    optimization_flow.run()
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/tools/orchestator.py", line 306, in wrapper
    return func(self, *args, **kwargs)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 326, in run
    step_func()
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/tools/orchestator.py", line 250, in wrapped
    result = method(*args, **kwargs)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/tools/subprocess_wrapper.py", line 123, in parent_wrapper
    self.build_model()
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/tools/orchestator.py", line 250, in wrapped
    result = method(*args, **kwargs)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 241, in build_model
    model.build(shapes)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/acceleras/utils/distributed_utils.py", line 122, in wrapper
    res = func(self, *args, **kwargs)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/acceleras/model/hailo_model/hailo_model.py", line 1109, in build
    layer.build(layer.input_shapes)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/acceleras/hailo_layers/base_hailo_layer.py", line 1519, in build
    self._verify_and_set_hn_io_shapes()
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/acceleras/hailo_layers/base_hailo_layer.py", line 1625, in _verify_and_set_hn_io_shapes
    raise AccelerasValueError(
hailo_model_optimization.acceleras.utils.acceleras_exceptions.AccelerasValueError: Inference input shapes [[-1, 65536]] for layer network/fc1 does not match HN shapes ListWrapper([ListWrapper([-1, 64])])
So, here is the full script to reproduce the errors:
import torch
import torch.nn as nn
import hailo_sdk_client
from hailo_sdk_client import ClientRunner
print(f'Hailo Dataflow Compiler v{hailo_sdk_client.__version__}')
batch_size = 1
# input_len = 15 # Just a random number
# input_len = 32768 # https://huggingface.co/Xenova/Qwen1.5-0.5B/blob/main/config.json
# input_len = 4096
input_len = 1024
vocab_len = 256 # UTF-8 characters
# vocab_len = 151936 # https://huggingface.co/Xenova/Qwen1.5-0.5B/blob/main/config.json
embedding_len = 256
hidden_size = 256
# hidden_size = 512
torch.manual_seed(0)
# Note: Embedding layers should be changed to Linear layers, see https://community.hailo.ai/t/unable-to-convert-simplest-pytorch-model/3713/3
model = nn.Sequential(
    nn.Linear(vocab_len, embedding_len, bias=False),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(input_len * embedding_len, hidden_size, bias=False),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size, bias=False),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size, bias=False),
    nn.ReLU(),
    nn.Linear(hidden_size, vocab_len, bias=False),
)
# Print parameters per layer
for i, layer in enumerate(model):
    print(f"Layer {i}: {layer}")
# Total number of parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params} (billions: {total_params / 1e9})")
# Create one-hot input instead of embedding indices
input_data = torch.zeros(batch_size, input_len, vocab_len)
dummy_input = torch.randint(vocab_len, (batch_size, input_len))
for i in range(batch_size):
    for j in range(input_len):
        input_data[i, j, dummy_input[i, j]] = 1  # One-hot encoding
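# Equivalent vectorized form of the loop above (should produce the same tensor):
# input_data = torch.nn.functional.one_hot(dummy_input, num_classes=vocab_len).float()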
output = model(input_data)
print(f"{output.mean()=}, {output.std(unbiased=False)=}, {output.shape=}")
with torch.no_grad():
    # torch.onnx.export(model, input_data, "model.onnx", verbose=True, input_names=["input"], output_names=["output"])
    torch.onnx.export(model, input_data, "model.onnx", verbose=True)
# chosen_hw_arch = "hailo8"
# chosen_hw_arch = "hailo15h" # For Hailo-15 devices
chosen_hw_arch = "hailo8r" # For Mini PCIe modules or Hailo-8R devices
runner = ClientRunner(hw_arch=chosen_hw_arch)
hn, npz = runner.translate_onnx_model(
    "model.onnx",
    "network",
    start_node_names=["/0/MatMul"],
    end_node_names=["/9/MatMul"],
    net_input_shapes={"/0/MatMul": [batch_size, input_len, vocab_len]},
)
runner.save_har("model.har")
runner.optimize(None)
hef = runner.compile()
with open("model.hef", "wb") as f:
f.write(hef)
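For completeness: runner.optimize(None) appears to make the SDK generate its own random calibration set (hence the "Calibration: 64/64" line in the log above). If real calibration data turns out to matter, I believe it would be passed as an array shaped like the network input, along these lines (my own untested sketch):
import numpy as np
# 64 random one-hot sequences shaped like the network input.
calib = np.zeros((64, input_len, vocab_len), dtype=np.float32)
idx = np.random.randint(vocab_len, size=(64, input_len))
calib[np.arange(64)[:, None], np.arange(input_len)[None, :], idx] = 1.0
runner.optimize(calib)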