Dataflow Compiler errors converting deep and wide feed-forward network

Previously, I tried to convert a simple model to HEF. Now I have added two more linear layers; here are the changes from the previous code:

batch_size = 1
input_len = 1024
vocab_len = 256  # UTF-8 characters
embedding_len = 256
hidden_size = 256

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(vocab_len, embedding_len, bias=False),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(input_len * embedding_len, hidden_size, bias=False),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size, bias=False),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size, bias=False),
    nn.ReLU(),
    nn.Linear(hidden_size, vocab_len, bias=False),
)
...
hn, npz = runner.translate_onnx_model(
    "model.onnx",
    "network",
    start_node_names=["/0/MatMul"],
    end_node_names=["/9/MatMul"],
    net_input_shapes={"/0/MatMul": [batch_size, input_len, vocab_len]},
)

The previous model was just 4 MiB on disk and converted fine. Now the ONNX file is 270 MiB. Unfortunately, I am getting an error right after the quantization step when compiling the model to HEF:

...
Calibration: 100%|██████████| 64/64 [00:10<00:00,  6.29entries/s]
[info] Statistics Collector is done (completion time is 00:00:11.52)
[info] Starting Fix zp_comp Encoding
[info] Fix zp_comp Encoding is done (completion time is 00:00:00.00)
[info] Matmul Equalization skipped
[info] No shifts available for layer network/conv1/conv_op, using max shift instead. delta=0.8993
[info] No shifts available for layer network/fc1/conv_op, using max shift instead. delta=2.8396
[info] No shifts available for layer network/fc1_d3/conv_op, using max shift instead. delta=1.3134
[info] No shifts available for layer network/fc1_d3/conv_op, using max shift instead. delta=0.6567
[info] No shifts available for layer network/fc1_d2/conv_op, using max shift instead. delta=0.6788
[info] No shifts available for layer network/fc1_d3/conv_op, using max shift instead. delta=0.3173
[info] No shifts available for layer network/fc1_d2/conv_op, using max shift instead. delta=0.3394
[info] No shifts available for layer network/fc1_d1/conv_op, using max shift instead. delta=0.2193
[info] No shifts available for layer network/fc1_d3/conv_op, using max shift instead. delta=0.2077
[info] No shifts available for layer network/fc1_d2/conv_op, using max shift instead. delta=0.2297
[info] No shifts available for layer network/fc1_d1/conv_op, using max shift instead. delta=0.1096
[info] No shifts available for layer network/fc1/conv_op, using max shift instead. delta=0.3141
[info] No shifts available for layer network/conv1/conv_op, using max shift instead. delta=0.8993
[info] No shifts available for layer network/fc1/conv_op, using max shift instead. delta=0.3141
[info] No shifts available for layer network/conv1/conv_op, using max shift instead. delta=0.8993
[info] No shifts available for layer network/fc1_d1/conv_op, using max shift instead. delta=0.1096
[info] No shifts available for layer network/fc1/conv_op, using max shift instead. delta=0.3141
[info] No shifts available for layer network/conv1/conv_op, using max shift instead. delta=0.8993
[info] No shifts available for layer network/fc1_d2/conv_op, using max shift instead. delta=0.2297
[info] No shifts available for layer network/fc1_d1/conv_op, using max shift instead. delta=0.1096
[info] No shifts available for layer network/fc1/conv_op, using max shift instead. delta=0.3141
[info] No shifts available for layer network/conv1/conv_op, using max shift instead. delta=0.8993
[info] No shifts available for layer network/fc1_d1/conv_op, using max shift instead. delta=0.1096
[info] No shifts available for layer network/fc1_d2/conv_op, using max shift instead. delta=0.2297
[info] No shifts available for layer network/fc1_d3/conv_op, using max shift instead. delta=0.2077
[info] Finetune encoding skipped
[info] Bias Correction skipped
[info] Adaround skipped
[info] Quantization-Aware Fine-Tuning skipped
[info] Layer Noise Analysis skipped
[info] Model Optimization is done
[info] To achieve optimal performance, set the compiler_optimization_level to "max" by adding performance_param(compiler_optimization_level=max) to the model script. Note that this may increase compilation time.
[info] Loading network parameters
[info] Starting Hailo allocation and compilation flow
[error] Mapping Failed (allocation time: 0s)

[error] Failed to produce compiled graph
Can't find mutual format for fc1_d3 -> ew_add1_ew_add_n_fc1
[error] BackendAllocatorException: Compilation failed: Can't find mutual format for fc1_d3 -> ew_add1_ew_add_n_fc1

The architecture has barely changed, yet the Hailo Dataflow Compiler is already struggling to convert the weights into the HEF file required to run on the Hailo-8 AI accelerator.

Aside from that error, if I set the vocab size to a value like vocab_len = 151936, I get a different error than before, this time during weight quantization:

...
[info] Starting Model Optimization
[warning] Reducing optimization level to 0 (the accuracy won't be optimized and compression won't be used) because there's no available GPU
[warning] Running model optimization with zero level of optimization is not recommended for production use and might lead to suboptimal accuracy results
[info] Model received quantization params from the hn
Traceback (most recent call last):
  File "/home/i/p/hailo-convert/main_converter.py", line 97, in <module>
    runner.optimize(None)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
    return func(self, *args, **kwargs)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_sdk_client/runner/client_runner.py", line 2093, in optimize
    self._optimize(calib_data, data_type=data_type, work_dir=work_dir)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
    return func(self, *args, **kwargs)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_sdk_client/runner/client_runner.py", line 1935, in _optimize
    self._sdk_backend.full_quantization(calib_data, data_type=data_type, work_dir=work_dir)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_sdk_client/sdk_backend/sdk_backend.py", line 1045, in full_quantization
    self._full_acceleras_run(self.calibration_data, data_type)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_sdk_client/sdk_backend/sdk_backend.py", line 1229, in _full_acceleras_run
    optimization_flow.run()
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/tools/orchestator.py", line 306, in wrapper
    return func(self, *args, **kwargs)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 326, in run
    step_func()
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/tools/orchestator.py", line 250, in wrapped
    result = method(*args, **kwargs)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/tools/subprocess_wrapper.py", line 123, in parent_wrapper
    self.build_model()
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/tools/orchestator.py", line 250, in wrapped
    result = method(*args, **kwargs)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 241, in build_model
    model.build(shapes)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/acceleras/utils/distributed_utils.py", line 122, in wrapper
    res = func(self, *args, **kwargs)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/acceleras/model/hailo_model/hailo_model.py", line 1109, in build
    layer.build(layer.input_shapes)
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/acceleras/hailo_layers/base_hailo_layer.py", line 1519, in build
    self._verify_and_set_hn_io_shapes()
  File "/home/i/miniconda3/envs/hailo_convert/lib/python3.10/site-packages/hailo_model_optimization/acceleras/hailo_layers/base_hailo_layer.py", line 1625, in _verify_and_set_hn_io_shapes
    raise AccelerasValueError(
hailo_model_optimization.acceleras.utils.acceleras_exceptions.AccelerasValueError: Inference input shapes [[-1, 65536]] for layer network/fc1 does not match HN shapes ListWrapper([ListWrapper([-1, 64])])

So, here is the full script to reproduce the errors:

import torch
import torch.nn as nn
import hailo_sdk_client
from hailo_sdk_client import ClientRunner

print(f'Hailo Dataflow Compiler v{hailo_sdk_client.__version__}')

batch_size = 1
# input_len = 15  # Just a random number
# input_len = 32768  # https://huggingface.co/Xenova/Qwen1.5-0.5B/blob/main/config.json
# input_len = 4096
input_len = 1024
vocab_len = 256  # UTF-8 characters
# vocab_len = 151936  # https://huggingface.co/Xenova/Qwen1.5-0.5B/blob/main/config.json
embedding_len = 256
hidden_size = 256
# hidden_size = 512

torch.manual_seed(0)
# Note: Embedding layers should be changed to Linear layers, see https://community.hailo.ai/t/unable-to-convert-simplest-pytorch-model/3713/3
model = nn.Sequential(
    nn.Linear(vocab_len, embedding_len, bias=False),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(input_len * embedding_len, hidden_size, bias=False),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size, bias=False),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size, bias=False),
    nn.ReLU(),
    nn.Linear(hidden_size, vocab_len, bias=False),
)

# Print the configuration of each layer
for i, layer in enumerate(model):
    print(f"Layer {i}: {layer}")

# Total number of parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params} (billions: {total_params / 1e9})")

# Create one-hot input instead of embedding indices
input_data = torch.zeros(batch_size, input_len, vocab_len)
dummy_input = torch.randint(vocab_len, (batch_size, input_len))
for i in range(batch_size):
    for j in range(input_len):
        input_data[i, j, dummy_input[i, j]] = 1  # One-hot encoding

output = model(input_data)
print(f"{output.mean()=}, {output.std(unbiased=False)=}, {output.shape=}")

with torch.no_grad():
    # torch.onnx.export(model, input_data, "model.onnx", verbose=True, input_names=["input"], output_names=["output"])
    torch.onnx.export(model, input_data, "model.onnx", verbose=True)

# chosen_hw_arch = "hailo8"
# chosen_hw_arch = "hailo15h"  # For Hailo-15 devices
chosen_hw_arch = "hailo8r"  # For Mini PCIe modules or Hailo-8R devices
runner = ClientRunner(hw_arch=chosen_hw_arch)
hn, npz = runner.translate_onnx_model(
    "model.onnx",
    "network",
    start_node_names=["/0/MatMul"],
    end_node_names=["/9/MatMul"],
    net_input_shapes={"/0/MatMul": [batch_size, input_len, vocab_len]},
)
runner.save_har("model.har")

runner.optimize(None)

hef = runner.compile()
with open("model.hef", "wb") as f:
    f.write(hef)

Hey @ivanstepanovftw

Thanks for the detailed information and for sharing the specific errors you’re seeing. Based on the issues you’re facing, it seems like the complexity of the model is overwhelming the Hailo Dataflow Compiler, especially in terms of quantization and tensor shape handling. Here’s a more focused guide on how to resolve these problems:

1. Dealing with Large Model Size and Complexity:

The sharp increase in the ONNX model size (from 4 MiB to 270 MiB) and the errors you’re seeing suggest that the additional layers and increased dimensions are making the model difficult to handle during compilation. To resolve this:

Fix:

  • Reduce model complexity: Temporarily reduce the dimensions of the input, vocab, or hidden layers. For example:
    input_len = 512  # Reduce from 1024
    hidden_size = 128  # Reduce from 256
    
  • This simplification will help you isolate whether the problem is with the model’s size and complexity or the DFC itself.

Further Suggestion: Once the simplified model compiles successfully, you can incrementally increase the sizes and layer complexity while monitoring which changes reintroduce the issue. This will give you a clear indication of where the bottleneck is.

2. Addressing the “No Shifts Available” Messages:

The No shifts available for layer ..., using max shift instead lines are [info] messages rather than hard errors, but a long run of them usually means the quantizer is struggling with the model’s quantization parameters or layer configuration. This often happens when tensor shapes are unexpected or the values flowing through operations (like the large matrix multiplications here) have extreme ranges.

Fix:

  • Ensure tensor dimensions are reasonable: Reduce the dimensionality of the layers to prevent overwhelming the compiler during quantization.
  • Check for irregular dimensions: Make sure that the matrix multiplications and operations in your model have expected shapes. You may want to experiment with reducing the dimensionality of the input or intermediate layers.
  • Use the recommended optimization level: Consider raising the compiler optimization level by adding the command from the compiler log to your model script (see the sketch after this list):
    performance_param(compiler_optimization_level=max)
    
    This can help the DFC handle more complex models by increasing the depth of optimization performed during compilation.
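
For reference, the model script is just a short text file or string of commands that you load onto the runner before calling optimize and compile. A minimal sketch, assuming the standard ClientRunner API already used in your script:

    # Apply the command suggested by the compiler log; load_model_script
    # accepts a path to an .alls file or a raw string of commands.
    runner.load_model_script("performance_param(compiler_optimization_level=max)\n")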

3. Resolving “Mutual Format” Error:

The error:

Can't find mutual format for fc1_d3 -> ew_add1_ew_add_n_fc1

suggests a tensor format mismatch between these two layers. It appears during the allocation and compilation stage, and names like fc1_d3 are typically generated by the DFC when it splits (defuses) a layer that is too large to map in one piece, so the mismatch may stem from how the compiler decomposed your large fully connected layer rather than from your PyTorch code directly.

Fix:

  • Manual Shape Checking: Double-check the shapes being passed between the fc1_d3 and ew_add1_ew_add_n_fc1 layers; mismatched dimensions between connected layers can cause format incompatibilities (see the ONNX shape-inference sketch after this list).
  • Explicitly reshape tensors: If any layers have mismatched dimensions, use nn.Flatten() or an explicit reshape (PyTorch has no nn.Reshape; use torch.reshape, Tensor.view, or nn.Unflatten) so that tensor dimensions match between connected layers. For example:
    nn.Flatten()  # Ensure all layers pass the correct dimensionality
    
  • Check if specific layer types are unsupported: Certain layers or operations may be unsupported or suboptimally handled by the Hailo compiler. Consider replacing certain layers with simpler alternatives to see if the issue persists.
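
One way to do that shape checking before involving the DFC at all is ONNX shape inference. A sketch using the standard onnx package (the file name matches your script):

    # Infer and print every intermediate tensor shape in the exported model.
    import onnx
    from onnx import shape_inference

    inferred = shape_inference.infer_shapes(onnx.load("model.onnx"))
    for vi in inferred.graph.value_info:
        dims = [d.dim_value or "?" for d in vi.type.tensor_type.shape.dim]
        print(vi.name, dims)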

4. Resolving Quantization Errors:

The quantization error:

Inference input shapes [[-1, 65536]] for layer network/fc1 does not match HN shapes ListWrapper([ListWrapper([-1, 64])])

is raised because the input shape the optimizer computes at build time ([-1, 65536]) does not match the shape recorded for that layer in the translated HN graph ([-1, 64]). This can happen when the shapes used at inference time don’t match the input shapes defined when the model was translated.

Fix:

  • Verify input-output shapes: Carefully check the shapes of the tensors going into the model, and make sure each layer produces an output shape the following layer can accept. You can print the shape at each layer to verify this (a forward-hook sketch follows this list):
    print(f"Layer output shape: {output_tensor.shape}")
    
  • Adjust tensor shapes: Use reshape operations or linear transformations to ensure that the output tensor shapes match the expected input shapes for the next layer. For example:
    nn.Linear(input_size, output_size)
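
To automate that check, you can attach forward hooks that print each submodule’s output shape; a sketch reusing model and input_data from your script:

    # Print every layer's output shape during one forward pass, then
    # remove the hooks so they don't interfere with the ONNX export.
    def shape_hook(module, args, output):
        print(f"{module.__class__.__name__}: {tuple(output.shape)}")

    handles = [layer.register_forward_hook(shape_hook) for layer in model]
    with torch.no_grad():
        model(input_data)
    for handle in handles:
        handle.remove()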
    

5. Handling Large Vocab Sizes and Embedding Layers:

The larger vocab size (151936) is likely causing the compiler to fail due to the increased memory and computational requirements. Hailo devices have specific memory limitations, and a large vocab size or embedding dimension can easily overwhelm the system.

Fix:

  • Reduce the vocab size: Try using a smaller vocab size and embedding dimension. For example:
    vocab_len = 1024  # Reduce from 151936
    embedding_len = 128  # Reduce from 256
    
  • Split the model: If your model is too large, you can consider splitting the network across multiple Hailo devices or reducing the size of the input (a sketch follows this list). For edge devices, it’s important to design models that are memory-efficient.
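
A sketch of what such a split could look like in PyTorch, with an illustrative split point after the first big fully connected layer:

    # Export the two halves separately; each could then be compiled into
    # its own HEF (the split index 4 is illustrative, not prescriptive).
    front = nn.Sequential(*list(model.children())[:4])  # Linear, ReLU, Flatten, Linear
    back = nn.Sequential(*list(model.children())[4:])   # remaining layers
    with torch.no_grad():
        torch.onnx.export(front, input_data, "model_front.onnx")
        torch.onnx.export(back, front(input_data), "model_back.onnx")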

6. Embedding Layer Replacement:

Since Hailo does not natively support embedding layers, your approach of replacing them with Linear layers is correct. Just make sure that your one-hot encoding input is being handled correctly, as this is essential for the linear transformation to work as expected.
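
As a sanity check on that input, torch.nn.functional.one_hot produces the same tensor as the explicit nested loop in your script, in one vectorized call:

    # Vectorized one-hot encoding, equivalent to the nested loop.
    import torch.nn.functional as F

    dummy_input = torch.randint(vocab_len, (batch_size, input_len))
    input_data = F.one_hot(dummy_input, num_classes=vocab_len).float()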

7. Final Suggestions:

  • Quantization-Aware Training (QAT): If quantization issues persist, consider using QAT before the model conversion. This makes the model more quantization-friendly, improving accuracy and making the DFC less likely to fail during conversion (see the model-script sketch after this list).
  • Use Profiling Tools: Hailo provides profiling tools in its SDK to help analyze which layers or operations are causing bottlenecks. Running your model through them can pinpoint where the issues lie and suggest where to optimize.
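
If you do pursue QAT or higher optimization levels through the DFC, the relevant knobs also live in the model script. A sketch, assuming your DFC version supports these optimization commands (check the Dataflow Compiler user guide for the exact syntax, and note that fine-tuning needs a GPU and a real calibration set rather than None):

    # Model-script commands to raise the optimization level and enable
    # fine-tuning during quantization (both assumed from the DFC docs).
    runner.load_model_script(
        "model_optimization_flavor(optimization_level=2)\n"
        "post_quantization_optimization(finetune, policy=enabled)\n"
    )
    runner.optimize(calib_data)  # calib_data: your calibration dataset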

I hope these suggestions help resolve the issues. Let me know how things go or if you need further assistance!

Best regards,
Omri