YoloV9-tiny compilation fails

Hello,

for the conversion, I tried both DFC 3.28 and DFC 3.27. DFC 3.28 currently appears to have a shape issue during optimization, so I will only show the results for DFC 3.27.

I used the weights from the WongKinYiu/yolov9 repository: Readme → Performance → YOLOv9-T.

I exported it using the command python export.py --include onnx --weights yolov9-t-converted.pt --imgsz 640 640 --simplify --optimize
(I can also make the model accessible)

Note that this onnx file is slightly different from the yolov9c in the model-zoo, as it was optimized to remove an auxiliary branch that’s only required for training.

[Images: left: YoloV9c from the Hailo Model Zoo, right: YoloV9-T]

I also have a pip requirements.txt for the environment I used, which I can upload if someone provides an upload space.

Once a .onnx file has been generated, I use this command for conversion:

hailo parser onnx yolov9-t-converted.onnx --net-name yolov9-t --har-path yolov9-t.har --start-node-names images --end-node-names output0 --hw-arch hailo8 --augmented-path yolov9-t-augmented.onnx

These are the respective logs, which look fine to me:
Note that the parser recommended different end nodes, which I accepted, and since I don’t need NMS right now, I declined the NMS postprocess step.

(...)
[info] System info: OS: Linux, Kernel: 6.5.0-41-generic
[info] Hailo DFC Version: 3.27.0
[info] HailoRT Version: Not Installed
[info] PCIe: b5:00.0: Number Of Lanes: 4, Speed: 8.0 GT/s PCIe
[info] PCIe: b6:00.0: Number Of Lanes: 4, Speed: 8.0 GT/s PCIe
[info] PCIe: b7:00.0: Number Of Lanes: 4, Speed: 8.0 GT/s PCIe
[info] PCIe: b8:00.0: Number Of Lanes: 4, Speed: 8.0 GT/s PCIe
[info] Running `hailo parser onnx yolov9-t-converted.onnx --net-name yolov9-t --har-path yolov9-t.har --start-node-names images --end-node-names output0 --hw-arch hailo8 --augmented-path yolov9-t-augmented.onnx`
[info] Translation started on ONNX model yolov9-t
[info] Restored ONNX model yolov9-t (completion time: 00:00:00.05)
[info] Extracted ONNXRuntime meta-data for Hailo model (completion time: 00:00:00.20)
[info] Saving a modified model, augmented with tensors names (where applicable). New file path is at yolov9-t-augmented.onnx
[info] Saving a simplified model, augmented with tensors names (where applicable). New file path is at yolov9-t-augmented.sim.onnx
[info] Simplified ONNX model for a parsing retry attempt (completion time: 00:00:02.13)
Parsing failed with recommendations for end node names: ['/model.22/Concat_3'].
Would you like to parse again with the recommendation? (y/n)
y
[info] According to recommendations, retrying parsing with end node names: ['/model.22/Concat_3'].
[info] Translation started on ONNX model yolov9-t
[info] Restored ONNX model yolov9-t (completion time: 00:00:00.04)
[info] Extracted ONNXRuntime meta-data for Hailo model (completion time: 00:00:00.20)
[info] Saving a modified model, augmented with tensors names (where applicable). New file path is at yolov9-t-augmented.onnx
[info] NMS structure of yolov8 (or equivalent architecture) was detected.
[info] In order to use HailoRT post-processing capabilities, these end node names should be used: /model.22/cv2.0/cv2.0.2/Conv /model.22/cv3.0/cv3.0.2/Conv /model.22/cv2.1/cv2.1.2/Conv /model.22/cv3.1/cv3.1.2/Conv /model.22/cv2.2/cv2.2.2/Conv /model.22/cv3.2/cv3.2.2/Conv.
[info] Start nodes mapped from original model: 'images': 'yolov9-t/input_layer1'.
[info] End nodes mapped from original model: '/model.22/Concat_3'.
[info] Translation completed on ONNX model yolov9-t (completion time: 00:00:02.79)
Would you like to parse the model again with the mentioned end nodes and add nms postprocess command to the model script? (y/n)
n
[info] Saved HAR to: (...)/hds/hailo_model_zoo/yolov9-t.har

With this command I receive a bunch of files, one of them being yolov9-t.har, which I then optimize with the following command and .alls file:

.alls file

normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
model_optimization_config(calibration, batch_size=2)
post_quantization_optimization(finetune, policy=enabled, learning_rate=1e-5)

Command
hailo optimize --hw-arch hailo8 --use-random-calib-set --calib-random-max 1 --work-dir ./wdir --model-script ~/hds/hailo_model_zoo/hailo_model_zoo/cfg/alls/generic/yolov9t.alls --output-har-path yolov9-t-converted.har yolov9-t.har

Yes, I’m aware that --use-random-calib-set is not optimal; I just wanted to test the whole toolchain before diving deeper.
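
For later, once I’m past this smoke test, I’d build a real calibration set roughly like this (a sketch, untested; calib_images/ is a hypothetical folder, and I’m assuming the optimizer accepts an NHWC .npy in the 0–255 range via --calib-set-path, since the normalization runs on chip per the .alls above):

import glob
import numpy as np
from PIL import Image

# Stack ~64 domain images into an (N, 640, 640, 3) array, un-normalized
# (0-255 range), matching the parsed input resolution.
paths = sorted(glob.glob("calib_images/*.jpg"))[:64]
calib = np.stack([
    np.asarray(Image.open(p).convert("RGB").resize((640, 640)), dtype=np.float32)
    for p in paths
])
np.save("calib_set.npy", calib)  # then: --calib-set-path calib_set.npy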

These are the respective logs, which again look fine to me:

[info] Current Time: 09:11:41, 07/30/24
[info] PCIe: b5:00.0: Number Of Lanes: 4, Speed: 8.0 GT/s PCIe
(...)
[info] Running `hailo optimize --hw-arch hailo8 --use-random-calib-set --calib-random-max 1 --work-dir ./wdir --model-script /home/user/hds/hailo_model_zoo/hailo_model_zoo/cfg/alls/generic/yolov9t.alls --output-har-path yolov9-t-converted.har yolov9-t.har`
[info] Loading model script commands to yolov9-t from /home/user/hds/hailo_model_zoo/hailo_model_zoo/cfg/alls/generic/yolov9t.alls
[info] Found model with 3 input channels, using real RGB images for calibration instead of sampling random data.
[info] Starting Model Optimization
[info] Using default optimization level of 2
[info] Model received quantization params from the hn
[info] Starting Mixed Precision
[info] Mixed Precision is done (completion time is 00:00:00.18)
[info] create_layer_norm skipped
[info] Starting Stats Collector
[info] Using dataset with 64 entries for calibration
Calibration: 100%|████████████████████| 64/64 [01:27<00:00,  1.37s/entries]
[info] Stats Collector is done (completion time is 00:01:32.82)
[info] No shifts available for layer yolov9-t/conv1/conv_op, using max shift instead. delta=4.770761047870071
[info] No shifts available for layer yolov9-t/conv1/conv_op, using max shift instead. delta=2.385380519565488
[info] Bias Correction skipped
[info] Adaround skipped
[info] Starting Fine Tune
[warning] Dataset is larger than expected size. Increasing the algorithm dataset size might improve the results
[info] Using dataset with 1024 entries for finetune
Epoch 1/4
437/512 [========================>.....] - ETA: 29s - total_distill_loss: 0.0831 - _distill_loss_yolov9-t/concat31: 0.0831
(...)
[info] Fine Tune is done (completion time is 00:16:57.09)
[info] Starting Layer Noise Analysis
Full Quant Analysis: 100%|████████████████████| 8/8 [05:14<00:00, 39.36s/iterations]
[info] Layer Noise Analysis is done (completion time is 00:05:22.68)
[info] Output layers signal-to-noise ratio (SNR): measures the quantization noise (higher is better)
[info]  yolov9-t/output_layer1 SNR:     16.04 dB
[info] Runtime input quantization on host will be required.
Adding normalization on chip could improve the performance, by making the quantization redundant.
For more information, see Hailo Dataflow Compiler user guide / Model Optimization / Optimization Related Model Script Commands / model_modification_commands / normalization
[info]  yolov9-t/input_layer1:
        Current range, per feature:     [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
        Expected range (for all features):      (0, 255)
[info] Model Optimization is done
[info] Saved HAR to: /home/user/hds/hailo_model_zoo/yolov9-t-converted.har

Now I try to compile the model:
hailo compiler --hw-arch hailo8 --model-script /home/user/hds/hailo_model_zoo/hailo_model_zoo/cfg/alls/generic/yolov9t.alls --output-dir . --output-har-path yolov9-t-compiled.har /home/user/hds/hailo_model_zoo/yolov9-t-converted.har

which fails:

[info] Current Time: 09:42:34, 07/30/24
[info] CPU: Architecture: x86_64, Model: Intel(R) Core(TM) i9-9900X CPU @ 3.50GHz, Number Of Cores: 20, Utilization: 0.1%
[info] Memory: Total: 62GB, Available: 56GB
[info] System info: OS: Linux, Kernel: 6.5.0-41-generic
[info] Hailo DFC Version: 3.27.0
[info] HailoRT Version: Not Installed
[info] PCIe: b5:00.0: Number Of Lanes: 4, Speed: 8.0 GT/s PCIe
[info] PCIe: b6:00.0: Number Of Lanes: 4, Speed: 8.0 GT/s PCIe
[info] PCIe: b7:00.0: Number Of Lanes: 4, Speed: 8.0 GT/s PCIe
[info] PCIe: b8:00.0: Number Of Lanes: 4, Speed: 8.0 GT/s PCIe
[info] Running `hailo compiler --hw-arch hailo8 --model-script /home/user/hds/hailo_model_zoo/hailo_model_zoo/cfg/alls/generic/yolov9t.alls --output-dir . --output-har-path yolov9-t-compiled.har /home/user/hds/hailo_model_zoo/yolov9-t-converted.har`
[info] Loading model script commands to yolov9-t from /home/user/hds/hailo_model_zoo/hailo_model_zoo/cfg/alls/generic/yolov9t.alls
[info] Compiling network
[info] Loading network parameters
[info] Starting Hailo allocation and compilation flow
[error] Mapping Failed (allocation time: 14s)
No successful assignment for: format_conversion1_defuse_reshape_hxf_to_w_transposed, format_conversion1_defuse_width_feature_reshape, concat31

[error] Failed to produce compiled graph
[error] BackendAllocatorException: Compilation failed: No successful assignment for: format_conversion1_defuse_reshape_hxf_to_w_transposed, format_conversion1_defuse_width_feature_reshape, concat31

Can someone look into this? I’m kind of stuck here.

Hi @dennis.huegle,

I think the problem here is that the parser’s end-node recommendation is wrong.
In general, the end nodes of the model should be the last layers that are not part of the postprocessing. In this case, the suggested end node is a concat layer, which is generally not a good choice for an end node when using the Hailo SW. I suggest you parse the model manually and define the end nodes yourself (they should be the last conv or activation layer/s).
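
If it helps, one quick way to find candidates is to list the Conv nodes in the exported graph with the onnx package and pick the last ones before the postprocessing subgraph, e.g. (a minimal sketch; file name as in your export):

import onnx

model = onnx.load("yolov9-t-converted.onnx")
# Print every Conv node and its output tensor; the last Conv of each output
# branch (cf. the /model.22/cv2.* and /model.22/cv3.* names in the parser
# log above) is a natural end-node candidate.
for node in model.graph.node:
    if node.op_type == "Conv":
        print(node.name, "->", node.output[0])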

I tried to use the Ultralytics API to export an ONNX from the .pt file you provided, like so:

from ultralytics import YOLO
model = YOLO("./yolov9-t-converted.pt")

but got this error:

TypeError: ERROR ❌️ ../yolov9-t-converted.pt appears to be an Ultralytics YOLOv5 model originally trained with https://github.com/ultralytics/yolov5.
This model is NOT forwards compatible with YOLOv8 at https://github.com/ultralytics/ultralytics.
Recommend fixes are to train a new model using the latest 'ultralytics' package or to run a command with an official Ultralytics model, i.e. 'yolo predict model=yolov8n.pt'

Is this model really of the yolov9 architecture?

Regards,

Dear @Omer,

Is this model really of the yolov9 architecture?

I have the yolov9 weights from here:

https://github.com/WongKinYiu/yolov9/releases/download/v0.1/yolov9-t-converted.pt

It’s taken directly from the yolov9 repository: YoloV9 → Readme → Performance → YoloV9-T.

You can generate an onnx file like this:

git clone https://github.com/WongKinYiu/yolov9 && cd yolov9
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install onnx onnxruntime onnx-simplifier
curl -L -O https://github.com/WongKinYiu/yolov9/releases/download/v0.1/yolov9-t-converted.pt
python export.py --weights yolov9-t-converted.pt --imgsz 640 640 --optimize --simplify --include onnx --batch-size 1
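
To sanity-check the export before parsing, something like this works (a small sketch; the input name images and the output output0 match the parser command above):

import numpy as np
import onnx
import onnxruntime as ort

# Validate the graph, then run a dummy NCHW frame through it.
onnx.checker.check_model(onnx.load("yolov9-t-converted.onnx"))
sess = ort.InferenceSession("yolov9-t-converted.onnx",
                            providers=["CPUExecutionProvider"])
out = sess.run(None, {"images": np.zeros((1, 3, 640, 640), dtype=np.float32)})
print([o.shape for o in out])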

I can also upload it somewhere if you provide me with an upload space. Also, why exactly is a concat layer not a good idea as an end node?

Hi @dennis.huegle,
Thanks for the instructions. I was able to export the ONNX.
As I thought, the problem was an incorrect end-node suggestion by the Hailo parser.
For me, the suggestion was this:

Parsing failed with recommendations for end node names: ['/model.22/dfl/Reshape_1']

which is deep inside the postprocessing ops:
[Image: Netron view showing /model.22/dfl/Reshape_1 inside the postprocessing subgraph]

While in reality, there should be 6 end nodes, the two convolution layers at the end of each branch, for example:
[Image: Netron view of one detection branch ending in its two convolution layers]

You can see that from there on, there are no more neural ops (the convolution layer after the Softmax is de facto a ReduceSum), so the parsing command should look like this:

hailo parser onnx yolov9-t-converted.onnx --end-node-names /model.22/cv2.0/cv2.0.2/Conv /model.22/cv3.0/cv3.0.2/Conv /model.22/cv2.1/cv2.1.2/Conv /model.22/cv3.1/cv3.1.2/Conv /model.22/cv2.2/cv2.2.2/Conv /model.22/cv3.2/cv3.2.2/Conv

The output shapes of these end nodes should then be:
(1,80,80,64)
(1,80,80,80)
(1,40,40,64)
(1,40,40,80)
(1,20,20,64)
(1,20,20,80)

Here, the 64 = 4×16 channels encode the 4 bbox coordinates with 16 regression-bin values each, and the 80 channels are the number of classes the model was trained on.
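
If you do decode on the host instead of using the Hailo NMS, the DFL part for one 64-channel branch looks roughly like this (my sketch under the shapes above, assuming dequantized float outputs; not the exact postprocess Hailo’s tools use):

import numpy as np

def decode_dfl(reg, num_bins=16):
    # reg: one 64-channel branch output, NHWC as listed above, e.g. (80, 80, 64).
    h, w, _ = reg.shape
    logits = reg.reshape(h, w, 4, num_bins)
    # Softmax over the 16 bins of each box side (numerically stabilized)...
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    # ...then the expectation over the bin indices gives the
    # left/top/right/bottom distances in stride units.
    bins = np.arange(num_bins, dtype=np.float32)
    return (probs * bins).sum(axis=-1)  # (H, W, 4)

The distances are then scaled by the branch stride (8/16/32), turned into boxes around each cell center, and combined with the sigmoid of the matching 80-channel class branch before NMS.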
You’ll probably get the option to add the Hailo NMS when you run the command above, and I encourage you to do so, as it’s more efficient and will allow you to use the compiled HEF with our scripts and applications, which you can find here:

Regards,

Dear @Omer,
thank you very much for your help. The model compiles now, but when I try to run it with hailortcli run ./yolov9-t.hef, it fails:

hailortcli run ./yolov9-t.hef
Running streaming inference (./yolov9-t.hef):
  Transform data: true
    Type:      auto
    Quantized: true
[HailoRT] [error] CHECK failed - Failed opening non-compatible HEF with the following unsupported extensions: Periph configuration calculated in HailoRT (PERIPH_CALCULATION_IN_HAILORT)
[HailoRT] [error] CHECK_SUCCESS failed with status=HAILO_INVALID_HEF(26)
[HailoRT] [error] Failed parsing HEF file
[HailoRT] [error] Failed creating HEF
[HailoRT] [error] CHECK_EXPECTED failed with status=HAILO_INVALID_HEF(26)
[HailoRT CLI] [error] CHECK_EXPECTED failed with status=HAILO_INVALID_HEF(26) - Failed reading hef file ./yolov9-t.hef
[HailoRT CLI] [error] CHECK_EXPECTED_AS_STATUS failed with status=HAILO_INVALID_HEF(26)

Python env:

$ pip freeze | grep -i hailo
hailo-dataflow-compiler @ file:///home/vector/hailo_dataflow_compiler-3.27.0-py3-none-linux_x86_64.whl
hailo-model-zoo @ file:///home/vector/hds/hailo_model_zoo

OS Hailo env:

$ apt list | grep -i hailo

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

golang-github-hailocab-go-hostpool-dev/jammy,jammy 0.0~git20160125.0.e80d13c-1.1 all
hailort-pcie-driver/now 4.12.0 all [installed,local]
hailort/now 4.12.0 amd64 [installed,local]

Sorry, looks like the HailoRT version was too old. Updated to 4.18 and it runs! Many thanks!
It only reaches ~100 FPS, however, which strikes me as rather low compared to YoloV7-tiny, which I converted to ~400 FPS without any optimization on my side. Any tips on where I could look?

This thread can be closed nevertheless.

Best,
Dennis

Hi @dennis.huegle,
Please run “hailortcli parse-hef ./yolov9-t.hef” and see whether the model is single-context or multi-context.

If it’s single context, there are two things you can do:

  1. In the optimization step, increase the compression level (this might hurt accuracy a bit, as more weights will be quantized to 4 bits).
  2. Run the compilation with the alls command “performance_param(compiler_optimization_level=max)” (compilation will take a long time, but it will give you the best possible performance); see the sketch after this list.
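
For example, the model script from the optimization step above would just gain one extra line (a sketch of the same .alls shown earlier; only the last line is new):

normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
model_optimization_config(calibration, batch_size=2)
post_quantization_optimization(finetune, policy=enabled, learning_rate=1e-5)
performance_param(compiler_optimization_level=max)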

If it’s multi-context, it makes sense that the FPS is lower compared to the single-context yolov7-tiny compiled model.
A compiled model is loaded onto the Hailo chip, which has limited resources. If a model is too big to fit the chip’s resources, we use “multiple contexts”: the model is broken into two or more parts, each of which fits the chip’s resources, and only one part runs on the chip at a time while the rest are stored in the host machine’s memory.
This allows bigger models to be compiled, but the overhead is that performance (FPS, latency) suffers because of the context switching.
There are two ways to increase performance when you have a multi-context model:

  1. Increase the batch size when running inference (hailortcli run ./yolov9-t.hef --batch-size 8, for example)
  2. Run compilation with the alls commands “performance_param(compiler_optimization_level=max)”

If it’s single context and the suggestions above don’t help, there’s not much we can do to increase performance.

Regards,