Dear all,
I am using HailoRT v4.23.0 with the C++ API.
I am running inference with a ResNet model. It works with no issues and is blazing fast on a single channel, but when I run batches of up to 16 images of 256x256x4 it slows down massively.
Based on running the model with hailortcli run, I would expect about 800 images/s, but the actual performance is closer to 128 images/s.
If I investigate with perf top, I see that most of the time is spent in the function hailort::reorder_input_stream.
My input workflow follows the async_infer_basic_example.cpp example:
- Create vdevice and parse .hef model
- model->set_batch_size(16) and configure it
- Create 16 bindings with configuredModel.create_bindings()
- Allocate 16 page aligned buffers of size input.get_frame_size() (256x256x4 bytes)
- For each binding, call set_buffer(hailort::MemoryView(…)) on its input
- memcpy an HxWxC image into each buffer
- Call configuredModel.run_async(bindings)
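For reference, the buffer-preparation part of the steps above (allocation and copy, without the HailoRT calls) looks roughly like this. This is a minimal sketch, not my exact code: the frame size and batch size are taken from the numbers above, the 4096-byte page size is an assumption (real code should query sysconf(_SC_PAGESIZE)), and the function names are mine.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Sizes from the post: 16 frames of 256x256x4 uint8 (HxWxC).
// kPageSize is an assumption; query sysconf(_SC_PAGESIZE) in real code.
constexpr std::size_t kFrameSize = 256 * 256 * 4;
constexpr std::size_t kPageSize  = 4096;
constexpr std::size_t kBatch     = 16;

// std::aligned_alloc requires the size to be a multiple of the alignment,
// so round the frame size up to the next page boundary.
constexpr std::size_t page_round_up(std::size_t n) {
    return (n + kPageSize - 1) / kPageSize * kPageSize;
}

// One page-aligned buffer per frame in the batch, matching what
// input.get_frame_size() reports for this model.
inline std::vector<uint8_t*> allocate_batch_buffers() {
    std::vector<uint8_t*> buffers;
    buffers.reserve(kBatch);
    for (std::size_t i = 0; i < kBatch; ++i) {
        auto* p = static_cast<uint8_t*>(
            std::aligned_alloc(kPageSize, page_round_up(kFrameSize)));
        buffers.push_back(p);
    }
    return buffers;
}

// Copy one interleaved HxWxC image into a prepared buffer.
inline void fill_buffer(uint8_t* dst, const uint8_t* hwc_image) {
    std::memcpy(dst, hwc_image, kFrameSize);
}
```

Each of these buffers is then wrapped in a hailort::MemoryView and attached to a binding before run_async, as in the list above.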
Am I doing something wrong? Should I drop down to the lower-level API and handle the padding on the channel dimension myself? Should I upgrade to version 5.x?