HailoRT tensor configuration and memory allocation question

Hi, I have recently been adapting some new models for HailoRT and I'm a bit confused about how its input and output tensors work, so I'm hoping for some help.

Here are my steps (with a rough code sketch below the list):

  1. Create ConfiguredInferModel from InferModel.
  2. Obtain the shape, format, and frame_size of the input and output tensors from InferModel.
  3. Allocate memory for the input and output tensors on the Host heap, with a size of frame_size each.
  4. Bind the input and output tensors to the allocated memory using set_buffer through ConfiguredInferModel::Bindings.
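
In code, these steps look roughly like this. This is only a simplified sketch against the HailoRT C++ API as I understand it: error handling is omitted, "model.hef" and the 1000 ms timeout are placeholders, and the exact method names may differ between HailoRT versions.

#include "hailo/hailort.hpp"

#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

using namespace hailort;

int main()
{
    // Step 1: InferModel -> ConfiguredInferModel (error handling omitted)
    auto vdevice = VDevice::create().release();
    auto infer_model = vdevice->create_infer_model("model.hef").release();
    auto configured_infer_model = infer_model->configure().release();
    auto bindings = configured_infer_model.create_bindings().release();

    std::unordered_map<std::string, std::vector<uint8_t>> buffers;

    // Steps 2-4 for every input: query frame_size, allocate on the Host heap, bind with set_buffer
    for (const auto &name : infer_model->get_input_names()) {
        size_t frame_size = infer_model->input(name)->get_frame_size();                  // Step 2
        buffers[name].resize(frame_size);                                                // Step 3
        bindings.input(name)->set_buffer(MemoryView(buffers[name].data(), frame_size));  // Step 4
    }

    // Steps 2-4 for every output
    for (const auto &name : infer_model->get_output_names()) {
        size_t frame_size = infer_model->output(name)->get_frame_size();
        buffers[name].resize(frame_size);
        bindings.output(name)->set_buffer(MemoryView(buffers[name].data(), frame_size));
    }

    // Run a single frame synchronously (arbitrary timeout)
    configured_infer_model.run(bindings, std::chrono::milliseconds(1000));
    return 0;
}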

I have a few questions about this process:

  1. Given a tensor's shape and format.type, can I assume that frame_size is fully determined and equals the byte size I need to allocate on the Host? In other words, do I need to take the Device-side memory layout into account when allocating?
  2. Based on the above steps, when parsing the output, do I only need to care about format.order for the Host-side memory layout? For example, with HAILO_FORMAT_ORDER_FCR, do I just iterate over my allocated memory as [N, H, W, C]?
  3. After calling set_buffer once, do I need to call it again before running inference on the next frame?

Furthermore, for a UINT8 FCR tensor with shape [1, 1, 2, 2], I should allocate 1 * 1 * 2 * 2 * sizeof(UINT8) = 4 bytes on the Host. Is the Device-side memory layout then something like the example below?

/**
 * FCR means first channels (features) are sent to HW:
 *  - Host side: [N, H, W, C]
 *  - Device side: [N, H, W, C]:
 *      - Input - channels are expected to be aligned to 8 bytes
 *      - Output - width is padded to 8 bytes
 */

- Host
             value(addr)
flattened: [ 1(0x00) , 2(0x01) , 3(0x02) , 4(0x03) ]
index:       0,0,0,0 | 0,0,0,1 | 0,0,1,0 | 0,0,1,1
             N,H,W,C

- Device (Input)
flattened: [ 1(0x00) ... 2(0x08) ... 3(0x10) ... 4(0x18) ]
index:       0,0,0,0  |  0,0,0,1  |  0,0,1,0  |  0,0,1,1

- Device (Output)
flattened: [ 1(0x00) , 2(0x01) ... 3(0x08) , 4(0x09) ]
index:       0,0,0,0 | 0,0,0,1  |  0,0,1,0 | 0,0,1,1

If there are any mistakes in my understanding, please let me know. Thank you!

Hey @eka

Welcome to the Hailo Community!

Your steps and questions show a solid understanding of how HailoRT handles input and output tensors. Let me clarify a few of the concepts and address some potential misunderstandings:

  1. Host Memory Allocation and frame_size:

    • You’re right! The frame_size you query for each input and output tensor (e.g., from InferModel) is the byte size needed for that tensor on the host side. You don’t need to worry about the device’s internal memory layout when allocating host memory.
    • frame_size already accounts for everything the host buffer needs, so you can use it directly for host-side allocation without further adjustments; any device-side padding or alignment is handled internally by HailoRT (there’s a sketch of this after the list).
  2. Memory Layout for Output Parsing:

    • When parsing an output tensor, follow its format.order to iterate through the memory you allocated on the host. For HAILO_FORMAT_ORDER_FCR, you’re correct: iterate the host buffer as [N, H, W, C] (the second sketch after the list shows this).
    • The device may store the tensor differently internally (for example with padding or alignment for optimization), but HailoRT performs this transformation when transferring data between host and device. Just focus on the host memory layout, and you’ll be good to go!
  3. Reusing Allocated Memory with set_buffer:

    • Once you’ve called set_buffer, you don’t need to call it again for every frame as long as the binding keeps pointing at the same allocation; just overwrite the buffer contents and run inference again (the first sketch after the list shows this loop).
    • You only need to call set_buffer again if you want to bind a different buffer, or if the original allocation has been freed or otherwise invalidated.
  4. Memory Allocation for Specific Shapes:

    • For a tensor of shape [1, 1, 2, 2] with format UINT8 and order FCR, your calculation of the required host memory (1 * 1 * 2 * 2 * sizeof(UINT8) = 4 bytes) is spot-on!
    • The device’s memory layout may differ from the host:
      • Input: Channels may be aligned to 8 bytes.
      • Output: Width may be padded to 8 bytes.
    • These alignments are device-specific optimizations handled by the HailoRT runtime. As long as you provide the correct frame_size for the tensor on the host, the runtime takes care of these details during data transfers.
  5. Device and Host Memory Layout Example:

    • The example you provided is a great illustration of how HailoRT typically handles memory layouts:
      • Host (Input/Output): [N, H, W, C] without additional padding.
      • Device Input: Channels aligned to 8 bytes.
      • Device Output: Width padded to 8 bytes.
    • The flattened memory values and indexing in your example correctly demonstrate how the device may internally align or pad data for performance reasons. You don’t need to worry about these transformations, as HailoRT ensures that host-to-device and device-to-host memory mappings are handled transparently.
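
To make points 1 and 3 concrete, here is a minimal sketch of the buffer lifetime. It continues from the setup in your steps, so infer_model, configured_infer_model and bindings are assumed to already exist; "input0", "output0", fill_next_frame() and num_frames are placeholders, and I'm using the synchronous run() call for brevity:

// Points 1 & 3: allocate frame_size bytes once, bind once, reuse for every frame.
std::vector<uint8_t> input_buffer(infer_model->input("input0")->get_frame_size());
std::vector<uint8_t> output_buffer(infer_model->output("output0")->get_frame_size());

bindings.input("input0")->set_buffer(MemoryView(input_buffer.data(), input_buffer.size()));
bindings.output("output0")->set_buffer(MemoryView(output_buffer.data(), output_buffer.size()));

for (int frame = 0; frame < num_frames; frame++) {
    fill_next_frame(input_buffer);   // overwrite the already-bound input buffer in place
    // No second set_buffer() call is needed: the binding still points at the same memory.
    configured_infer_model.run(bindings, std::chrono::milliseconds(1000));
    // output_buffer now holds this frame's result in the Host-side layout.
}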
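
And for point 2, reading an FCR output on the host is a plain [N, H, W, C] walk over the buffer you allocated, with the channels contiguous and no padding. This assumes a UINT8 output like the one in your example; the height/width/features fields come from the stream's shape():

// Point 2: on the Host an FCR output is plain NHWC, so with N = 1 the element at
// (h, w, c) lives at byte offset (h * width + w) * features + c.
const auto shape = infer_model->output("output0")->shape();
for (uint32_t h = 0; h < shape.height; h++) {
    for (uint32_t w = 0; w < shape.width; w++) {
        for (uint32_t c = 0; c < shape.features; c++) {
            size_t idx = (static_cast<size_t>(h) * shape.width + w) * shape.features + c;
            uint8_t value = output_buffer[idx];   // assumes a UINT8 output, as in your example
            // ... use value ...
        }
    }
}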

I hope this clears things up! If you have any more questions or need further clarification, feel free to ask. We’re here to help!

Thanks for your insightful and professional response regarding tensor configuration and memory allocation in HailoRT. Your expertise is greatly appreciated and very helpful to me.