How does a HEF file transform data? Does it preprocess images automatically?

Hello everyone,

I am working on a project to create an image search GUI that finds images matching a specific prompt. For this, I am using the TinyCLIP image and text encoders from the Hailo Model Zoo, and I am writing the code in both C++ and Python. The image encoding will run over around 100k images as an async inference in C++, whereas the image search and text encoding will happen in Python.

Where I am struggling is figuring out whether the HEF preprocesses the images or whether I have to implement my own preprocessing. If it does preprocess the images, how can I see exactly what it is doing?

Links to the tinyclip_vit_39m encoders:

Image encoder: hailo_model_zoo/docs/public_models/HAILO8/HAILO8_zero_shot_classification.rst at master · hailo-ai/hailo_model_zoo · GitHub
Text encoder: hailo_model_zoo/docs/public_models/HAILO8/HAILO8_text_image_retrieval.rst at master · hailo-ai/hailo_model_zoo · GitHub

Hi, great project idea!

Reference project that might be helpful: hailo-apps/hailo_apps/python/pipeline_apps/clip at main · hailo-ai/hailo-apps · GitHub

The HEF includes normalization on-chip. It doesn’t include resizing/cropping - you need to handle that.

What the HEF does (on-chip)

The TinyCLIP image encoder HEFs from the Hailo Model Zoo are compiled with normalization baked in. This is configured in the Model Zoo’s base/clip.yaml config:

parser:
  normalization_params:
    normalize_in_net: true
    mean_list: [122.7709383, 116.7460125, 104.09373615]
    std_list: [68.5005327, 66.6321579, 70.32316305]

normalize_in_net: true means these normalization parameters were fused into the HEF at compile time. The HEF expects UINT8 RGB pixels in [0, 255] and internally does (pixel - mean) / std. No need to normalize in preprocessing code.
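If you ever want to sanity-check the HEF's output against an off-chip float model, you can replicate the fused normalization in NumPy. A minimal sketch, using the mean/std values from the clip.yaml config above:

```python
import numpy as np

# Values copied from base/clip.yaml (RGB channel order)
MEAN = np.array([122.7709383, 116.7460125, 104.09373615], dtype=np.float32)
STD = np.array([68.5005327, 66.6321579, 70.32316305], dtype=np.float32)

def normalize_like_hef(uint8_rgb: np.ndarray) -> np.ndarray:
    """Replicates what the chip does internally: (pixel - mean) / std per channel.

    uint8_rgb: HxWx3 uint8 array in [0, 255]. You do NOT run this before
    inference - the HEF does it on-chip. It is only useful for comparing
    against a float reference model.
    """
    return (uint8_rgb.astype(np.float32) - MEAN) / STD
```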

What you can do yourself (preprocessing)

The standard CLIP preprocessing that you need to implement in your C++ / Python code:

  1. Resize the image so the shorter side is 224 pixels (bicubic interpolation), preserving aspect ratio
  2. Center-crop to 224x224
  3. Feed the resulting 224x224x3 UINT8 RGB image directly to the HEF
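The three steps above can be sketched in Python with PIL and NumPy (a minimal sketch, not the Model Zoo's exact implementation - the 224 target size and bicubic interpolation follow the standard CLIP recipe described above):

```python
import numpy as np
from PIL import Image

TARGET = 224  # input size of the TinyCLIP image encoder HEF

def preprocess(img: Image.Image) -> np.ndarray:
    """Resize shorter side to 224 (bicubic), center-crop, return raw uint8 RGB."""
    img = img.convert("RGB")
    # 1. Resize so the shorter side is 224, preserving aspect ratio
    w, h = img.size
    scale = TARGET / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    # 2. Center-crop to 224x224
    w, h = img.size
    left, top = (w - TARGET) // 2, (h - TARGET) // 2
    img = img.crop((left, top, left + TARGET, top + TARGET))
    # 3. Raw uint8 in [0, 255] - normalization happens on-chip in the HEF
    return np.asarray(img, dtype=np.uint8)  # shape (224, 224, 3)
```

For the C++ side, the equivalent would be an OpenCV resize with `INTER_CUBIC` followed by an ROI crop.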

This matches the Model Zoo’s clip preprocessing function in classification_preprocessing.py:

@PREPROCESS_FACTORY.register
def clip(image, image_info=None, output_height=None, output_width=None, **kwargs):
    image = pil_resize(image, output_height, output_width)  # Bicubic resize to 224x224
    image = tf.cast(image, tf.float32)
    return image, image_info

Verify with hailortcli parse-hef: this will show the input layer format (UINT8, NHWC, 224x224x3). When normalization is baked in, the input expects raw [0, 255] values, which is exactly the case here.
You can also look at the model’s .alls script in the Model Zoo at cfg/alls/tinyclip_vit_39m_16_text_19m_yfcc15m_image_encoder.alls to see what compile-time modifications were applied.

For the text encoder

The text encoder HEF expects tokenized input (1x77x512 - already embedded tokens), NOT raw text. You’ll need to:

  1. Tokenize text using the TinyCLIP tokenizer (from HuggingFace wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M)
  2. Run the token embedding layer on CPU (this is a simple lookup table)
  3. Feed the 1x77x512 tensor to the HEF
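A minimal sketch of step 2, assuming NumPy: the embedding lookup is just row indexing into the embedding matrix. The commented `transformers` calls for step 1 are an assumption to verify against the HuggingFace checkpoint named above:

```python
import numpy as np

CONTEXT_LEN = 77  # CLIP context length (from the 1x77x512 HEF input shape)
EMBED_DIM = 512   # TinyCLIP text embedding width

def embed_tokens(token_ids: np.ndarray, embedding_table: np.ndarray) -> np.ndarray:
    """Step 2: the token embedding layer is a plain row lookup, cheap on CPU.

    token_ids: (1, 77) int array of token ids.
    embedding_table: (vocab_size, 512) float array from the checkpoint.
    Returns the (1, 77, 512) tensor to feed to the text encoder HEF.
    """
    return embedding_table[token_ids].astype(np.float32)

# Step 1 in practice (assumption - verify against the checkpoint):
#   from transformers import CLIPTokenizer, CLIPTextModel
#   tok = CLIPTokenizer.from_pretrained("wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M")
#   model = CLIPTextModel.from_pretrained("wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M")
#   table = model.text_model.embeddings.token_embedding.weight.detach().numpy()
#   ids = tok("a photo of a dog", padding="max_length",
#             max_length=CONTEXT_LEN, return_tensors="np").input_ids
```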

The text encoder also has a postprocessing step: extract the last token’s hidden state and multiply by the text projection matrix, then L2-normalize. See text_encoding_postprocess.py in the Model Zoo.
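The postprocessing described above can be sketched like this (a minimal sketch in NumPy; the real projection matrix comes from the checkpoint, and you should check text_encoding_postprocess.py for the exact pooling rule - e.g. whether it pools the literal last position or the EOT token position):

```python
import numpy as np

def postprocess(hidden: np.ndarray, text_projection: np.ndarray) -> np.ndarray:
    """hidden: (1, 77, 512) HEF output; text_projection: (512, out_dim).

    Extracts the last token's hidden state, projects it into the joint
    image/text embedding space, and L2-normalizes the result.
    """
    pooled = hidden[:, -1, :]       # last token's hidden state: (1, 512)
    emb = pooled @ text_projection  # project into the shared CLIP space
    return emb / np.linalg.norm(emb, axis=-1, keepdims=True)
```

After this, ranking images for a prompt is just a dot product between the normalized text embedding and your 100k normalized image embeddings.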

Hope this helps with your image search project!


Could never have asked for a better explanation. Not only did you tell me what I should do, you also showed me where I can check these things myself, which is awesome. Thanks a lot <3


Thanks @aiybay, nice to hear, and you're always welcome!