Hi, great project idea!
Reference project that might be helpful: the CLIP pipeline app in the hailo-ai/hailo-apps repo on GitHub, under hailo_apps/python/pipeline_apps/clip.
The HEF includes normalization on-chip. It doesn’t include resizing/cropping - you need to handle that.
What the HEF does (on-chip)
The TinyCLIP image encoder HEFs from the Hailo Model Zoo are compiled with normalization baked in. This is configured in the Model Zoo’s base/clip.yaml config:
```yaml
parser:
  normalization_params:
    normalize_in_net: true
    mean_list: [122.7709383, 116.7460125, 104.09373615]
    std_list: [68.5005327, 66.6321579, 70.32316305]
```
`normalize_in_net: true` means these normalization parameters were fused into the HEF at compile time. The HEF expects UINT8 RGB pixels in [0, 255] and internally computes (pixel - mean) / std, so there is no need to normalize in your preprocessing code.
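As a sanity check, these are just the standard OpenAI CLIP normalization constants rescaled from [0, 1] inputs to [0, 255] inputs, which you can verify in a few lines of Python:

```python
import numpy as np

# Standard OpenAI CLIP normalization (defined for inputs scaled to [0, 1])
clip_mean = np.array([0.48145466, 0.4578275, 0.40821073])
clip_std = np.array([0.26862954, 0.26130258, 0.27577711])

# Values baked into the HEF (defined for raw [0, 255] inputs)
hef_mean = np.array([122.7709383, 116.7460125, 104.09373615])
hef_std = np.array([68.5005327, 66.6321579, 70.32316305])

assert np.allclose(clip_mean * 255, hef_mean)
assert np.allclose(clip_std * 255, hef_std)
```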
What you need to do yourself (preprocessing)
The standard CLIP preprocessing that you need to implement in your C++ / Python code:
- Resize the image so the shorter side is 224 pixels (bicubic interpolation), preserving aspect ratio
- Center-crop to 224x224
- Feed the resulting 224x224x3 UINT8 RGB image directly to the HEF
This matches the Model Zoo’s clip preprocessing function in classification_preprocessing.py:
```python
@PREPROCESS_FACTORY.register
def clip(image, image_info=None, output_height=None, output_width=None, **kwargs):
    image = pil_resize(image, output_height, output_width)  # Bicubic resize to 224x224
    image = tf.cast(image, tf.float32)
    return image, image_info
```
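In your own code, a minimal sketch of this preprocessing using Pillow and NumPy (the function name and file handling are illustrative, not taken from the Model Zoo):

```python
import numpy as np
from PIL import Image

def preprocess_image(path, size=224):
    img = Image.open(path).convert("RGB")

    # Resize so the shorter side is `size`, preserving aspect ratio (bicubic)
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)

    # Center-crop to size x size
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    img = img.crop((left, top, left + size, top + size))

    # Raw [0, 255] UINT8 RGB, NHWC - normalization happens on-chip
    return np.asarray(img, dtype=np.uint8)[np.newaxis, ...]  # (1, 224, 224, 3)
```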
Verify with `hailortcli parse-hef`: this will show the input layer format (UINT8, NHWC, 224x224x3). If normalization is baked in, the input expects raw [0, 255] values - which is exactly the case here.
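If you prefer to check this programmatically, here is a small sketch using the HailoRT Python bindings (the hailo_platform package); the HEF filename is a placeholder, and attribute details can vary between HailoRT versions:

```python
from hailo_platform import HEF

hef = HEF("tinyclip_image_encoder.hef")  # placeholder - point this at your HEF

# Print the name and shape of every input/output vstream
for info in hef.get_input_vstream_infos():
    print("input:", info.name, info.shape)
for info in hef.get_output_vstream_infos():
    print("output:", info.name, info.shape)
```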
You can also look at the model’s .alls script in the Model Zoo at cfg/alls/tinyclip_vit_39m_16_text_19m_yfcc15m_image_encoder.alls to see what compile-time modifications were applied.
For the text encoder
The text encoder HEF expects already-embedded tokens (a 1x77x512 tensor), NOT raw text. You'll need to (see the sketch after this list):
- Tokenize text using the TinyCLIP tokenizer (from HuggingFace: wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M)
- Run the token embedding layer on CPU (this is a simple lookup table)
- Feed the 1x77x512 tensor to the HEF
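A sketch of those host-side steps, assuming the wkcn/TinyCLIP checkpoint loads with the standard transformers CLIP classes (which expose the 512-wide token embedding table); whether positional embeddings also need to be applied on the host depends on how the HEF was cut, so compare against the Model Zoo's text preprocessing code:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

MODEL_ID = "wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M"

tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID)
model = CLIPModel.from_pretrained(MODEL_ID)

# Tokenize to CLIP's fixed context length of 77
tokens = tokenizer(["a photo of a cat"], padding="max_length",
                   max_length=77, return_tensors="pt")

# Token-embedding lookup on CPU; the transformer layers themselves run in the HEF
with torch.no_grad():
    embedded = model.text_model.embeddings.token_embedding(tokens["input_ids"])

hef_input = embedded.numpy()  # shape (1, 77, 512) - feed this to the text encoder HEF
```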
The text encoder also has a postprocessing step: extract the hidden state at the EOT (end-of-text) token position, multiply it by the text projection matrix, then L2-normalize. See text_encoding_postprocess.py in the Model Zoo.
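In NumPy, that postprocessing looks roughly like the sketch below; it follows the standard CLIP convention of pooling at the EOT token (the token with the highest id), and it assumes you pull the text projection matrix from the same HuggingFace checkpoint (e.g. model.text_projection.weight, transposed) - check both against the Model Zoo's implementation:

```python
import numpy as np

def postprocess_text(hidden_states, input_ids, text_projection):
    """
    hidden_states:   (1, 77, 512) output of the text encoder HEF
    input_ids:       (1, 77) token ids from the tokenizer
    text_projection: (512, proj_dim) matrix, e.g. model.text_projection.weight.T
    """
    eot_idx = int(np.argmax(input_ids[0]))        # EOT token has the highest id in CLIP's vocab
    pooled = hidden_states[0, eot_idx]            # hidden state at the EOT position
    embedding = pooled @ text_projection          # project into the shared image/text space
    return embedding / np.linalg.norm(embedding)  # L2-normalize for cosine similarity
```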
Hope this helps with your image search project!