First, is it correct that the clip_text_encoder(s) listed under Text Image Retrieval correspond to CLIP's text_encoder, and that the entries under Zero Shot Classification correspond to its image_encoder?
If so, I can understand why the output of the text encoder (clip_text_encoder_vitb_16) has shape (Batch) x (Maximum Tokens) x (Embedding Size), i.e., 1 x 77 x 512, but I don't understand why the input is also 1 x 77 x 512. I would have expected the input to be a sequence of token IDs of shape 1 x 77, with the 512-dimensional embedding only appearing after the token-embedding layer inside the encoder.
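For reference, here is a minimal sketch of the shapes I would expect, assuming the exported model wraps the official openai/CLIP implementation (the model name "ViT-B/16" and the use of clip.tokenize / encode_text are my assumptions about what clip_text_encoder_vitb_16 corresponds to):

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

# Load the ViT-B/16 variant of CLIP (assumed to match clip_text_encoder_vitb_16).
model, _ = clip.load("ViT-B/16")

# Tokenization produces integer token IDs padded to the context length of 77,
# so the text-encoder input I would expect is (Batch) x (Maximum Tokens) = 1 x 77.
tokens = clip.tokenize(["a photo of a cat"])
print(tokens.shape)  # torch.Size([1, 77]), dtype int64

with torch.no_grad():
    # encode_text returns the pooled text embedding, shape (Batch) x (Embedding Size);
    # the 1 x 77 x 512 tensor only exists internally, after the token-embedding layer.
    text_features = model.encode_text(tokens)
print(text_features.shape)  # torch.Size([1, 512])
```

Given this, it is unclear to me where a 1 x 77 x 512 *input* would come from, unless the exported model expects the already-embedded token sequence rather than raw token IDs.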