I’m confused about the Zero Shot Classification and Text Image Retrieval entries for Hailo-8 in the Hailo Model Zoo.

First, is it correct that the clip_text_encoder(s) under Text Image Retrieval correspond to the text_encoder in CLIP, and that the Zero Shot Classification entries correspond to the image_encoder?

If so, I can understand that the output of the text encoder (clip_text_encoder_vitb_16) is (Batch) x (Maximum Token) x (Embedding Size), i.e., 1 x 77 x 512, but I don’t understand why the input is also 1 x 77 x 512.

Hi @Park_Hanyu,

Welcome to the Hailo Community!

There are two separate networks: one for text encoding (used to encode the prompts) and one for image encoding. The inference results from both are then combined to compute the probabilities for the prompts.

As for the input shape: the compiled text encoder appears to take already-embedded tokens rather than raw token IDs — the token-embedding lookup runs on the host before inference — which would explain why its input shape matches the 1 x 77 x 512 output shape.
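For reference, the step that combines the two encoders' outputs can be sketched in plain NumPy roughly as below. This is an illustrative sketch of CLIP's standard recipe (L2-normalize, cosine similarity, temperature-scaled softmax), not Hailo's actual post-processing code; the function name, the logit scale of 100, and the assumption that each encoder output has already been reduced to a single 512-d vector per item are all mine.

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Turn image/text encoder outputs into per-prompt probabilities.

    image_emb: (1, 512) image embedding.
    text_embs: (N, 512) one embedding per prompt (for the 1 x 77 x 512
               text-encoder output, CLIP takes the vector at the EOT
               token position to get a single 512-d vector per prompt).
    logit_scale: CLIP's learned temperature; 100.0 is illustrative.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    # Scaled cosine similarities, then a numerically stable softmax.
    logits = logit_scale * image_emb @ text_embs.T
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```

The prompt with the highest probability is the zero-shot prediction for the image.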
