First, is it correct that the clip_text_encoder(s) listed under Text Image Retrieval correspond to CLIP's text_encoder, and that the entries under Zero Shot Classification correspond to its image_encoder?
If so, I can understand why the output of the text encoder (clip_text_encoder_vitb_16) has shape (Batch) x (Maximum Tokens) x (Embedding Size), i.e., 1 x 77 x 512, but I don't understand why the input is also 1 x 77 x 512. I would have expected the input to be a sequence of token IDs of shape 1 x 77, with the 512-dimensional embedding only appearing after the token-embedding layer inside the encoder.
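For reference, here is a minimal sketch of the shapes I would expect, assuming the exported model wraps the official openai/CLIP implementation (the model name "ViT-B/16" and the use of clip.tokenize / encode_text are my assumptions about what clip_text_encoder_vitb_16 corresponds to):

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

# Load the ViT-B/16 variant of CLIP (assumed to match clip_text_encoder_vitb_16).
model, _ = clip.load("ViT-B/16")

# Tokenization produces integer token IDs padded to the context length of 77,
# so the text-encoder input I would expect is (Batch) x (Maximum Tokens) = 1 x 77.
tokens = clip.tokenize(["a photo of a cat"])
print(tokens.shape)  # torch.Size([1, 77]), dtype int64

with torch.no_grad():
    # encode_text returns the pooled text embedding, shape (Batch) x (Embedding Size);
    # the 1 x 77 x 512 tensor only exists internally, after the token-embedding layer.
    text_features = model.encode_text(tokens)
print(text_features.shape)  # torch.Size([1, 512])
```

Given this, it is unclear to me where a 1 x 77 x 512 *input* would come from, unless the exported model expects the already-embedded token sequence rather than raw token IDs.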