I see that the input shape for the model is (1, 77, 640). However, when I run clip.tokenize(text), it only produces an output of shape (1, 77). What does the 640 in the model's input shape correspond to?
Hi @tarmily.wen,
The (1, 77) output that you get is the tokenized text (token IDs only), before embedding. The model expects the embedded tokens as input: each of the 77 token IDs is mapped to a 640-dimensional vector, which gives (1, 77, 640) for the resnet50x4 model.
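To illustrate the shape change, here is a minimal sketch of an embedding lookup in NumPy. It uses a random table as a stand-in for the learned token embeddings and random IDs as a stand-in for the clip.tokenize output, so the values are meaningless and only the shapes matter. The vocabulary size of 49408 is an assumption based on the CLIP BPE tokenizer. In the OpenAI CLIP PyTorch code the learned table should be the model's `token_embedding` layer; whether a positional embedding also has to be added before feeding your compiled model is something to check against how the model was exported.

```python
import numpy as np

CONTEXT_LENGTH = 77   # tokens per prompt, matches the (1, 77) tokenizer output
EMBED_DIM = 640       # text embedding width for resnet50x4, per the reply above
VOCAB_SIZE = 49408    # assumption: size of the CLIP BPE vocabulary

rng = np.random.default_rng(0)

# Stand-in for the learned token-embedding table.
# The real weights come from the CLIP checkpoint, not from random numbers.
embedding_table = rng.standard_normal((VOCAB_SIZE, EMBED_DIM)).astype(np.float32)

# Stand-in for clip.tokenize(text): integer token IDs of shape (1, 77).
token_ids = rng.integers(0, VOCAB_SIZE, size=(1, CONTEXT_LENGTH))

# Embedding lookup: every token ID selects one row of the table.
embedded = embedding_table[token_ids]

print(token_ids.shape)  # (1, 77)
print(embedded.shape)   # (1, 77, 640)
```

With the real model, you would replace the random table with the trained embedding weights and the random IDs with the actual tokenizer output.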