Embeddings dimension

#34
by ggioetto - opened

Hello,
First, I want to thank you for this amazing project and for releasing the model weights.
I have a question.

If I try to run the example code

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)  # trust_remote_code is needed to use the encode method
embeddings = model.encode(
    ["How is the weather today?", "What is the current weather like today?"]
)

print(embeddings.shape)

I get an output with shape (2, 768), while on the MTEB leaderboard the embeddings dimension is 512.
Related to this: is there a reason why the pooling layer size is different from the hidden size of the model?
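
For reference, this is how I compared the two values (a quick sketch; it assumes the config exposes hidden_size, as is usual for BERT-style models):

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)
# The transformer's hidden size, which matches the 768-dim output above
print(model.config.hidden_size)  # 768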

Thanks in advance for the help

Jina AI org

hi @ggioetto thanks, good catch! I'll look into this, it is weird; the dimension should be 768, and I'm not sure how it ended up as 512 on MTEB.

Our small model outputs a dimension of 512, while the base model, due to its larger size, produces 768.
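
For example, a quick sanity check of both checkpoints (a sketch; it assumes both repos support the encode method via trust_remote_code):

from transformers import AutoModel

for repo in ("jinaai/jina-embeddings-v2-small-en", "jinaai/jina-embeddings-v2-base-en"):
    model = AutoModel.from_pretrained(repo, trust_remote_code=True)
    embeddings = model.encode(["How is the weather today?"])
    # Expected: small -> (1, 512), base -> (1, 768)
    print(repo, embeddings.shape)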

It's because your pooling config claims a word_embedding_dimension of 512. I'm not sure why; it doesn't appear to actually change the output to 512, but it does seem to affect whatever automated system records the model info.
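
You can see the claimed value directly. A minimal sketch, assuming the repo follows the usual sentence-transformers layout with a 1_Pooling/config.json file:

import json
from huggingface_hub import hf_hub_download

# Download the pooling config that the leaderboard metadata appears to read
path = hf_hub_download(
    repo_id="jinaai/jina-embeddings-v2-base-en",
    filename="1_Pooling/config.json",
)
with open(path) as f:
    pooling_cfg = json.load(f)
# Reported as 512 here, even though the model actually outputs 768
print(pooling_cfg["word_embedding_dimension"])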

bwang0911 changed discussion status to closed
