How to integrate this model with Sentence Transformers?

#31
by Nelson365487 - opened

According to the model card, this model uses last-token pooling. Since I want to use it with Sentence Transformers, I load the model with "sentence_transformers.models.Transformer" and then add a pooling layer, which I initialize with "sentence_transformers.models.Pooling(..., pooling_mode_mean_tokens=False, pooling_mode_lasttoken=True)".
Finally, I create the model with "model = SentenceTransformer(modules=[word_embedding_model, pooling_model])".
However, the embeddings from this custom model are very different from the ones I get by following the code in the model card.
Am I misunderstanding something about how to integrate this model with Sentence Transformers? For example, is the pooling layer implemented differently, which would lead to different embeddings?
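
For readers following along, the setup described above might look roughly like the sketch below. The model name is an assumption (the thread does not state it in the question itself), and the function name build_last_token_model is purely illustrative. Note that this sketch does not yet do anything about the EOS token.

```python
# Assumed model name; replace with the checkpoint you are actually using.
MODEL_NAME = "intfloat/e5-mistral-7b-instruct"


def build_last_token_model(model_name=MODEL_NAME):
    """Assemble a SentenceTransformer that uses last-token pooling.

    The import is done lazily so that defining this function does not
    require sentence-transformers to be installed.
    """
    from sentence_transformers import SentenceTransformer, models

    # Wrap the Hugging Face model as a Sentence Transformers module.
    word_embedding_model = models.Transformer(model_name)

    # Turn mean pooling off (it is on by default) and last-token pooling on.
    pooling_model = models.Pooling(
        word_embedding_model.get_word_embedding_dimension(),
        pooling_mode_mean_tokens=False,
        pooling_mode_lasttoken=True,
    )

    return SentenceTransformer(modules=[word_embedding_model, pooling_model])
```

Calling build_last_token_model() downloads the full model, so it is only worth running on a machine with enough memory for it.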

Can you provide a minimal code snippet that can reproduce your results?

One issue with integrating with SentenceTransformers is that the tokenizer has to add an EOS token to the end of each input. I believe SentenceTransformers does not handle this automatically.
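
To see why a missing EOS token matters with last-token pooling, here is a toy sketch in plain Python. It is not the library's implementation: the token ids, EOS_ID, and both helper functions are hypothetical, and the token ids double as stand-in "hidden states". It assumes right padding.

```python
EOS_ID = 2  # hypothetical EOS token id, for illustration only


def encode(token_ids, add_eos_token):
    """Toy tokenizer step: optionally append EOS to the input ids."""
    return token_ids + [EOS_ID] if add_eos_token else list(token_ids)


def last_token_pool(hidden_states, attention_mask):
    """Pick the state at the last attended position (right padding assumed)."""
    last_index = sum(attention_mask) - 1
    return hidden_states[last_index]


tokens = [11, 12, 13]
without_eos = encode(tokens, add_eos_token=False)
with_eos = encode(tokens, add_eos_token=True)

# Without EOS, pooling lands on the last content token; with EOS, on EOS.
pooled_without = last_token_pool(without_eos, [1] * len(without_eos))
pooled_with = last_token_pool(with_eos, [1] * len(with_eos))
```

In a real model the pooled vector comes from a different position depending on whether EOS was appended, which would explain embeddings that differ from the model card's.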

sentence-transformers should already have added support for the EOS token.
See https://huggingface.co/Salesforce/SFR-Embedding-Mistral/discussions/1.

I have tried the merged configs in Salesforce/SFR-Embedding-Mistral, and they should work.
Hope to see it in intfloat/e5-mistral-7b-instruct!

Thanks for your replies, @intfloat @Jonathan0528. I checked add_eos_token on the tokenizer after loading the model with SentenceTransformers and, just as @intfloat said, the tokenizer does not add the EOS token automatically. The discrepancy with what @Jonathan0528 said might be due to my SentenceTransformers version: I have 2.2.2 installed, which I think is quite old. After setting add_eos_token=True and rerunning the example, everything works. Thanks again, @intfloat @Jonathan0528.

Nelson365487 changed discussion status to closed

@Nelson365487 Where did you set add_eos_token=True if @Jonathan0528's solution did not work?

@woofadu, maybe you can try passing arguments via tokenizer_args when initializing sentence_transformers.models.Transformer, or try modifying the tokenizer after initialization.
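
A minimal sketch of both options follows. The function names are hypothetical, and whether add_eos_token is honored depends on the tokenizer class (it is an option of Llama/Mistral-style tokenizers in transformers), so treat this as an assumption to verify rather than a guaranteed fix.

```python
from types import SimpleNamespace


def load_transformer_with_eos(model_name):
    """Option 1: pass add_eos_token through tokenizer_args at load time."""
    from sentence_transformers import models  # lazy import

    # tokenizer_args is forwarded to the tokenizer's from_pretrained call.
    return models.Transformer(
        model_name, tokenizer_args={"add_eos_token": True}
    )


def enable_eos_after_init(word_embedding_model):
    """Option 2: flip the flag on an already-loaded Transformer module."""
    word_embedding_model.tokenizer.add_eos_token = True
    return word_embedding_model


# Stand-in object so option 2 can be checked without downloading a model.
fake_module = SimpleNamespace(tokenizer=SimpleNamespace(add_eos_token=False))
enable_eos_after_init(fake_module)
```

Either way, it is worth tokenizing a short string afterwards and confirming that the last input id is the tokenizer's eos_token_id.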

@Nelson365487 modifying after initialization worked. Thank you
