Simplify usage; integrate Sentence Transformers (+ LlamaIndex/LangChain, etc.)

#39
by tomaarsen - opened

Hello!

Pull Request overview

  • Add add_eos_token=True to the tokenizer configuration
  • Integrate with Sentence Transformers

Details

With a5e1612ae784f73eba634b00acf64db7d99ad7e9 we can simplify the usage with transformers: no more manual tokenization, EOS appending, and padding (which is also rather slow). The primary downside is that people currently using

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length - 1, return_attention_mask=False, padding=False, truncation=True)
# append eos_token_id to every input_ids
batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')

may now unexpectedly get two EOS tokens. In my opinion, however, this is not a big problem, because the padding token and the EOS token are identical. The only difference for those users is that each batch will contain one more padding token than strictly necessary.
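The double-EOS effect can be illustrated with a toy sketch in pure Python. The token ids and the `legacy_append_and_pad` helper below are illustrative stand-ins for the legacy snippet above, not a real tokenizer:

```python
EOS = 2  # assumed EOS id for illustration; for this model, EOS and pad are the same token

def legacy_append_and_pad(batch_ids, eos_id):
    # Legacy usage: manually append EOS to every sequence, then pad to the batch width
    batch_ids = [ids + [eos_id] for ids in batch_ids]
    width = max(len(ids) for ids in batch_ids)
    return [ids + [eos_id] * (width - len(ids)) for ids in batch_ids]

# With add_eos_token=True, the tokenizer output already ends in EOS:
tokenized = [[5, 6, EOS], [7, EOS]]
padded = legacy_append_and_pad(tokenized, EOS)
# Each sequence now ends in two EOS ids, i.e. one extra "pad" token per sequence
print(padded)  # [[5, 6, 2, 2], [7, 2, 2, 2]]
```

Since EOS and pad are the same token id here, the duplicated EOS behaves exactly like one additional padding token, which is why the legacy snippet keeps working.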

The upside is that we no longer have to "hack" our batch, and so we can much more cleanly integrate with other tools, such as Sentence Transformers, LlamaIndex, LangChain, Haystack, etc. I've done that in 90cddb89718db232ed6257f69b36e9e601120883. The usage then becomes:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
# In case you want to reduce the maximum sequence length:
model.max_seq_length = 4096

queries = [
    "how much protein should a female eat",
    "summit define",
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]

query_embeddings = model.encode(queries, prompt_name="web_search_query")
document_embeddings = model.encode(documents)

scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())

I've also added some of the prompts from unilm/e5/utils.py to the model configuration (config_sentence_transformers.json), so users can easily apply them out of the box via prompt_name.
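Conceptually, prompt_name resolves to a text prefix that is prepended to each input before encoding. The sketch below mimics that lookup in plain Python; the dictionary entry is illustrative (the real prompt strings live in config_sentence_transformers.json), and the task description is taken from the E5 prompt set:

```python
# Illustrative prompt registry; in practice this comes from config_sentence_transformers.json
prompts = {
    "web_search_query": (
        "Instruct: Given a web search query, retrieve relevant passages "
        "that answer the query\nQuery: "
    ),
}

def apply_prompt(text, prompt_name=None):
    # Prepend the named prompt if one is given; documents are encoded without a prompt
    prefix = prompts.get(prompt_name, "") if prompt_name else ""
    return prefix + text

print(apply_prompt("summit define", prompt_name="web_search_query"))
print(apply_prompt("summit define"))  # no prompt: documents are passed through unchanged
```

This mirrors the usage above, where queries are encoded with prompt_name="web_search_query" while documents are encoded with no prompt at all.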

Because various third party applications (LlamaIndex, LangChain, Haystack) rely directly on Sentence Transformers, this also allows users to apply this model via those tools.

  • Tom Aarsen
tomaarsen changed pull request status to open

Very much appreciate your contribution!

intfloat changed pull request status to merged

Thank you very much for your contribution. However, when I used Sentence Transformers for evaluation on BEIR, the results dropped by around 1-2 points. Is this behavior expected?
Thanks in advance.

Hello!

@intfloat used very specific prompts for the evaluation as described in https://huggingface.co/intfloat/e5-mistral-7b-instruct#faq. Perhaps you were using a general prompt or no prompt at all? See also https://github.com/microsoft/unilm/blob/9c0f1ff7ca53431fe47d2637dfe253643d94185b/e5/utils.py#L106

  • Tom Aarsen
