Max Input Length Documentation

by sondalex

Hi, the repository README mentions:

By default, input text longer than 128 word pieces is truncated.

However, the max_seq_length attribute in sentence_transformers reports 512:

from sentence_transformers import SentenceTransformer
model_st = SentenceTransformer('all-mpnet-base-v1')
model_st.max_seq_length
# 512

The same value is returned with the Hugging Face Transformers approach:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v1')
tokenizer.model_max_length
# 512
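
For what it's worth, the effective truncation can also be checked at runtime by tokenizing an over-long input (a minimal sketch; with truncation=True and no explicit max_length, the tokenizer should fall back to model_max_length):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v1')
# An input far longer than either candidate limit (128 or 512 word pieces)
ids = tokenizer("word " * 1000, truncation=True)["input_ids"]
len(ids)
# 512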

Shouldn't the README be updated from 128 to 512?


Output of pip freeze:

...
sentence-transformers==2.2.2
huggingface-hub==0.10.1
transformers==4.23.1
torch==1.12.1
...

I have the same question! I'm looking to embed text up to the maximum sequence length of 512. Am I right in assuming it won't be truncated at 128, despite what the README says?
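
If it helps, one option is to not rely on the default at all and set the limit explicitly; max_seq_length is a settable property on SentenceTransformer (a sketch, assuming the library versions listed above):

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v1')
model.max_seq_length = 512  # pin the limit explicitly instead of relying on the loaded default
# Tokenizing an over-long input shows where truncation actually happens
features = model.tokenize(["word " * 1000])
features['input_ids'].shape
# torch.Size([1, 512])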

That's a great observation; thank you for posting this.
