Sequence Length Setting for Sentence Transformer

#3
by dylanAtHum - opened

Hey,

First off, thanks for your work here.

I was testing out these SGPT models using the SentenceTransformer package and noticed that sentence_bert_config.json sets max_seq_length=300. This causes the tokenizer to truncate at 300 tokens, even though the model itself is intended to support a 2k sequence length. On GitHub it's suggested to load the model with AutoModel and AutoTokenizer, unpack the hidden states, and call the model through torch directly. Testing this gave me the full 2k sequence length as best I can tell, but it might be worthwhile to update sentence_bert_config.json just for ease of use.
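For reference, the torch-direct approach mentioned above can be sketched roughly as follows: run the model for its last hidden states, then pool them into one embedding. SGPT uses position-weighted mean pooling; the function name and standalone pooling code here are illustrative, not the repo's exact code, and the model name in the usage comments is an assumption.

```python
import torch

def weighted_mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Position-weighted mean pooling (SGPT-style): token i gets weight i+1."""
    # weights: (1, seq_len, 1), growing linearly with token position
    weights = torch.arange(1, hidden.shape[1] + 1, dtype=hidden.dtype, device=hidden.device)
    weights = weights.unsqueeze(0).unsqueeze(-1)
    mask = mask.unsqueeze(-1).to(hidden.dtype)     # (batch, seq_len, 1)
    w = weights * mask                             # zero out padding positions
    return (hidden * w).sum(dim=1) / w.sum(dim=1)  # (batch, hidden_dim)

# Usage sketch (model name is an assumption for illustration):
# from transformers import AutoModel, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("Muennighoff/SGPT-125M-weightedmean-nli-bitfit")
# model = AutoModel.from_pretrained("Muennighoff/SGPT-125M-weightedmean-nli-bitfit")
# batch = tokenizer(["some text"], padding=True, truncation=True,
#                   max_length=2048, return_tensors="pt")
# with torch.no_grad():
#     hidden = model(**batch).last_hidden_state
# embeddings = weighted_mean_pool(hidden, batch["attention_mask"])
```

Alternatively, when staying within the SentenceTransformer wrapper, setting `model.max_seq_length = 2048` at runtime raises the truncation limit without editing the config file on disk.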

Thanks for noting! The reason it's set to 300 in sentence_bert_config.json is that during finetuning all sentences were cut off at 300 tokens, so I'm not sure how the model performs beyond 300 tokens. For most tasks, 300 tokens is enough to get a sufficiently comprehensive embedding.
