"max_seq_length": 512 in sentence_bert_config.json

#54

SentenceTransformer silently truncates input sentences to 256 tokens, without displaying any message to the user. Setting "max_seq_length": 512 in sentence_bert_config.json would allow SentenceTransformer to use the model's full context length.
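For context, the same override is also possible at runtime, without editing the config file. A minimal sketch (the truncation warning at the end is a hypothetical addition, not something the library prints itself):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.max_seq_length)  # 256, read from sentence_bert_config.json

# Override the limit at runtime instead of editing the config file
model.max_seq_length = 512

# Hypothetical check: surface the truncation that is otherwise silent
text = "a very long document " * 100
n_tokens = len(model.tokenizer(text)["input_ids"])
if n_tokens > model.max_seq_length:
    print(f"Input of {n_tokens} tokens will be truncated to {model.max_seq_length}.")

embedding = model.encode(text)
```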

Sentence Transformers org

Hello!

The choice of a max_seq_length of 256 is very deliberate. The reasoning is that 256 roughly matches the sequence lengths in the training data, and embedding models in particular do not "automatically" generalize to longer sequences without being trained on sequences of that size. To demonstrate this, I took the TRECCovid benchmark, whose texts exceed 256 tokens by a good margin, and evaluated both the "normal" all-MiniLM-L6-v2 and one with a sequence length of 512.
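Something like the following sketch reproduces that comparison, assuming the MTEB library and its "TRECCOVID" task name (the output folder names are arbitrary):

```python
from sentence_transformers import SentenceTransformer
from mteb import MTEB

for seq_len in (256, 512):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    model.max_seq_length = seq_len  # 256 is the shipped default

    # TRECCovid is a retrieval task, scored with NDCG@10
    evaluation = MTEB(tasks=["TRECCOVID"])
    evaluation.run(model, output_folder=f"results/all-MiniLM-L6-v2-{seq_len}")
```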

As expected, the model with a sequence length of 512 took almost twice as long, since it had more tokens to process, and these were the results:

  • all-MiniLM-L6-v2 with "max_seq_length": 256: 0.47246 NDCG@10 (higher is better)
  • all-MiniLM-L6-v2 with "max_seq_length": 512: 0.46083 NDCG@10 (higher is better)

In short, by extending the maximum sequence length beyond the training size, you are shooting yourself in the foot: you get a slower model that also performs worse. The intuition is simply that these models have not been trained to perform well on sequences that exceed 256 tokens. In fact, it would not surprise me if a shorter sequence length actually gave superior performance, as I suspect 256 is already a lot higher than the lengths the model was actually trained on. The highest performance might be reached around the sequence length the model most often saw during training. The results below (produced with the sweep sketched after the list) point in that direction:

  • all-MiniLM-L6-v2 with "max_seq_length": 128: 0.51262 NDCG@10 (higher is better)
  • all-MiniLM-L6-v2 with "max_seq_length": 64: 0.55458 NDCG@10 (higher is better)
  • all-MiniLM-L6-v2 with "max_seq_length": 32: 0.59725 NDCG@10 (higher is better)
  • all-MiniLM-L6-v2 with "max_seq_length": 16: 0.53039 NDCG@10 (higher is better)
  • all-MiniLM-L6-v2 with "max_seq_length": 8: 0.25168 NDCG@10 (higher is better)
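The numbers above come from the same setup, i.e. the earlier MTEB sketch extended to smaller values:

```python
# the same hypothetical MTEB loop as above, over smaller sequence lengths
for seq_len in (128, 64, 32, 16, 8):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    model.max_seq_length = seq_len
    MTEB(tasks=["TRECCOVID"]).run(model, output_folder=f"results/all-MiniLM-L6-v2-{seq_len}")
```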

Obviously, going as low as 8 or 16 is unreasonable: you'll cut off the query and very important parts of the corpus. But the takeaway is that longer is not always better. These models can already grasp the topic from far fewer tokens, and extra tokens may just confuse them.

  • Tom Aarsen
tomaarsen changed pull request status to closed
