
Max Length

#1
by dadlifejason - opened

What should I do if most of the sentences in my dataset are around 1000 in length, which is larger than 256?

If truncating the sentences results in a loss of meaning, maybe try summarizing the long sentences into shorter ones.

@rmahfuz Why not compute vectors for parts that contain at most 256 words each, then add up these vectors and normalize the result?
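A minimal sketch of that chunk-and-pool idea. The chunking and pooling are plain NumPy; the actual embedding call via sentence-transformers is only indicated in a comment (the model name is a placeholder, and the 256-token window is taken from the model card):

```python
import numpy as np

# Split a token sequence into windows that fit the model's limit.
def chunk(tokens, max_len=256):
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

# Combine per-chunk embeddings as suggested above: average them,
# then re-normalize to unit length.
def pooled_unit_vector(chunk_vecs):
    mean = np.asarray(chunk_vecs).mean(axis=0)
    return mean / np.linalg.norm(mean)

# With sentence-transformers this could look like (sketch, not run here):
#   model = SentenceTransformer("<this model>")
#   vecs = model.encode(chunks)        # one vector per chunk
#   doc_vec = pooled_unit_vector(vecs)
```

Note that naive averaging weights every chunk equally, so a short trailing chunk counts as much as a full one; whether that is acceptable depends on the task.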

I'm not sure you're all talking about the same unit! According to the model card, the model truncates all input to a maximum of 256 tokens. Not characters, not words, but tokens.

By the way, the context length listed on the MTEB Leaderboard (https://huggingface.co/spaces/mteb/leaderboard) seems to be wrong ("512"), at least for this model...

@Ilianos Aren't tokens words? If not, what is a token?

It says on the model card:

By default, input text longer than 256 word pieces is truncated.

The rest, I'll let you find out on the internet :)
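To make the word/token distinction concrete: WordPiece tokenizers (used by BERT-style models like this one) split rare words into sub-word pieces, so one word can become several tokens. A toy greedy longest-match tokenizer with a made-up vocabulary illustrates the mechanism:

```python
# Toy WordPiece-style tokenizer (greedy longest-match), to show why
# token counts differ from word counts. The tiny vocabulary is made up
# for this example; real vocabularies have ~30k entries.
VOCAB = {"token", "##ization", "##izer", "word", "piece", "##s"}

def wordpiece(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are prefixed
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary entry matched
    return pieces

print(wordpiece("tokenization"))  # → ['token', '##ization']
```

So "tokenization" is one word but two tokens, which is why a 1000-word input can blow well past the 256-token limit.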
