Max Length

#1
by dadlifejason - opened

What should I do if most of the sentences in my dataset are around 1000 tokens long, which is larger than 256?

If truncating the sentences results in loss of meaning, maybe try summarizing the long sentences into shorter ones.

@rmahfuz Why not compute vectors for chunks that contain at most 256 words each, then add up these vectors and normalize the result?
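A minimal sketch of that chunk-and-pool idea. The `embed` function here is a hypothetical stand-in for a real sentence-embedding model (stubbed with random vectors purely for illustration); in practice you would pass each chunk through the actual model:

```python
import numpy as np

MAX_TOKENS = 256  # the truncation limit discussed in this thread

def chunk_tokens(tokens, size=MAX_TOKENS):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def pooled_embedding(chunks, embed):
    """Embed each chunk, add up the vectors, and L2-normalize the sum."""
    summed = np.sum([embed(c) for c in chunks], axis=0)
    return summed / np.linalg.norm(summed)

# Illustration only: a deterministic stand-in for a real embedding model.
def fake_embed(chunk, dim=8):
    rng = np.random.default_rng(len(chunk))
    return rng.standard_normal(dim)

tokens = ["tok"] * 1000                       # a ~1000-token "sentence"
chunks = chunk_tokens(tokens)                 # 256 + 256 + 256 + 232 tokens
vector = pooled_embedding(chunks, fake_embed) # one unit-length vector
```

Note that summing and then normalizing gives the same direction as averaging and then normalizing, so this is equivalent to mean-pooling the chunk embeddings.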

I'm not sure if you're all talking about the same unit! According to the model card, the model truncates all input to a maximum of 256 tokens. Not characters, not words, but tokens.

By the way, on the MTEB Leaderboard (https://huggingface.co/spaces/mteb/leaderboard), this information about the context length seems to be wrong ("512"), at least for this model...

@Ilianos Aren't tokens words? If not, what is a token?


It says on the model card:

By default, input text longer than 256 word pieces is truncated.

The rest, I'll let you find out on the internet :)

@Ilianos Aren't tokens words? If not, what is a token?

Tokens are how words are split into smaller chunks. For example, "playing" can be tokenized into "play" and "ing": one word becomes two tokens. To learn more, search for Byte-Pair Encoding (BPE) tokenization.
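As a toy illustration of that splitting, here is a greedy longest-match tokenizer over a tiny made-up vocabulary. Real BPE/WordPiece vocabularies are learned from data; the `##` prefix marking a continuation piece follows the BERT-style WordPiece convention:

```python
# Toy word-piece vocabulary (hypothetical; real vocabularies have ~30k entries).
VOCAB = {"play", "##ing", "token", "##ize", "##d"}

def wordpiece_tokenize(word, vocab):
    """Greedily match the longest known piece from left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are prefixed
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matches: unknown token
        pieces.append(match)
        start = end
    return pieces
```

So `wordpiece_tokenize("playing", VOCAB)` yields `["play", "##ing"]`, one word split into two tokens, which is why a 1000-word sentence can easily exceed 256 tokens.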
