
Max Length

#1
by dadlifejason - opened

What should I do if most of the sentences in my dataset are around 1000 in length, which is larger than 256?

If truncating the sentences results in a loss of meaning, maybe try summarizing the long sentences into shorter ones.

@rmahfuz Why not compute vectors for parts that contain at most 256 words each, then add up these vectors and normalize the result?
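A minimal sketch of that chunk-and-pool idea. The chunking and pooling are plain NumPy; the actual embedding call via sentence-transformers is only indicated in a comment (the model name is a placeholder, and the 256-token window is taken from the model card):

```python
import numpy as np

# Split a token sequence into windows that fit the model's limit.
def chunk(tokens, max_len=256):
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

# Combine per-chunk embeddings as suggested above: average them,
# then re-normalize to unit length.
def pooled_unit_vector(chunk_vecs):
    mean = np.asarray(chunk_vecs).mean(axis=0)
    return mean / np.linalg.norm(mean)

# With sentence-transformers this could look like (sketch, not run here):
#   model = SentenceTransformer("<this model>")
#   vecs = model.encode(chunks)        # one vector per chunk
#   doc_vec = pooled_unit_vector(vecs)
```

Note that naive averaging weights every chunk equally, so a short trailing chunk counts as much as a full one; whether that is acceptable depends on the task.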

I'm not sure you're all talking about the same unit! According to the model card, the model truncates all input to a maximum of 256 tokens. Not characters, not words, but tokens.

By the way, the context length listed on the MTEB Leaderboard (https://huggingface.co/spaces/mteb/leaderboard) seems to be wrong ("512"), at least for this model...

@Ilianos Aren't tokens words? If not, what is a token?

It says on the model card:

By default, input text longer than 256 word pieces is truncated.

The rest, I'll let you find out on the internet :)
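To make the word/token distinction concrete: WordPiece tokenizers (used by BERT-style models like this one) split rare words into sub-word pieces, so one word can become several tokens. A toy greedy longest-match tokenizer with a made-up vocabulary illustrates the mechanism:

```python
# Toy WordPiece-style tokenizer (greedy longest-match), to show why
# token counts differ from word counts. The tiny vocabulary is made up
# for this example; real vocabularies have ~30k entries.
VOCAB = {"token", "##ization", "##izer", "word", "piece", "##s"}

def wordpiece(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are prefixed
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary entry matched
    return pieces

print(wordpiece("tokenization"))  # → ['token', '##ization']
```

So "tokenization" is one word but two tokens, which is why a 1000-word input can blow well past the 256-token limit.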
