Max Length
What should I do if most of the sentences in my dataset have a length of around 1000, which is much larger than 256?
If truncating the sentences results in a loss of meaning, you could try summarizing the long sentences into shorter ones, as in the sketch below.
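Here is a minimal sketch of that idea using the transformers summarization pipeline; the summarization model name is just an example, so swap in whatever summarizer fits your data:

```python
# A minimal sketch of summarizing long sentences before embedding them.
# The summarization model below is only an example choice.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

long_text = "Paste one of your ~1000-length sentences here."  # placeholder
result = summarizer(long_text, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])  # a shorter version to embed instead
```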
I'm not sure you're all talking about the same unit! According to the model card, the model truncates all input to a maximum of 256 tokens. Not characters, not words, but tokens.
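If you want to check what that means for your data, here is a quick sketch, assuming a sentence-transformers model (the model name is just an example, since the thread doesn't name one):

```python
# Count *tokens* (not characters or words) the way the model does.
# The model name is an example; use the model this thread is about.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print("truncation limit:", model.max_seq_length)  # e.g. 256

sentence = "Sentence embeddings map text to fixed-size vectors."
tokens = model.tokenizer.tokenize(sentence)
print(len(sentence.split()), "words ->", len(tokens), "tokens")
```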
By the way, on the MTEB Leaderboard (https://huggingface.co/spaces/mteb/leaderboard), the listed context length ("512") seems to be wrong, at least for this model...
@Ilianos Aren't tokens words? If not, what is a token?
Tokens are the smaller chunks that words are split into. For example, "playing" can be tokenized into "play" and "ing": one word becomes two tokens. To learn more, search for byte-pair encoding (BPE).
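You can see this for yourself with any BPE tokenizer; here is a small sketch (GPT-2's tokenizer is just a convenient example, and the exact splits depend on the tokenizer):

```python
# Demo of subword tokenization with a BPE tokenizer (GPT-2's, as an
# example). Exact splits vary from tokenizer to tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
for word in ["playing", "tokenization", "embeddings"]:
    print(word, "->", tokenizer.tokenize(word))
# Frequent words may survive as a single token, while rarer words are
# split into several pieces, so token counts usually exceed word counts.
```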