Maximum token size?

#11
by rasharab - opened

Can someone tell me what the maximum input token size is for the INSTRUCTOR model?
For ada, I believe it's 8k.

NLP Group of The University of Hong Kong org

The default maximum length for the INSTRUCTOR model is 512.

from InstructorEmbedding import INSTRUCTOR

# Load the pretrained INSTRUCTOR checkpoint.
model = INSTRUCTOR('hkunlp/instructor-large')

sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"

# encode() takes [instruction, text] pairs and returns one embedding per pair.
embeddings = model.encode([[instruction, sentence]])
print(embeddings)

Does this mean that the length of sentence must not exceed 512 characters?
If so, should sentence be cut into chunks of 512 tokens each?

NLP Group of The University of Hong Kong org

Yes, it is recommended to keep the input under 512 tokens; for long documents, you can split the text into chunks.

512 is "tokens," not "characters," right?

@jwatte Yes!

Language models have a token limit, and you should not exceed it. If your text is too long, split it into chunks; it is therefore a good idea to count the number of tokens first.
See: LangChain - Split by tokens
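For example, here is a minimal sketch of counting tokens and splitting a long text into 512-token chunks. It assumes the checkpoint's tokenizer can be loaded with Hugging Face's AutoTokenizer; the chunk size and the example text are illustrative:

from transformers import AutoTokenizer

# Assumption: the hkunlp/instructor-large repo ships a standard tokenizer config.
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')

def split_into_chunks(text, max_tokens=512):
    # Tokenize once, then slice the token ids into windows of max_tokens.
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for start in range(0, len(ids), max_tokens):
        window = ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
    return chunks

long_text = "..."  # placeholder: any document longer than 512 tokens
print(len(tokenizer.encode(long_text)))  # count tokens before encoding
chunks = split_into_chunks(long_text)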

Hello!

If I want to create one embedding for a longer document, what is the proposed way to do it?

Would it be to embed multiple chunks of 512 tokens and then take the average of the resulting embedding vectors?

See Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex.
This blog post explains how to determine the best chunk size using LlamaIndex's Response Evaluation module.
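For reference, the averaging approach asked about above could look like the following sketch. Mean pooling of chunk embeddings is one common heuristic, not an official recommendation of the INSTRUCTOR authors; the instruction string is illustrative, and split_into_chunks is the helper sketched earlier in this thread:

import numpy as np
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
instruction = "Represent the document for retrieval:"  # illustrative instruction

# Embed each chunk with the same instruction, then mean-pool the vectors.
pairs = [[instruction, chunk] for chunk in split_into_chunks(long_text)]
chunk_embeddings = model.encode(pairs)
doc_embedding = chunk_embeddings.mean(axis=0)
doc_embedding /= np.linalg.norm(doc_embedding)  # re-normalize after averaging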
