Maximum token size?

#11
by rasharab - opened

Can someone tell me what the maximum input token size is for the INSTRUCTOR model?
For ada, I believe it's 8k.

NLP Group of The University of Hong Kong org

The default maximum length for the INSTRUCTOR model is 512.

from InstructorEmbedding import INSTRUCTOR

# Load the pretrained INSTRUCTOR checkpoint.
model = INSTRUCTOR('hkunlp/instructor-large')

sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"

# encode() takes [instruction, text] pairs and returns one embedding per pair.
embeddings = model.encode([[instruction, sentence]])
print(embeddings)

Does this mean that the length of sentence must not exceed 512 characters?
If so, should sentence be cut into chunks of 512 tokens each?

NLP Group of The University of Hong Kong org

Yes, it is recommended to keep the input under 512 tokens; for long documents, you can split the text into chunks.

512 is "tokens," not "characters," right?

@jwatte Yes!

Language models have a token limit, and you should not exceed it. If your text is too long, split it into chunks; it is therefore a good idea to count the number of tokens first.
See: LangChain - Split by tokens
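For example, here is a minimal sketch of counting tokens and splitting a long text into 512-token chunks. It assumes the checkpoint's tokenizer can be loaded with Hugging Face's AutoTokenizer; the chunk size and the example text are illustrative:

from transformers import AutoTokenizer

# Assumption: the hkunlp/instructor-large repo ships a standard tokenizer config.
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')

def split_into_chunks(text, max_tokens=512):
    # Tokenize once, then slice the token ids into windows of max_tokens.
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for start in range(0, len(ids), max_tokens):
        window = ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
    return chunks

long_text = "..."  # placeholder: any document longer than 512 tokens
print(len(tokenizer.encode(long_text)))  # count tokens before encoding
chunks = split_into_chunks(long_text)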

Hello!

If I want to create one embedding for a longer document, what is the proposed way to do it?

Would it be to embed multiple chunks of 512 tokens and then take the average of the resulting embedding vectors?

See Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex.
This blog post explains how to determine the best chunk size using LlamaIndex's Response Evaluation module.
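For reference, the averaging approach asked about above could look like the following sketch. Mean pooling of chunk embeddings is one common heuristic, not an official recommendation of the INSTRUCTOR authors; the instruction string is illustrative, and split_into_chunks is the helper sketched earlier in this thread:

import numpy as np
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
instruction = "Represent the document for retrieval:"  # illustrative instruction

# Embed each chunk with the same instruction, then mean-pool the vectors.
pairs = [[instruction, chunk] for chunk in split_into_chunks(long_text)]
chunk_embeddings = model.encode(pairs)
doc_embedding = chunk_embeddings.mean(axis=0)
doc_embedding /= np.linalg.norm(doc_embedding)  # re-normalize after averaging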
