Max tokens

#5
by hiranya911

Thanks for sharing this model with the community.

What's the max number of tokens that can be embedded with this? I noticed that it logs "max_seq_length 512" every time the model is loaded. Is that 512 characters?

NLP Group of The University of Hong Kong org

Thanks a lot for your interest in our INSTRUCTOR model!

Almost! The limit is measured in tokens rather than characters: by default, the maximum sequence length is 512 tokens. For changing the maximum sequence length, you may refer to https://github.com/HKUNLP/instructor-embedding/issues/12.
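For example, since INSTRUCTOR subclasses SentenceTransformer, one way to change the limit looks like the sketch below (the value 1024 is only illustrative; inputs longer than what the model was trained on may embed less reliably):

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
print(model.max_seq_length)  # 512 by default
model.max_seq_length = 1024  # illustrative value; raises the truncation limit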

Hope this helps! Feel free to add any further questions or comments!

Thanks for the link. That helped answer a number of questions I had.

What's the tokenizer I should use if I were to chunk a long text before generating embeddings? I skimmed through the code and found references to AutoTransformer and T5. So will something like the following work?

from transformers import T5Tokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

TOKENIZER = T5Tokenizer.from_pretrained('t5-large', model_max_length=512)
SPLITTER = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(TOKENIZER, chunk_size=512, chunk_overlap=0)
NLP Group of The University of Hong Kong org

Hi, thanks a lot for your comments!

The recommended tokenizer for calculating the sequence length would be the INSTRUCTOR tokenizer. For example:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large') # initialize the INSTRUCTOR tokenizer
text = "Hello, world!"
text_length = len(tokenizer(text))
print(text_length)
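
If you want to plug this tokenizer into the splitter from your earlier snippet, a minimal sketch (assuming langchain's RecursiveCharacterTextSplitter, as in your code; long_text is just a placeholder) would be:

from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=512, chunk_overlap=0
)
long_text = "Some long document text. " * 500  # placeholder input
chunks = splitter.split_text(long_text)  # each chunk stays within the 512-token limit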

Hope this helps! Feel free to add any further questions or comments!

Very glad I found this thread. Is there any way to easily turn on a truncation warning? I have text that I'm chunking, but it can have large variations in token length.

Hi, thanks a lot for your comments!

The recommended tokenizer for calculating the sequence length would be the INSTRUCTOR tokenizer. For example:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large') # initialize the INSTRUCTOR tokenizer
text = "Hello, world!"
text_length = len(tokenizer(text))
print(text_length)

Hope this helps! Feel free to add any further questions or comments!

Small fix:

text_length = len(tokenizer(text)['input_ids'])
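
(The fix is needed because the tokenizer returns a dict-like BatchEncoding, so calling len() on it counts its keys rather than the tokens.) As for the earlier question about truncation warnings, one option is a manual check before embedding. A minimal sketch, where warn_if_truncated is a hypothetical helper and model.max_seq_length comes from the SentenceTransformer base class:

import warnings
from transformers import AutoTokenizer
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')

def warn_if_truncated(text: str) -> None:
    # Hypothetical helper: count tokens without truncation and compare
    # against the model's limit before embedding.
    n_tokens = len(tokenizer(text, truncation=False)['input_ids'])
    if n_tokens > model.max_seq_length:
        warnings.warn(
            f"Input is {n_tokens} tokens; anything beyond "
            f"{model.max_seq_length} will be truncated."
        )

warn_if_truncated("some very long chunk " * 200)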
