Technical Information

#9
by gsaivinay - opened

Hello, thanks for this model.

Could you please provide information about this model?

  • What is the maximum input sequence length for generating embeddings? Will this model be useful for document text with a sequence length of 1024?
  • What is the dimension of the output embedding vectors?
NLP Group of The University of Hong Kong org

Hi, Thanks a lot for your interest in the INSTRUCTOR model!

  1. By default, the maximum input length is 512 tokens, but the model should still be compatible with documents that have a sequence length of 1024.
  2. The dimension of embedding vectors is 768.

Feel free to add any further questions or comments.

Thank you very much for your reply.

I have a few thousand documents, and some of them can be as long as 2500+ tokens. If I split those longer documents into 512-token chunks, will this model be effective at retrieving them?

NLP Group of The University of Hong Kong org

Yes. As the model is trained with a maximum length of 512 tokens, it is expected to work better if long documents are split into shorter chunks.
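
For anyone wanting to try the splitting suggested above, here is a minimal sketch of cutting a long document into chunks of at most 512 tokens. The helper name `chunk_tokens` and its `overlap` parameter are hypothetical, and whitespace splitting is only a stand-in for the model's real tokenizer, so actual token counts will differ.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_len: int = 512, overlap: int = 0) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most max_len tokens.

    Hypothetical helper: `overlap` repeats that many tokens between
    neighbouring chunks; it defaults to 0 (no overlap).
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        # Stop once the current chunk already reaches the end of the document.
        if start + max_len >= len(tokens):
            break
    return chunks


# Assumption: whitespace tokens stand in for the model's subword tokens.
document = " ".join(f"word{i}" for i in range(2500))
chunks = chunk_tokens(document.split(), max_len=512)
print([len(c) for c in chunks])  # [512, 512, 512, 512, 452]
```

Each chunk would then be embedded separately. If an instruction is prepended to every chunk, a smaller `max_len` may be needed to leave room for it.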

Feel free to add any further questions or comments!

Is the text in the instruction counted towards the maximum number of tokens? For example, if the instruction has 12 tokens, is the maximum number of tokens in the text then 500?

NLP Group of The University of Hong Kong org

Yes, the instruction is included in the maximum length calculation.
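
The arithmetic from the question can be written as a small sketch. The function name `text_token_budget` is hypothetical, and the calculation assumes the 512-token limit is shared only by the instruction and the text. The thread does not say whether any special tokens added by the tokenizer also count, so the true budget may be slightly smaller.

```python
def text_token_budget(max_len: int = 512, instruction_tokens: int = 0) -> int:
    """Return how many tokens remain for the text once the instruction is counted.

    Hypothetical helper; ignores any special tokens the tokenizer may add.
    """
    remaining = max_len - instruction_tokens
    if remaining < 0:
        raise ValueError("instruction is longer than the maximum length")
    return remaining


# Example from the question: a 12-token instruction leaves 500 tokens for the text.
print(text_token_budget(max_len=512, instruction_tokens=12))  # 500
```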
