Add AutoTokenizer & Sentence Transformers support

#1
by tomaarsen HF staff - opened
Nomic AI org
edited Feb 1

Hello!

Pull Request overview

  • Add AutoTokenizer support
  • Add Sentence Transformers support
  • Update some README metadata

Details

AutoTokenizer support

I saved the bert-base-uncased tokenizer into this repository (but with model_max_length set to 8192), so you can now load it with:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1")

Sentence Transformers support

The model needed to accept a return_dict argument, but its value can be ignored, as ST only ever uses return_dict=False. I also added the files that ST requires.
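To illustrate the return_dict point, here is a minimal sketch rather than the actual model code. ToyEncoder and call_like_st are hypothetical names made up for this example; the only thing taken from the description above is that the caller always passes return_dict=False, so the argument has to be accepted but its value never matters.

```python
class ToyEncoder:
    """Hypothetical stand-in for a custom model's forward signature."""

    def forward(self, input_ids, attention_mask=None, return_dict=None):
        # return_dict is accepted so that a caller passing it does not hit a
        # TypeError, but its value is deliberately ignored.
        hidden = [[float(token_id)] for token_id in input_ids]  # fake per-token vectors
        return (hidden,)  # always a plain tuple, as with return_dict=False


def call_like_st(model, input_ids):
    # Hypothetical caller that mimics the behaviour described above.
    return model.forward(input_ids, return_dict=False)[0]


print(call_like_st(ToyEncoder(), [101, 2054, 102]))
```

Without the return_dict parameter in the signature, the call in call_like_st would raise a TypeError for an unexpected keyword argument.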

To experiment, feel free to run this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True, revision="pr/1")
sentences = ['What is TSNE?', 'Who is Laurens van der Maaten?']
embeddings = model.encode(sentences)
print(embeddings)

The revision argument makes this load the model from this PR branch. You'll see that the embeddings match the mean-pooled and normalized embeddings from the Transformers-based snippet.
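For reference, "mean pooled & normalized" can be sketched as follows. This is a dependency-free illustration on toy numbers, not the code from the README or from ST. It assumes that pooling averages only the non-padding tokens (attention mask of 1) and that normalization is L2. The names mean_pool and l2_normalize are made up for this example.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padding (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # avoid division by zero if every token is padding
    return [value / count for value in summed]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length."""
    norm = max(math.sqrt(sum(v * v for v in vector)), eps)
    return [v / norm for v in vector]


# Toy data: three token vectors, the last one is padding.
tokens = [[1.0, 2.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)   # mean of the first two vectors
embedding = l2_normalize(pooled)   # unit-length sentence embedding
print(embedding)
```

Comparing real outputs works the same way: apply both steps to the token embeddings from the Transformers model and check that the result is close to what model.encode returns.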

Metadata

The metadata tells Hugging Face that the model can be loaded with ST. Among other things, this adds a "Use with Sentence Transformers" button, which might boost the shareability of the model 💪

I also updated the README slightly. Feel free to make any suggestions or changes - it's your model after all :)

Note: The scarily large PR diff (60k lines) is because of the vocab.txt from the tokenizer.

  • Tom Aarsen
tomaarsen changed pull request status to open
zpn changed pull request status to merged
Nomic AI org

thank you!
