zpn MaxNomic committed on
Commit
b53d557
1 Parent(s): 1d03a35

remove details about v1 from other checkpoint (#4)


- remove details about v1 from other checkpoint (869be4070611ad5b66a9349cdcfd72040ac5813e)


Co-authored-by: Max Cembalest <MaxNomic@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +2 -102
README.md CHANGED
@@ -2612,110 +2612,10 @@ model-index:
2612
  # nomic-embed-text-v1-unsupervised: A Reproducible Long Context (8192) Text Embedder
2613
 
2614
  `nomic-embed-text-v1-unsupervised` is an 8192 context length text encoder. This is the checkpoint taken after the contrastive pretraining stage of the multi-stage contrastive training of the
2615
- [final model](https://huggingface.co/nomic-ai/nomic-embed-text-v1). If you want to extract embeddings, we suggest using [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1)
2616
- .
2617
 
 
2618
 
2619
- | Name | SeqLen | MTEB | LoCo | Jina Long Context | Open Weights | Open Training Code | Open Data |
2620
- | :-------------------------------:| :----- | :-------- | :------: | :---------------: | :-----------: | :----------------: | :---------- |
2621
- | nomic-embed-text-v1 | 8192 | **62.39** |**85.53** | 54.16 | ✅ | ✅ | ✅ |
2622
- | jina-embeddings-v2-base-en | 8192 | 60.39 | 85.45 | 51.90 | ✅ | ❌ | ❌ |
2623
- | text-embedding-3-small | 8191 | 62.26 | 82.40 | **58.20** | ❌ | ❌ | ❌ |
2624
- | text-embedding-ada-002 | 8191 | 60.99 | 52.7 | 55.25 | ❌ | ❌ | ❌ |
2625
-
2626
-
2627
- If you would like to finetune a model on more data, you can use this model as an initialization
2628
-
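As a minimal sketch of that initialization path, assuming the Hugging Face `transformers` loading pattern shown later in this card, you might start from this checkpoint roughly like so. The optimizer choice, learning rate, toy cosine objective, and example texts below are illustrative assumptions, not the recipe from the `contrastors` repository:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Start from the unsupervised checkpoint as the initialization for further finetuning.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)
model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True)
model.train()

# Illustrative settings (assumption); the actual finetuning recipe lives in the contrastors repository.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Toy paired batch (assumption) using the task prefixes the model expects.
queries = ['search_query: What is TSNE?']
documents = ['search_document: t-SNE is a dimensionality reduction technique.']
q = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
d = tokenizer(documents, padding=True, truncation=True, return_tensors='pt')

# One illustrative update: pull each query embedding toward its paired document embedding.
q_emb = F.normalize(model(**q)[0].mean(dim=1), p=2, dim=1)
d_emb = F.normalize(model(**d)[0].mean(dim=1), p=2, dim=1)
loss = (1 - (q_emb * d_emb).sum(dim=1)).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```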
2629
- ## Hosted Inference API
2630
-
2631
- The easiest way to get started with Nomic Embed is through the Nomic Embedding API.
2632
-
2633
- Generating embeddings with the `nomic` Python client is as easy as
2634
-
2635
- ```python
2636
- from nomic import embed
2637
-
2638
- output = embed.text(
2639
- texts=['Nomic Embedding API', '#keepAIOpen'],
2640
- model='nomic-embed-text-v1',
2641
- task_type='search_document'
2642
- )
2643
-
2644
- print(output)
2645
- ```
2646
-
2647
- For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text)
2648
-
2649
- ## Data Visualization
2650
- Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!
2651
-
2652
-
2653
- [![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/pjhJhuNyRfPagRd_c_iUz.webp)](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)
2654
-
2655
-
2656
- ## Training Details
2657
-
2658
- We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048),
2659
- the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
2660
-
2661
- In the second finetuning stage, higher quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining is crucial in this stage.
2662
-
2663
- For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1).
2664
-
2665
- Training data to train the models is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors)
2666
-
2667
- ## Usage
2668
-
2669
- Note `nomic-embed-text` requires prefixes! We support the prefixes `[search_query, search_document, classification, clustering]`.
2670
- For retrieval applications, you should prepend `search_document` for all your documents and `search_query` for your queries.
2671
-
2672
- ### Sentence Transformers
2673
- ```python
2674
- from sentence_transformers import SentenceTransformer
2675
-
2676
- model = SentenceTransformer("nomic-ai/nomic-embed-text-v1-unsupervised", trust_remote_code=True)
2677
- sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
2678
- embeddings = model.encode(sentences)
2679
- print(embeddings)
2680
- ```
2681
-
2682
- ### Transformers
2683
- ```python
2684
- import torch
2685
- import torch.nn.functional as F
2686
- from transformers import AutoTokenizer, AutoModel
2687
-
2688
- def mean_pooling(model_output, attention_mask):
2689
- token_embeddings = model_output[0]
2690
- input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
2691
- return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
2692
-
2693
- sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
2694
-
2695
- tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
2696
- model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True)
2697
- model.eval()
2698
-
2699
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
2700
-
2701
- with torch.no_grad():
2702
- model_output = model(**encoded_input)
2703
-
2704
- embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
2705
- embeddings = F.normalize(embeddings, p=2, dim=1)
2706
- print(embeddings)
2707
- ```
2708
-
2709
- The model natively supports scaling of the sequence length past 2048 tokens. To do so,
2710
-
2711
- ```diff
2712
- - tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
2713
- + tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)
2714
-
2715
-
2716
- - model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True)
2717
- + model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True, rotary_scaling_factor=2)
2718
- ```
2719
 
2720
  # Join the Nomic Community
2721
 
 
2612
  # nomic-embed-text-v1-unsupervised: A Reproducible Long Context (8192) Text Embedder
2613
 
2614
  `nomic-embed-text-v1-unsupervised` is an 8192 context length text encoder. This is the checkpoint taken after the contrastive pretraining stage of the multi-stage contrastive training of the
2615
+ [final model](https://huggingface.co/nomic-ai/nomic-embed-text-v1). The purpose of releasing this checkpoint is to open-source the training artifacts from our Nomic Embed Text technical report, available [here](https://arxiv.org/pdf/2402.01613).
 
2616
 
2617
+ If you want to use a model to extract embeddings, we suggest using [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).
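As a quick sketch of that recommended path, assuming `sentence-transformers` is installed and using the task prefixes the nomic-embed-text models expect (the example texts are placeholders):

```python
from sentence_transformers import SentenceTransformer

# Use the finetuned nomic-embed-text-v1 model for embedding extraction.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# nomic-embed-text models require task prefixes: search_document for corpus
# passages and search_query for queries (classification and clustering also exist).
documents = ["search_document: Nomic Embed is a long-context open source text embedder."]
queries = ["search_query: Which text embedding models are fully open source?"]

doc_embeddings = model.encode(documents)
query_embeddings = model.encode(queries)
print(doc_embeddings.shape, query_embeddings.shape)
```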
2618
 
2619
 
2620
  # Join the Nomic Community
2621