zpn committed
Commit a9e759d
1 Parent(s): 14af554

Update README.md

Files changed (1)
  1. README.md +68 -1
README.md CHANGED
@@ -2596,4 +2596,71 @@ model-index:
    - type: max_f1
      value: 78.79039565086076
- ---
+ ---
+
+ # nomic-embed-text-v1-ablated: A Reproducible Long Context (8192) Text Embedder
+
+ `nomic-embed-text-v1-ablated` is an 8192 context length text encoder that surpasses OpenAI text-embedding-ada-002 performance on both short and long context tasks.
+
+
+ | Name | SeqLen | MTEB | LoCo | Jina Long Context | Open Weights | Open Training Code | Open Data |
+ | :-------------------------------:| :----- | :-------- | :------: | :---------------: | :-----------: | :----------------: | :---------- |
+ | nomic-embed-text-v1 | 8192 | **62.39** | **85.53** | 54.16 | ✅ | ✅ | ✅ |
+ | jina-embeddings-v2-base-en | 8192 | 60.39 | 85.45 | 51.90 | ✅ | ❌ | ❌ |
+ | text-embedding-3-small | 8191 | 62.26 | 82.40 | **58.20** | ❌ | ❌ | ❌ |
+ | text-embedding-ada-002 | 8191 | 60.99 | 52.7 | 55.25 | ❌ | ❌ | ❌ |
+
+
+ If you would like to finetune a model on more data, you can use this model as an initialization, for example as sketched below.
+
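As a rough, hypothetical sketch of that starting point (not an official recipe; the actual training code lives in the `contrastors` repository linked below), the checkpoint can be loaded with `AutoModel` exactly as in the Usage section and then trained further. The repo id is assumed from this model card.

```python
from transformers import AutoModel

# Hypothetical: load this card's checkpoint as the initialization for further finetuning.
# The repo id below is assumed from this model card; substitute your own copy if needed.
model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-ablated', trust_remote_code=True)
model.train()  # parameters stay trainable, so the model plugs into a standard training loop
```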
+ ## Training Details
+
+ We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048),
+ the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
+
+ In the second finetuning stage, higher-quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.
+
+ For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf).
+
+ The training data used to train the models is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
+
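The exact objectives, batching, and hyperparameters are described in the technical report and implemented in `contrastors`. Purely as an illustration of the contrastive idea, the sketch below shows a generic InfoNCE-style loss over paired embeddings with in-batch negatives; the function name `info_nce_loss` and the `temperature` value are placeholders, not values taken from the actual training setup.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """Illustrative in-batch-negatives contrastive loss.

    query_emb, doc_emb: (batch, dim) pooled embeddings, paired row-for-row.
    """
    query_emb = F.normalize(query_emb, p=2, dim=1)
    doc_emb = F.normalize(doc_emb, p=2, dim=1)
    logits = query_emb @ doc_emb.T / temperature                   # (batch, batch) similarities
    labels = torch.arange(logits.size(0), device=logits.device)    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```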
+ ## Usage
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoTokenizer, AutoModel
+
+ def mean_pooling(model_output, attention_mask):
+     # Average the token embeddings, masking out padding positions.
+     token_embeddings = model_output[0]
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+ sentences = ['What is TSNE?', 'Who is Laurens van der Maaten?']
+
+ tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
+ model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True)
+
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+
+ embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+ embeddings = F.normalize(embeddings, p=2, dim=1)
+ print(embeddings)
+ ```
+
+ The model natively supports scaling of the sequence length past 2048 tokens. To do so, apply the following two changes when loading the tokenizer and model:
+
+ ```diff
+ - tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
+ + tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)
+
+ - model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True)
+ + model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True, rotary_scaling_factor=2)
+ ```
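Putting the two changes together, a minimal sketch of the long-context loading step is shown below; the 8192-token limit and `rotary_scaling_factor=2` are taken directly from the diff above, and everything after loading proceeds exactly as in the Usage snippet.

```python
from transformers import AutoTokenizer, AutoModel

# Long-context configuration: raise the tokenizer limit to 8192 tokens and
# enable rotary position scaling when loading the model, as in the diff above.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)
model = AutoModel.from_pretrained(
    'nomic-ai/nomic-embed-text-v1-unsupervised',
    trust_remote_code=True,
    rotary_scaling_factor=2,
)
```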