Files changed (1)
  1. README.md +55 -55
README.md CHANGED
@@ -2609,63 +2609,8 @@ language:
 
  # nomic-embed-text-v1.5: Resizable Production Embeddings with Matryoshka Representation Learning
 
- `nomic-embed-text-v1.5` is an improvement upon [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1) that utilizes [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) which gives developers the flexibility to trade off the embedding size for a negligible reduction in performance.
-
-
-
- | Name | SeqLen | Dimension | MTEB |
- | :-------------------------------:| :----- | :-------- | :------: |
- | nomic-embed-text-v1 | 8192 | 768 | **62.39** |
- | nomic-embed-text-v1.5 | 8192 | 768 | 62.28 |
- | nomic-embed-text-v1.5 | 8192 | 512 | 61.96 |
- | nomic-embed-text-v1.5 | 8192 | 256 | 61.04 |
- | nomic-embed-text-v1.5 | 8192 | 128 | 59.34 |
- | nomic-embed-text-v1.5 | 8192 | 64 | 56.10 |
-
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/CRnaHV-c2wMUMZKw72q85.png)
-
  **Exciting Update!**: `nomic-embed-text-v1.5` is now multimodal! [nomic-embed-vision-v1.5](https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5) is aligned to the embedding space of `nomic-embed-text-v1.5`, meaning any text embedding is multimodal!
 
-
- ## Hosted Inference API
-
- The easiest way to get started with Nomic Embed is through the Nomic Embedding API.
-
- Generating embeddings with the `nomic` Python client is as easy as
-
- ```python
- from nomic import embed
-
- output = embed.text(
-     texts=['Nomic Embedding API', '#keepAIOpen'],
-     model='nomic-embed-text-v1.5',
-     task_type='search_document',
-     dimensionality=256,
- )
-
- print(output)
- ```
-
- For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text)
-
- ## Data Visualization
- Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!
-
-
- [![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/pjhJhuNyRfPagRd_c_iUz.webp)](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)
-
- ## Training Details
-
- We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048),
- the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
-
- In the second finetuning stage, higher quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining is crucial in this stage.
-
- For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-matryoshka).
-
- Training data to train the models is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors)
-
  ## Usage
 
  **Important**: the text prompt *must* include a *task instruction prefix*, instructing the model which task is being performed.
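
For example, for retrieval the document side and the query side take different prefixes. A minimal sketch of prepending them (the `with_prefix` helper is purely illustrative and not part of any Nomic library):

```python
# Prepend the task instruction prefix expected by Nomic Embed.
# search_document / search_query are the retrieval prefixes; clustering and
# classification are the other task types.
def with_prefix(texts, task="search_document"):
    return [f"{task}: {t}" for t in texts]

docs = with_prefix(["The Eiffel Tower is located in Paris."])
queries = with_prefix(["Where is the Eiffel Tower?"], task="search_query")
```
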
@@ -2818,6 +2763,61 @@ embeddings = layer_norm(embeddings, [embeddings.dims[1]])
  console.log(embeddings.tolist());
  ```
 
+
+ ## Nomic API
+
+ The easiest way to use Nomic Embed is through the Nomic Embedding API.
+
+ Generating embeddings with the `nomic` Python client is as easy as:
+
+ ```python
+ from nomic import embed
+
+ output = embed.text(
+     texts=['Nomic Embedding API', '#keepAIOpen'],
+     model='nomic-embed-text-v1.5',
+     task_type='search_document',
+     dimensionality=256,
+ )
+
+ print(output)
+ ```
+
+ For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text).
+
+
+ ## Adjusting Dimensionality
+
+ `nomic-embed-text-v1.5` is an improvement upon [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1) that utilizes [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which gives developers the flexibility to trade off the embedding size for a negligible reduction in performance.
+
+
+ | Name | SeqLen | Dimension | MTEB |
+ | :-------------------------------:| :----- | :-------- | :------: |
+ | nomic-embed-text-v1 | 8192 | 768 | **62.39** |
+ | nomic-embed-text-v1.5 | 8192 | 768 | 62.28 |
+ | nomic-embed-text-v1.5 | 8192 | 512 | 61.96 |
+ | nomic-embed-text-v1.5 | 8192 | 256 | 61.04 |
+ | nomic-embed-text-v1.5 | 8192 | 128 | 59.34 |
+ | nomic-embed-text-v1.5 | 8192 | 64 | 56.10 |
+
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/CRnaHV-c2wMUMZKw72q85.png)
+
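
The truncation itself can be done after the fact on full-size embeddings. A minimal sketch in PyTorch, mirroring the layer-norm, truncate, and re-normalize steps shown in the model's usage snippets (e.g. the `layer_norm` call in the JS example above); the random tensor is only a stand-in for real pooled model outputs, to keep the example self-contained:

```python
import torch
import torch.nn.functional as F

# Resize a full 768-dimensional nomic-embed-text-v1.5 embedding to a smaller
# Matryoshka dimension: layer-norm, truncate, then L2-normalize.
matryoshka_dim = 256

embeddings = torch.randn(2, 768)  # placeholder for real pooled model outputs
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :matryoshka_dim]
embeddings = F.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # torch.Size([2, 256])
```
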
+ ## Training
+
+ Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!
+
+ [![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/pjhJhuNyRfPagRd_c_iUz.webp)](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)
+
+ We train our embedder with a multi-stage pipeline that starts from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048). The first, unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
+
+ The second, finetuning stage leverages higher-quality labeled datasets, such as search queries and answers from web searches. Data curation and hard-example mining are crucial in this stage.
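
The contrastive stages optimize an InfoNCE-style objective over paired texts with in-batch negatives. A minimal sketch of that loss (illustrative only; the temperature, batching, and exact formulation used in `contrastors` may differ):

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, doc_emb, temperature=0.05):
    """In-batch-negative contrastive loss over paired embeddings.

    Row i of query_emb and doc_emb form a positive pair; every other row in
    the batch acts as a negative. Hyperparameters here are illustrative.
    """
    q = F.normalize(query_emb, dim=1)
    d = F.normalize(doc_emb, dim=1)
    logits = q @ d.T / temperature       # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```
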
+
+ For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-matryoshka).
+
+ The training data is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
+
+
  # Join the Nomic Community
 
  - Nomic: [https://nomic.ai](https://nomic.ai)