embeddings = embeddings[:, :matryoshka_dim]  # truncate to the target Matryoshka dimensionality
embeddings = F.normalize(embeddings, p=2, dim=1)  # re-normalize so the truncated vectors are unit length
print(embeddings)
```
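Since the embeddings are re-normalized after truncation, cosine similarity between them reduces to a plain dot product. A small usage sketch continuing from the snippet above (it assumes `embeddings` holds more than one row, i.e. several input sentences were encoded):

```python
# For unit-length vectors, cosine similarity is just the dot product.
similarity = embeddings @ embeddings.T
print(similarity)
```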
## Training

Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!

[![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/pjhJhuNyRfPagRd_c_iUz.webp)](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)

We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048), the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
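Concretely, contrastive stages like this are typically trained with an InfoNCE-style objective over (query, document) pairs, using the other pairs in the batch as negatives. The following is a minimal sketch of that objective, not the exact `contrastors` implementation; the function name and temperature value are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.07):
    """InfoNCE with in-batch negatives (illustrative, not the contrastors code).

    query_emb, doc_emb: (batch, dim) tensors where row i of each forms a
    positive pair; every other row in the batch acts as a negative.
    """
    q = F.normalize(query_emb, p=2, dim=1)
    d = F.normalize(doc_emb, p=2, dim=1)
    logits = q @ d.T / temperature                     # (batch, batch) scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```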
In the second finetuning stage, higher-quality labeled datasets, such as search queries and answers from web searches, are leveraged. Data curation and hard-example mining are crucial in this stage.
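One common way hard-example mining enters the objective is as extra negative columns appended to the in-batch ones. A hedged sketch extending `info_nce_loss` above (the `hard_neg_emb` tensor, its layout, and the function name are assumptions for illustration, not the `contrastors` API):

```python
import torch
import torch.nn.functional as F

def info_nce_with_hard_negatives(query_emb, doc_emb, hard_neg_emb, temperature=0.07):
    """InfoNCE where queries also score against explicitly mined hard negatives.

    hard_neg_emb: (num_hard, dim) embeddings of mined hard negatives, appended
    to the in-batch documents as additional negative columns.
    """
    q = F.normalize(query_emb, p=2, dim=1)
    d = F.normalize(torch.cat([doc_emb, hard_neg_emb], dim=0), p=2, dim=1)
    logits = q @ d.T / temperature                     # (batch, batch + num_hard)
    labels = torch.arange(q.size(0), device=q.device)  # positive docs keep their diagonal index
    return F.cross_entropy(logits, labels)
```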
For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and the corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1).
The training data is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
## Join the Nomic Community

- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
## Citation

If you find the model, dataset, or training code useful, please cite our work:
```bibtex
@misc{nussbaum2024nomic,
      title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
      author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
      year={2024},
      eprint={2402.01613},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```