zpn committed
Commit c9c6080 · 1 Parent(s): 24800f1

docs: citation etc

Files changed (1): README.md (+38, -1)
README.md CHANGED
@@ -2956,4 +2956,41 @@ embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
  embeddings = embeddings[:, :matryoshka_dim]
  embeddings = F.normalize(embeddings, p=2, dim=1)
  print(embeddings)
- ```
+ ```
+
+ ## Training
+
+ Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!
+
+ [![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/pjhJhuNyRfPagRd_c_iUz.webp)](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)
+
+ We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048), the first unsupervised contrastive stage trains on a dataset of weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations of news articles.
+
+ In the second finetuning stage, higher-quality labeled datasets, such as search queries and answers from web searches, are leveraged. Data curation and hard-example mining are crucial in this stage.
+
+ For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and the corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1).
+
+ The training data is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
+
+ ## Join the Nomic Community
+
+ - Nomic: [https://nomic.ai](https://nomic.ai)
+ - Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
+ - Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
+
+ ## Citation
+
+ If you find the model, dataset, or training code useful, please cite our work:
+
+ ```bibtex
+ @misc{nussbaum2024nomic,
+       title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
+       author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
+       year={2024},
+       eprint={2402.01613},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```
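Note: the hunk header above references a `mean_pooling` helper whose definition lives outside this hunk, earlier in the README. As context, here is a minimal sketch of the standard masked mean pooling used in Transformers usage snippets like this one; the README's exact definition may differ slightly, so treat this as an assumption:

```python
import torch

def mean_pooling(model_output, attention_mask):
    # First element of model_output holds the per-token hidden states: (batch, seq_len, dim).
    token_embeddings = model_output[0]
    # Broadcast the attention mask over the hidden dimension so padding tokens
    # contribute nothing to the sum.
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Average only over real (non-padding) tokens; clamp avoids division by zero.
    return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
```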
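The added Training section describes an unsupervised contrastive stage over weakly related text pairs. As an illustration of the general technique only (not the actual `contrastors` training code; the real loss, temperature, and batching details may differ), here is a minimal in-batch-negatives InfoNCE sketch:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.07):
    """Contrastive loss over a batch of (query, document) pairs.

    query_emb, doc_emb: (batch, dim) tensors where row i of each side is a
    positive pair and every other row in the batch serves as a negative.
    The temperature value is illustrative, not Nomic's setting.
    """
    q = F.normalize(query_emb, p=2, dim=1)
    d = F.normalize(doc_emb, p=2, dim=1)
    # Cosine-similarity logits between every query and every document in the batch.
    logits = (q @ d.T) / temperature
    # The matching document for query i sits on the diagonal.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

For the second, supervised stage the section highlights hard-example mining: in this style of loss, mined hard negatives would typically be encoded and concatenated onto the candidate set, so each query is scored against its positive, the in-batch negatives, and its hard negatives.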