diptanuc committed
Commit: befcf4e
Parent: e8ad846

fixed the datasets file

Files changed (1): README.md (+9 -10)
README.md CHANGED
@@ -9,7 +9,6 @@ license: apache-2.0
  datasets:
  - s2orc
  - flax-sentence-embeddings/stackexchange_xml
- - MS Marco
  - gooaq
  - yahoo_answers_topics
  - code_search_net
@@ -99,18 +98,18 @@ For an automated evaluation of this model, see the *Sentence Embeddings Benchmar

  ## Background

  The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised
  contrastive learning objective. We used the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model and fine-tuned it on a
  dataset of 1B sentence pairs. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences was actually paired with it in our dataset.

  We developed this model during the
  [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104)
  organized by Hugging Face, as part of the project
  [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project (7 TPU v3-8s), as well as guidance on efficient deep learning frameworks from members of Google's Flax, JAX, and Cloud teams.

  ## Intended uses

  Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures
  the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.

  By default, input text longer than 384 word pieces is truncated.
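
As an illustration of this intended use, here is a minimal encoding sketch with the `sentence-transformers` library. The model id below is a placeholder assumption (this excerpt does not name the checkpoint), and `util.cos_sim` is the library's cosine-similarity helper.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder id: substitute the checkpoint this card describes.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = [
    "A man is eating food.",
    "Someone is having a meal.",
    "The sky is clear tonight.",
]

# Each text becomes one fixed-size vector; inputs longer than
# 384 word pieces are truncated by default.
embeddings = model.encode(sentences)

# Cosine similarity between the first sentence and the other two;
# the paraphrase should score higher than the unrelated sentence.
print(util.cos_sim(embeddings[0], embeddings[1:]))
```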
@@ -118,11 +117,11 @@ By default, input text longer than 384 word pieces is truncated.

  ## Training procedure

  ### Pre-training

  We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model. Please refer to its model card for more detailed information about the pre-training procedure.

  ### Fine-tuning

  We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity between each possible sentence pair in the batch.
  We then apply the cross-entropy loss by comparing with the true pairs.
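
To make this objective concrete, here is a minimal PyTorch sketch of an in-batch contrastive loss of this kind. It assumes two embedding batches `a` and `b` in which row `i` of each forms a true pair; the `scale` temperature is an illustrative assumption, not a documented training detail.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(a: torch.Tensor, b: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    # Cosine similarity between every (a_i, b_j) combination in the batch.
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    sims = a @ b.T * scale  # (batch, batch) scaled cosine similarities

    # The true pair of a[i] is b[i]: cross-entropy pushes each diagonal
    # entry above the in-batch negatives in its row.
    labels = torch.arange(a.size(0))
    return F.cross_entropy(sims, labels)

# Toy usage with random stand-in embeddings:
a, b = torch.randn(8, 768), torch.randn(8, 768)
print(contrastive_loss(a, b))
```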
@@ -162,7 +161,7 @@ We sampled each dataset given a weighted probability which configuration is deta
  | [Eli5](https://huggingface.co/datasets/eli5) | [paper](https://doi.org/10.18653/v1/p19-1346) | 325,475 |
  | [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/229/33) | 317,695 |
  | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles) | | 304,525 |
  | AllNLI ([SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/)) | [paper SNLI](https://doi.org/10.18653/v1/d15-1075), [paper MultiNLI](https://doi.org/10.18653/v1/n18-1101) | 277,230 |
  | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies) | | 250,519 |
  | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies) | | 250,460 |
  | [Sentence Compression](https://github.com/google-research-datasets/sentence-compression) | [paper](https://www.aclweb.org/anthology/D13-1155/) | 180,000 |
@@ -173,4 +172,4 @@ We sampled each dataset given a weighted probability which configuration is deta
  | [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/1455) | 100,231 |
  | [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) | [paper](https://aclanthology.org/P18-2124.pdf) | 87,599 |
  | [TriviaQA](https://huggingface.co/datasets/trivia_qa) | - | 73,346 |
  | **Total** | | **1,170,060,424** |
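
Since batches are drawn from these datasets with weighted probabilities, here is an illustrative sketch of picking a source dataset with probability proportional to its size, using a few rows from the table above; the project's actual sampling configuration is not shown in this excerpt.

```python
import random

# Sizes copied from a few rows of the table above.
dataset_sizes = {
    "eli5": 325_475,
    "flickr30k": 317_695,
    "squad2.0": 87_599,
}

names = list(dataset_sizes)
weights = [dataset_sizes[n] for n in names]

# Each draw picks a source dataset with probability proportional to its
# size; a real data loader would then yield one sentence pair from it.
print(random.choices(names, weights=weights, k=5))
```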
 