fixed the datasets file
README.md (changed)
@@ -9,7 +9,6 @@ license: apache-2.0
 datasets:
 - s2orc
 - flax-sentence-embeddings/stackexchange_xml
-- MS Marco
 - gooaq
 - yahoo_answers_topics
 - code_search_net
@@ -99,18 +98,18 @@ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*

## Background

The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised
contrastive learning objective. We used the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model and fine-tuned it on a
dataset of 1B sentence pairs. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences was actually paired with it in our dataset.

We developed this model during the
[Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104)
organized by Hugging Face, as part of the project
[Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project (7 TPU v3-8s), as well as guidance from Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.

## Intended uses

Our model is intended to be used as a sentence and short-paragraph encoder. Given an input text, it outputs a vector which captures
the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.

By default, input text longer than 384 word pieces is truncated.
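For the similarity tasks mentioned above, two sentence vectors are typically compared with cosine similarity. A minimal sketch, using stand-in vectors rather than real model outputs:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Similarity score in [-1, 1] between two sentence vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in vectors; in practice these would come from the encoder.
u = np.array([1.0, 0.0, 0.5])
v = np.array([0.9, 0.1, 0.4])
score = cosine_similarity(u, v)
```

Scores close to 1 indicate semantically similar inputs; scores near 0 indicate unrelated ones.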
@@ -118,11 +117,11 @@ By default, input text longer than 384 word pieces is truncated.

## Training procedure

### Pre-training

We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model. Please refer to the model card for more detailed information about the pre-training procedure.

### Fine-tuning

We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity between each possible sentence pair in the batch.
We then apply the cross-entropy loss by comparing with the true pairs.
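The fine-tuning objective above can be sketched as follows. This is a minimal NumPy illustration, not the project's actual training code; the `scale` factor is an assumed hyperparameter, and the embeddings here are random stand-ins:

```python
import numpy as np

def in_batch_contrastive_loss(emb_a: np.ndarray, emb_b: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine similarities: for each row of emb_a,
    the row with the same index in emb_b is the true pair; every other row
    in the batch serves as a random negative."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    scores = scale * (a @ b.T)                    # (batch, batch) cosine similarities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability for softmax
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # true pairs sit on the diagonal
```

The loss is minimized when each sentence is closer (in cosine space) to its true pair than to any other sentence in the batch.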
@@ -162,7 +161,7 @@ We sampled each dataset given a weighted probability whose configuration is detailed

| [Eli5](https://huggingface.co/datasets/eli5) | [paper](https://doi.org/10.18653/v1/p19-1346) | 325,475 |
| [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/229/33) | 317,695 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles) | | 304,525 |
| AllNLI ([SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/)) | [paper SNLI](https://doi.org/10.18653/v1/d15-1075), [paper MultiNLI](https://doi.org/10.18653/v1/n18-1101) | 277,230 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies) | | 250,519 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies) | | 250,460 |
| [Sentence Compression](https://github.com/google-research-datasets/sentence-compression) | [paper](https://www.aclweb.org/anthology/D13-1155/) | 180,000 |
@@ -173,4 +172,4 @@ We sampled each dataset given a weighted probability whose configuration is detailed

| [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/1455) | 100,231 |
| [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) | [paper](https://aclanthology.org/P18-2124.pdf) | 87,599 |
| [TriviaQA](https://huggingface.co/datasets/trivia_qa) | - | 73,346 |
| **Total** | | **1,170,060,424** |
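The weighted sampling over the datasets in the table above can be sketched as follows. The dataset names and weights below are purely illustrative stand-ins, not the project's actual sampling configuration:

```python
import random

# Hypothetical weights: the probability of drawing the next training example
# from each dataset. The real configuration assigns one weight per dataset.
DATASET_WEIGHTS = {"s2orc": 0.30, "stackexchange": 0.35, "gooaq": 0.20, "squad2": 0.15}

def sample_dataset(rng: random.Random) -> str:
    """Pick a dataset name with probability proportional to its weight."""
    names, weights = zip(*DATASET_WEIGHTS.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)
draws = [sample_dataset(rng) for _ in range(1000)]
```

Over many draws, each dataset contributes examples roughly in proportion to its weight.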