Commit f88ec83 by ramsrigouthamg
1 parent: 29d26f7

initial commit

1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
README.md ADDED
@@ -0,0 +1,153 @@
+ ---
+ pipeline_tag: feature-extraction
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ language: en
+ license: apache-2.0
+ ---
+
+
+ # all-mpnet-base-v2
+ This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
+
+ ## Usage (Sentence-Transformers)
+ Using this model is easy once you have [sentence-transformers](https://www.SBERT.net) installed:
+
+ ```
+ pip install -U sentence-transformers
+ ```
+
+ Then you can use the model like this:
+ ```python
+ from sentence_transformers import SentenceTransformer
+ sentences = ["This is an example sentence", "Each sentence is converted"]
+
+ model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
+ embeddings = model.encode(sentences)
+ print(embeddings)
+ ```
+
+ ## Usage (HuggingFace Transformers)
+ Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+ import torch.nn.functional as F
+
+ # Mean pooling - take the attention mask into account for correct averaging
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+ # Sentences we want sentence embeddings for
+ sentences = ['This is an example sentence', 'Each sentence is converted']
+
+ # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
+ model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')
+
+ # Tokenize sentences
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+ # Compute token embeddings
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+
+ # Perform pooling
+ sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+
+ # Normalize embeddings
+ sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
+
+ print("Sentence embeddings:")
+ print(sentence_embeddings)
+ ```
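
To make the pooling step above concrete, here is a minimal dependency-free sketch of masked mean pooling followed by L2 normalization. It uses plain Python lists in place of tensors; the helper names `masked_mean_pool` and `l2_normalize` and the toy 2-dimensional vectors are illustrative assumptions, not part of the model's code.

```python
import math

def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padding (mask 0) is ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    count = max(count, 1e-9)  # same kind of guard as torch.clamp(..., min=1e-9)
    return [s / count for s in summed]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

# One toy "sentence": two real tokens and one padding token (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)  # the padding vector does not contribute
unit = l2_normalize(pooled)
print(pooled, unit)
```

Because the padding token is masked out, its large values have no effect on the pooled vector, which is why the attention mask matters when sentences in a batch have different lengths.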
+
+ ## Evaluation Results
+
+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=sentence-transformers/all-mpnet-base-v2)
+
+ ------
+
+ ## Background
+
+ The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised
+ contrastive learning objective. We used the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model and fine-tuned it on a
+ dataset of 1B sentence pairs. We use a contrastive learning objective: given a sentence from the pair, the model should predict which of a set of randomly sampled other sentences was actually paired with it in our dataset.
+
+ We developed this model during the
+ [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
+ organized by Hugging Face. We developed this model as part of the project:
+ [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as support from members of Google's Flax, JAX, and Cloud teams on efficient deep learning frameworks.
+
+ ## Intended uses
+
+ Our model is intended to be used as a sentence and short-paragraph encoder. Given an input text, it outputs a vector that captures
+ the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.
+
+ By default, input text longer than 384 word pieces is truncated.
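
As an illustration of the information retrieval use mentioned above, the following sketch ranks a small corpus against a query by cosine similarity. The 3-dimensional vectors are hypothetical stand-ins for the model's 768-dimensional embeddings, and the function names are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, corpus_vecs):
    """Return (score, index) pairs sorted from most to least similar."""
    scores = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(corpus_vecs)]
    return sorted(scores, reverse=True)

# Hypothetical toy embeddings; real ones would come from model.encode(...).
corpus = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, corpus)
for score, idx in ranking:
    print(f"corpus[{idx}] score={score:.3f}")
```

If the embeddings are already L2-normalized, the cosine similarity reduces to a plain dot product, which is usually cheaper to compute over a large corpus.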
+
+
+ ## Training procedure
+
+ ### Pre-training
+
+ We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model. Please refer to its model card for more detailed information about the pre-training procedure.
+
+ ### Fine-tuning
+
+ We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each possible sentence pair in the batch.
+ We then apply the cross-entropy loss by comparing with the true pairs.
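
The objective described above can be sketched roughly as follows: for each anchor sentence, the similarities to all candidates in the batch act as logits, and the true partner is the correct class. This is a simplified pure-Python illustration, not the project's training code; the function names and the `scale` factor applied to the similarities are assumptions that the text does not specify.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct 'class' for anchor i is positive i.

    `scale` multiplies the cosine similarities before the softmax
    (an assumed value, chosen only for illustration).
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-dimensional embeddings (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled = in_batch_contrastive_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

When each anchor is closest to its own partner the loss is near zero, and when the partners are swapped the loss is large, which is the signal that pulls true pairs together and pushes the other in-batch sentences apart.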
+
+ #### Hyperparameters
+
+ We trained our model on a TPU v3-8. We trained the model for 100k steps using a batch size of 1024 (128 per TPU core).
+ We used a learning rate warm-up of 500 steps. The sequence length was limited to 128 tokens. We used the AdamW optimizer with
+ a 2e-5 learning rate. The full training script is available in this repository: `train_script.py`.
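
The figures above can be checked with a little arithmetic, and the warm-up can be sketched as a simple schedule. The linear ramp from zero and the constant rate after warm-up are assumptions for illustration only; the text gives the warm-up length and peak rate but not the exact schedule shape.

```python
def learning_rate(step, peak_lr=2e-5, warmup_steps=500):
    """Linear warm-up from 0 to peak_lr, then constant (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

per_core_batch = 128
tpu_cores = 8                               # a TPU v3-8 exposes 8 cores
global_batch = per_core_batch * tpu_cores   # batch size per optimizer step
training_steps = 100_000
pairs_seen = global_batch * training_steps  # sentence pairs processed in total

print(global_batch, pairs_seen)
print(learning_rate(0), learning_rate(250), learning_rate(500))
```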
+
+ #### Training data
+
+ We use the concatenation of multiple datasets to fine-tune our model. The total number of sentence pairs is above 1 billion.
+ We sampled each dataset with a weighted probability; the configuration is detailed in the `data_config.json` file.
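
One plausible reading of the weighted sampling described above is that each dataset is picked with probability proportional to its `weight` field. The sketch below assumes exactly that; the three entries are copied from `data_config.json` in this commit, while the function names and the proportional interpretation are assumptions and may differ from what `train_script.py` actually does.

```python
import random

# Three entries copied from data_config.json (name, lines, weight).
data_config = [
    {"name": "flickr30k_captions.jsonl.gz", "lines": 317695, "weight": 1},
    {"name": "stackexchange_title_body/law.stackexchange.com.jsonl.gz", "lines": 17941, "weight": 2},
    {"name": "stackexchange_title_body/biology.stackexchange.com.jsonl.gz", "lines": 24447, "weight": 3},
]

def sampling_probabilities(config):
    """Probability of picking each dataset, proportional to its weight (assumed)."""
    total = sum(entry["weight"] for entry in config)
    return {entry["name"]: entry["weight"] / total for entry in config}

def sample_dataset(config, rng):
    """Draw one dataset name according to the weights."""
    names = [entry["name"] for entry in config]
    weights = [entry["weight"] for entry in config]
    return rng.choices(names, weights=weights, k=1)[0]

probs = sampling_probabilities(data_config)
rng = random.Random(0)  # seeded so the draws are reproducible
draws = [sample_dataset(data_config, rng) for _ in range(6000)]
for name, p in probs.items():
    print(f"{name}: p={p:.3f}, drawn {draws.count(name)} times")
```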
+
+
+ | Dataset | Paper | Number of training tuples |
+ |--------------------------------------------------------|:----------------------------------------:|:--------------------------:|
+ | [Reddit comments (2015-2018)](https://github.com/PolyAI-LDN/conversational-datasets/tree/master/reddit) | [paper](https://arxiv.org/abs/1904.06472) | 726,484,430 |
+ | [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Abstracts) | [paper](https://aclanthology.org/2020.acl-main.447/) | 116,288,806 |
+ | [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs | [paper](https://doi.org/10.1145/2623330.2623677) | 77,427,422 |
+ | [PAQ](https://github.com/facebookresearch/PAQ) (Question, Answer) pairs | [paper](https://arxiv.org/abs/2102.07033) | 64,371,441 |
+ | [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Titles) | [paper](https://aclanthology.org/2020.acl-main.447/) | 52,603,982 |
+ | [S2ORC](https://github.com/allenai/s2orc) (Title, Abstract) | [paper](https://aclanthology.org/2020.acl-main.447/) | 41,769,185 |
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs | - | 25,316,456 |
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title+Body, Answer) pairs | - | 21,396,559 |
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs | - | 21,396,559 |
+ | [MS MARCO](https://microsoft.github.io/msmarco/) triplets | [paper](https://doi.org/10.1145/3404835.3462804) | 9,144,553 |
+ | [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) | [paper](https://arxiv.org/pdf/2104.08727.pdf) | 3,012,496 |
+ | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 1,198,260 |
+ | [Code Search](https://huggingface.co/datasets/code_search_net) | - | 1,151,414 |
+ | [COCO](https://cocodataset.org/#home) Image captions | [paper](https://link.springer.com/chapter/10.1007%2F978-3-319-10602-1_48) | 828,395 |
+ | [SPECTER](https://github.com/allenai/specter) citation triplets | [paper](https://doi.org/10.18653/v1/2020.acl-main.207) | 684,100 |
+ | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 681,164 |
+ | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 659,896 |
+ | [SearchQA](https://huggingface.co/datasets/search_qa) | [paper](https://arxiv.org/abs/1704.05179) | 582,261 |
+ | [Eli5](https://huggingface.co/datasets/eli5) | [paper](https://doi.org/10.18653/v1/p19-1346) | 325,475 |
+ | [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/229/33) | 317,695 |
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles) | - | 304,525 |
+ | AllNLI ([SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/)) | [paper SNLI](https://doi.org/10.18653/v1/d15-1075), [paper MultiNLI](https://doi.org/10.18653/v1/n18-1101) | 277,230 |
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies) | - | 250,519 |
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies) | - | 250,460 |
+ | [Sentence Compression](https://github.com/google-research-datasets/sentence-compression) | [paper](https://www.aclweb.org/anthology/D13-1155/) | 180,000 |
+ | [Wikihow](https://github.com/pvl/wikihow_pairs_dataset) | [paper](https://arxiv.org/abs/1810.09305) | 128,542 |
+ | [Altlex](https://github.com/chridey/altlex/) | [paper](https://aclanthology.org/P16-1135.pdf) | 112,696 |
+ | [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) | - | 103,663 |
+ | [Simple Wikipedia](https://cs.pomona.edu/~dkauchak/simplification/) | [paper](https://www.aclweb.org/anthology/P11-2117/) | 102,225 |
+ | [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/1455) | 100,231 |
+ | [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) | [paper](https://aclanthology.org/P18-2124.pdf) | 87,599 |
+ | [TriviaQA](https://huggingface.co/datasets/trivia_qa) | - | 73,346 |
+ | **Total** | | **1,170,060,424** |
config.json ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_name_or_path": "microsoft/mpnet-base",
3
+ "architectures": [
4
+ "MPNetForMaskedLM"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "bos_token_id": 0,
8
+ "eos_token_id": 2,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 3072,
14
+ "layer_norm_eps": 1e-05,
15
+ "max_position_embeddings": 514,
16
+ "model_type": "mpnet",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 12,
19
+ "pad_token_id": 1,
20
+ "relative_attention_num_buckets": 32,
21
+ "transformers_version": "4.8.2",
22
+ "vocab_size": 30527
23
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "__version__": {
+     "sentence_transformers": "2.0.0",
+     "transformers": "4.6.1",
+     "pytorch": "1.8.1"
+   }
+ }
data_config.json ADDED
@@ -0,0 +1,1452 @@
1
+ [
2
+ {
3
+ "name": "stackexchange_title_body/skeptics.stackexchange.com.jsonl.gz",
4
+ "lines": 10009,
5
+ "weight": 1
6
+ },
7
+ {
8
+ "name": "stackexchange_TitleBody_Answer/islam.stackexchange.com.jsonl.gz",
9
+ "lines": 10052,
10
+ "weight": 1
11
+ },
12
+ {
13
+ "name": "stackexchange_Title_Answer/islam.stackexchange.com.jsonl.gz",
14
+ "lines": 10052,
15
+ "weight": 1
16
+ },
17
+ {
18
+ "name": "stackexchange_TitleBody_Answer/anime.stackexchange.com.jsonl.gz",
19
+ "lines": 10131,
20
+ "weight": 1
21
+ },
22
+ {
23
+ "name": "stackexchange_Title_Answer/anime.stackexchange.com.jsonl.gz",
24
+ "lines": 10131,
25
+ "weight": 1
26
+ },
27
+ {
28
+ "name": "stackexchange_title_body/writers.stackexchange.com.jsonl.gz",
29
+ "lines": 10157,
30
+ "weight": 1
31
+ },
32
+ {
33
+ "name": "stackexchange_title_body/astronomy.stackexchange.com.jsonl.gz",
34
+ "lines": 10462,
35
+ "weight": 1
36
+ },
37
+ {
38
+ "name": "stackexchange_title_body/vi.stackexchange.com.jsonl.gz",
39
+ "lines": 10551,
40
+ "weight": 1
41
+ },
42
+ {
43
+ "name": "stackexchange_TitleBody_Answer/french.stackexchange.com.jsonl.gz",
44
+ "lines": 10578,
45
+ "weight": 1
46
+ },
47
+ {
48
+ "name": "stackexchange_Title_Answer/french.stackexchange.com.jsonl.gz",
49
+ "lines": 10578,
50
+ "weight": 1
51
+ },
52
+ {
53
+ "name": "stackexchange_title_body/cstheory.stackexchange.com.jsonl.gz",
54
+ "lines": 10642,
55
+ "weight": 1
56
+ },
57
+ {
58
+ "name": "stackexchange_TitleBody_Answer/civicrm.stackexchange.com.jsonl.gz",
59
+ "lines": 10648,
60
+ "weight": 1
61
+ },
62
+ {
63
+ "name": "stackexchange_Title_Answer/civicrm.stackexchange.com.jsonl.gz",
64
+ "lines": 10648,
65
+ "weight": 1
66
+ },
67
+ {
68
+ "name": "stackexchange_TitleBody_Answer/expressionengine.stackexchange.com.jsonl.gz",
69
+ "lines": 10742,
70
+ "weight": 1
71
+ },
72
+ {
73
+ "name": "stackexchange_Title_Answer/expressionengine.stackexchange.com.jsonl.gz",
74
+ "lines": 10742,
75
+ "weight": 1
76
+ },
77
+ {
78
+ "name": "stackexchange_title_body/engineering.stackexchange.com.jsonl.gz",
79
+ "lines": 10753,
80
+ "weight": 1
81
+ },
82
+ {
83
+ "name": "stackexchange_TitleBody_Answer/history.stackexchange.com.jsonl.gz",
84
+ "lines": 10766,
85
+ "weight": 1
86
+ },
87
+ {
88
+ "name": "stackexchange_Title_Answer/history.stackexchange.com.jsonl.gz",
89
+ "lines": 10766,
90
+ "weight": 1
91
+ },
92
+ {
93
+ "name": "stackexchange_title_body/french.stackexchange.com.jsonl.gz",
94
+ "lines": 10794,
95
+ "weight": 1
96
+ },
97
+ {
98
+ "name": "stackexchange_TitleBody_Answer/politics.stackexchange.com.jsonl.gz",
99
+ "lines": 11047,
100
+ "weight": 1
101
+ },
102
+ {
103
+ "name": "stackexchange_Title_Answer/politics.stackexchange.com.jsonl.gz",
104
+ "lines": 11047,
105
+ "weight": 1
106
+ },
107
+ {
108
+ "name": "stackexchange_title_body/economics.stackexchange.com.jsonl.gz",
109
+ "lines": 11115,
110
+ "weight": 1
111
+ },
112
+ {
113
+ "name": "stackexchange_TitleBody_Answer/craftcms.stackexchange.com.jsonl.gz",
114
+ "lines": 11236,
115
+ "weight": 1
116
+ },
117
+ {
118
+ "name": "stackexchange_Title_Answer/craftcms.stackexchange.com.jsonl.gz",
119
+ "lines": 11236,
120
+ "weight": 1
121
+ },
122
+ {
123
+ "name": "stackexchange_title_body/anime.stackexchange.com.jsonl.gz",
124
+ "lines": 11444,
125
+ "weight": 1
126
+ },
127
+ {
128
+ "name": "stackexchange_TitleBody_Answer/christianity.stackexchange.com.jsonl.gz",
129
+ "lines": 11498,
130
+ "weight": 1
131
+ },
132
+ {
133
+ "name": "stackexchange_Title_Answer/christianity.stackexchange.com.jsonl.gz",
134
+ "lines": 11498,
135
+ "weight": 1
136
+ },
137
+ {
138
+ "name": "stackexchange_TitleBody_Answer/softwarerecs.stackexchange.com.jsonl.gz",
139
+ "lines": 11761,
140
+ "weight": 1
141
+ },
142
+ {
143
+ "name": "stackexchange_Title_Answer/softwarerecs.stackexchange.com.jsonl.gz",
144
+ "lines": 11761,
145
+ "weight": 1
146
+ },
147
+ {
148
+ "name": "stackexchange_TitleBody_Answer/boardgames.stackexchange.com.jsonl.gz",
149
+ "lines": 11805,
150
+ "weight": 1
151
+ },
152
+ {
153
+ "name": "stackexchange_Title_Answer/boardgames.stackexchange.com.jsonl.gz",
154
+ "lines": 11805,
155
+ "weight": 1
156
+ },
157
+ {
158
+ "name": "stackexchange_title_body/islam.stackexchange.com.jsonl.gz",
159
+ "lines": 11853,
160
+ "weight": 1
161
+ },
162
+ {
163
+ "name": "stackexchange_title_body/expressionengine.stackexchange.com.jsonl.gz",
164
+ "lines": 11866,
165
+ "weight": 1
166
+ },
167
+ {
168
+ "name": "stackexchange_title_body/politics.stackexchange.com.jsonl.gz",
169
+ "lines": 11894,
170
+ "weight": 1
171
+ },
172
+ {
173
+ "name": "stackexchange_title_body/history.stackexchange.com.jsonl.gz",
174
+ "lines": 12021,
175
+ "weight": 1
176
+ },
177
+ {
178
+ "name": "stackexchange_title_body/christianity.stackexchange.com.jsonl.gz",
179
+ "lines": 12108,
180
+ "weight": 1
181
+ },
182
+ {
183
+ "name": "stackexchange_title_body/boardgames.stackexchange.com.jsonl.gz",
184
+ "lines": 12149,
185
+ "weight": 1
186
+ },
187
+ {
188
+ "name": "flickr30k_captions.jsonl.gz",
189
+ "lines": 317695,
190
+ "weight": 1
191
+ },
192
+ {
193
+ "name": "coco_captions.jsonl.gz",
194
+ "lines": 828395,
195
+ "weight": 1
196
+ },
197
+ {
198
+ "name": "codesearchnet.jsonl.gz",
199
+ "lines": 1151414,
200
+ "weight": 1
201
+ },
202
+ {
203
+ "name": "stackexchange_title_body/civicrm.stackexchange.com.jsonl.gz",
204
+ "lines": 12543,
205
+ "weight": 2
206
+ },
207
+ {
208
+ "name": "stackexchange_title_body/craftcms.stackexchange.com.jsonl.gz",
209
+ "lines": 12574,
210
+ "weight": 2
211
+ },
212
+ {
213
+ "name": "stackexchange_TitleBody_Answer/networkengineering.stackexchange.com.jsonl.gz",
214
+ "lines": 12590,
215
+ "weight": 2
216
+ },
217
+ {
218
+ "name": "stackexchange_Title_Answer/networkengineering.stackexchange.com.jsonl.gz",
219
+ "lines": 12590,
220
+ "weight": 2
221
+ },
222
+ {
223
+ "name": "stackexchange_TitleBody_Answer/space.stackexchange.com.jsonl.gz",
224
+ "lines": 12893,
225
+ "weight": 2
226
+ },
227
+ {
228
+ "name": "stackexchange_Title_Answer/space.stackexchange.com.jsonl.gz",
229
+ "lines": 12893,
230
+ "weight": 2
231
+ },
232
+ {
233
+ "name": "stackexchange_TitleBody_Answer/quant.stackexchange.com.jsonl.gz",
234
+ "lines": 12933,
235
+ "weight": 2
236
+ },
237
+ {
238
+ "name": "stackexchange_Title_Answer/quant.stackexchange.com.jsonl.gz",
239
+ "lines": 12933,
240
+ "weight": 2
241
+ },
242
+ {
243
+ "name": "stackexchange_TitleBody_Answer/philosophy.stackexchange.com.jsonl.gz",
244
+ "lines": 13114,
245
+ "weight": 2
246
+ },
247
+ {
248
+ "name": "stackexchange_Title_Answer/philosophy.stackexchange.com.jsonl.gz",
249
+ "lines": 13114,
250
+ "weight": 2
251
+ },
252
+ {
253
+ "name": "stackexchange_TitleBody_Answer/gardening.stackexchange.com.jsonl.gz",
254
+ "lines": 13246,
255
+ "weight": 2
256
+ },
257
+ {
258
+ "name": "stackexchange_Title_Answer/gardening.stackexchange.com.jsonl.gz",
259
+ "lines": 13246,
260
+ "weight": 2
261
+ },
262
+ {
263
+ "name": "stackexchange_title_body/hinduism.stackexchange.com.jsonl.gz",
264
+ "lines": 13450,
265
+ "weight": 2
266
+ },
267
+ {
268
+ "name": "stackexchange_title_body/networkengineering.stackexchange.com.jsonl.gz",
269
+ "lines": 13454,
270
+ "weight": 2
271
+ },
272
+ {
273
+ "name": "stackexchange_TitleBody_Answer/german.stackexchange.com.jsonl.gz",
274
+ "lines": 13733,
275
+ "weight": 2
276
+ },
277
+ {
278
+ "name": "stackexchange_Title_Answer/german.stackexchange.com.jsonl.gz",
279
+ "lines": 13733,
280
+ "weight": 2
281
+ },
282
+ {
283
+ "name": "stackexchange_title_body/german.stackexchange.com.jsonl.gz",
284
+ "lines": 13950,
285
+ "weight": 2
286
+ },
287
+ {
288
+ "name": "stackexchange_title_body/philosophy.stackexchange.com.jsonl.gz",
289
+ "lines": 14829,
290
+ "weight": 2
291
+ },
292
+ {
293
+ "name": "stackexchange_title_body/gardening.stackexchange.com.jsonl.gz",
294
+ "lines": 15136,
295
+ "weight": 2
296
+ },
297
+ {
298
+ "name": "stackexchange_title_body/space.stackexchange.com.jsonl.gz",
299
+ "lines": 15142,
300
+ "weight": 2
301
+ },
302
+ {
303
+ "name": "stackexchange_TitleBody_Answer/bicycles.stackexchange.com.jsonl.gz",
304
+ "lines": 15708,
305
+ "weight": 2
306
+ },
307
+ {
308
+ "name": "stackexchange_Title_Answer/bicycles.stackexchange.com.jsonl.gz",
309
+ "lines": 15708,
310
+ "weight": 2
311
+ },
312
+ {
313
+ "name": "stackexchange_TitleBody_Answer/law.stackexchange.com.jsonl.gz",
314
+ "lines": 16133,
315
+ "weight": 2
316
+ },
317
+ {
318
+ "name": "stackexchange_Title_Answer/law.stackexchange.com.jsonl.gz",
319
+ "lines": 16133,
320
+ "weight": 2
321
+ },
322
+ {
323
+ "name": "stackexchange_TitleBody_Answer/arduino.stackexchange.com.jsonl.gz",
324
+ "lines": 16281,
325
+ "weight": 2
326
+ },
327
+ {
328
+ "name": "stackexchange_Title_Answer/arduino.stackexchange.com.jsonl.gz",
329
+ "lines": 16281,
330
+ "weight": 2
331
+ },
332
+ {
333
+ "name": "stackexchange_title_body/bicycles.stackexchange.com.jsonl.gz",
334
+ "lines": 16353,
335
+ "weight": 2
336
+ },
337
+ {
338
+ "name": "stackexchange_TitleBody_Answer/emacs.stackexchange.com.jsonl.gz",
339
+ "lines": 16830,
340
+ "weight": 2
341
+ },
342
+ {
343
+ "name": "stackexchange_Title_Answer/emacs.stackexchange.com.jsonl.gz",
344
+ "lines": 16830,
345
+ "weight": 2
346
+ },
347
+ {
348
+ "name": "stackexchange_title_body/quant.stackexchange.com.jsonl.gz",
349
+ "lines": 17261,
350
+ "weight": 2
351
+ },
352
+ {
353
+ "name": "stackexchange_TitleBody_Answer/dsp.stackexchange.com.jsonl.gz",
354
+ "lines": 17430,
355
+ "weight": 2
356
+ },
357
+ {
358
+ "name": "stackexchange_Title_Answer/dsp.stackexchange.com.jsonl.gz",
359
+ "lines": 17430,
360
+ "weight": 2
361
+ },
362
+ {
363
+ "name": "stackexchange_TitleBody_Answer/puzzling.stackexchange.com.jsonl.gz",
364
+ "lines": 17448,
365
+ "weight": 2
366
+ },
367
+ {
368
+ "name": "stackexchange_Title_Answer/puzzling.stackexchange.com.jsonl.gz",
369
+ "lines": 17448,
370
+ "weight": 2
371
+ },
372
+ {
373
+ "name": "stackexchange_title_body/puzzling.stackexchange.com.jsonl.gz",
374
+ "lines": 17851,
375
+ "weight": 2
376
+ },
377
+ {
378
+ "name": "stackexchange_title_body/law.stackexchange.com.jsonl.gz",
379
+ "lines": 17941,
380
+ "weight": 2
381
+ },
382
+ {
383
+ "name": "stackexchange_TitleBody_Answer/movies.stackexchange.com.jsonl.gz",
384
+ "lines": 18243,
385
+ "weight": 2
386
+ },
387
+ {
388
+ "name": "stackexchange_Title_Answer/movies.stackexchange.com.jsonl.gz",
389
+ "lines": 18243,
390
+ "weight": 2
391
+ },
392
+ {
393
+ "name": "stackexchange_TitleBody_Answer/mechanics.stackexchange.com.jsonl.gz",
394
+ "lines": 18613,
395
+ "weight": 2
396
+ },
397
+ {
398
+ "name": "stackexchange_Title_Answer/mechanics.stackexchange.com.jsonl.gz",
399
+ "lines": 18613,
400
+ "weight": 2
401
+ },
402
+ {
403
+ "name": "stackexchange_TitleBody_Answer/aviation.stackexchange.com.jsonl.gz",
404
+ "lines": 18755,
405
+ "weight": 2
406
+ },
407
+ {
408
+ "name": "stackexchange_Title_Answer/aviation.stackexchange.com.jsonl.gz",
409
+ "lines": 18755,
410
+ "weight": 2
411
+ },
412
+ {
413
+ "name": "stackexchange_TitleBody_Answer/biology.stackexchange.com.jsonl.gz",
414
+ "lines": 19277,
415
+ "weight": 2
416
+ },
417
+ {
418
+ "name": "stackexchange_Title_Answer/biology.stackexchange.com.jsonl.gz",
419
+ "lines": 19277,
420
+ "weight": 2
421
+ },
422
+ {
423
+ "name": "stackexchange_TitleBody_Answer/crypto.stackexchange.com.jsonl.gz",
424
+ "lines": 19404,
425
+ "weight": 2
426
+ },
427
+ {
428
+ "name": "stackexchange_Title_Answer/crypto.stackexchange.com.jsonl.gz",
429
+ "lines": 19404,
430
+ "weight": 2
431
+ },
432
+ {
433
+ "name": "stackexchange_title_body/arduino.stackexchange.com.jsonl.gz",
434
+ "lines": 19553,
435
+ "weight": 2
436
+ },
437
+ {
438
+ "name": "stackexchange_TitleBody_Answer/music.stackexchange.com.jsonl.gz",
439
+ "lines": 19936,
440
+ "weight": 2
441
+ },
442
+ {
443
+ "name": "stackexchange_Title_Answer/music.stackexchange.com.jsonl.gz",
444
+ "lines": 19936,
445
+ "weight": 2
446
+ },
447
+ {
448
+ "name": "stackexchange_title_body/aviation.stackexchange.com.jsonl.gz",
449
+ "lines": 20139,
450
+ "weight": 2
451
+ },
452
+ {
453
+ "name": "stackexchange_title_body/softwarerecs.stackexchange.com.jsonl.gz",
454
+ "lines": 20142,
455
+ "weight": 2
456
+ },
457
+ {
458
+ "name": "stackexchange_title_body/movies.stackexchange.com.jsonl.gz",
459
+ "lines": 20181,
460
+ "weight": 2
461
+ },
462
+ {
463
+ "name": "stackexchange_TitleBody_Answer/datascience.stackexchange.com.jsonl.gz",
464
+ "lines": 20503,
465
+ "weight": 2
466
+ },
467
+ {
468
+ "name": "stackexchange_Title_Answer/datascience.stackexchange.com.jsonl.gz",
469
+ "lines": 20503,
470
+ "weight": 2
471
+ },
472
+ {
473
+ "name": "stackexchange_title_body/music.stackexchange.com.jsonl.gz",
474
+ "lines": 20636,
475
+ "weight": 2
476
+ },
477
+ {
478
+ "name": "stackexchange_TitleBody_Answer/japanese.stackexchange.com.jsonl.gz",
479
+ "lines": 20948,
480
+ "weight": 2
481
+ },
482
+ {
483
+ "name": "stackexchange_Title_Answer/japanese.stackexchange.com.jsonl.gz",
484
+ "lines": 20948,
485
+ "weight": 2
486
+ },
487
+ {
488
+ "name": "stackexchange_title_body/emacs.stackexchange.com.jsonl.gz",
489
+ "lines": 21055,
490
+ "weight": 2
491
+ },
492
+ {
493
+ "name": "stackexchange_title_body/dsp.stackexchange.com.jsonl.gz",
494
+ "lines": 21252,
495
+ "weight": 2
496
+ },
497
+ {
498
+ "name": "stackexchange_title_body/japanese.stackexchange.com.jsonl.gz",
499
+ "lines": 22056,
500
+ "weight": 2
501
+ },
502
+ {
503
+ "name": "stackexchange_TitleBody_Answer/bitcoin.stackexchange.com.jsonl.gz",
504
+ "lines": 22474,
505
+ "weight": 2
506
+ },
507
+ {
508
+ "name": "stackexchange_Title_Answer/bitcoin.stackexchange.com.jsonl.gz",
509
+ "lines": 22474,
510
+ "weight": 2
511
+ },
512
+ {
513
+ "name": "stackexchange_TitleBody_Answer/cooking.stackexchange.com.jsonl.gz",
514
+ "lines": 22641,
515
+ "weight": 2
516
+ },
517
+ {
518
+ "name": "stackexchange_Title_Answer/cooking.stackexchange.com.jsonl.gz",
519
+ "lines": 22641,
520
+ "weight": 2
521
+ },
522
+ {
523
+ "name": "stackexchange_title_body/mechanics.stackexchange.com.jsonl.gz",
524
+ "lines": 22868,
525
+ "weight": 2
526
+ },
527
+ {
528
+ "name": "stackexchange_TitleBody_Answer/photo.stackexchange.com.jsonl.gz",
529
+ "lines": 23204,
530
+ "weight": 2
531
+ },
532
+ {
533
+ "name": "stackexchange_Title_Answer/photo.stackexchange.com.jsonl.gz",
534
+ "lines": 23204,
535
+ "weight": 2
536
+ },
537
+ {
538
+ "name": "stackexchange_title_body/crypto.stackexchange.com.jsonl.gz",
539
+ "lines": 23231,
540
+ "weight": 2
541
+ },
542
+ {
543
+ "name": "stackexchange_title_body/cooking.stackexchange.com.jsonl.gz",
544
+ "lines": 23705,
545
+ "weight": 2
546
+ },
547
+ {
548
+ "name": "stackexchange_title_body/photo.stackexchange.com.jsonl.gz",
549
+ "lines": 23753,
550
+ "weight": 2
551
+ },
552
+ {
553
+ "name": "stackexchange_TitleBody_Answer/workplace.stackexchange.com.jsonl.gz",
554
+ "lines": 24012,
555
+ "weight": 2
556
+ },
557
+ {
558
+ "name": "stackexchange_Title_Answer/workplace.stackexchange.com.jsonl.gz",
559
+ "lines": 24012,
560
+ "weight": 2
561
+ },
562
+ {
563
+ "name": "stackexchange_TitleBody_Answer/meta.stackoverflow.com.jsonl.gz",
564
+ "lines": 24044,
565
+ "weight": 2
566
+ },
567
+ {
568
+ "name": "stackexchange_Title_Answer/meta.stackoverflow.com.jsonl.gz",
569
+ "lines": 24044,
570
+ "weight": 2
571
+ },
572
+ {
573
+ "name": "stackexchange_TitleBody_Answer/raspberrypi.stackexchange.com.jsonl.gz",
574
+ "lines": 24143,
575
+ "weight": 2
576
+ },
577
+ {
578
+ "name": "stackexchange_Title_Answer/raspberrypi.stackexchange.com.jsonl.gz",
579
+ "lines": 24143,
580
+ "weight": 2
581
+ },
582
+ {
583
+ "name": "stackexchange_title_body/workplace.stackexchange.com.jsonl.gz",
584
+ "lines": 24189,
585
+ "weight": 2
586
+ },
587
+ {
588
+ "name": "stackexchange_title_body/biology.stackexchange.com.jsonl.gz",
589
+ "lines": 24447,
590
+ "weight": 3
591
+ },
592
+ {
593
+ "name": "stackexchange_TitleBody_Answer/webapps.stackexchange.com.jsonl.gz",
594
+ "lines": 24867,
595
+ "weight": 3
596
+ },
597
+ {
598
+ "name": "stackexchange_Title_Answer/webapps.stackexchange.com.jsonl.gz",
599
+ "lines": 24867,
600
+ "weight": 3
601
+ },
602
+ {
603
+ "name": "stackexchange_title_body/bitcoin.stackexchange.com.jsonl.gz",
604
+ "lines": 25374,
605
+ "weight": 3
606
+ },
607
+ {
608
+ "name": "stackexchange_TitleBody_Answer/judaism.stackexchange.com.jsonl.gz",
609
+ "lines": 26085,
610
+ "weight": 3
611
+ },
612
+ {
613
+ "name": "stackexchange_Title_Answer/judaism.stackexchange.com.jsonl.gz",
614
+ "lines": 26085,
615
+ "weight": 3
616
+ },
617
+ {
618
+ "name": "stackexchange_TitleBody_Answer/ethereum.stackexchange.com.jsonl.gz",
619
+ "lines": 26124,
620
+ "weight": 3
621
+ },
622
+ {
623
+ "name": "stackexchange_Title_Answer/ethereum.stackexchange.com.jsonl.gz",
624
+ "lines": 26124,
625
+ "weight": 3
626
+ },
627
+ {
628
+ "name": "stackexchange_TitleBody_Answer/worldbuilding.stackexchange.com.jsonl.gz",
629
+ "lines": 26210,
630
+ "weight": 3
631
+ },
632
+ {
633
+ "name": "stackexchange_Title_Answer/worldbuilding.stackexchange.com.jsonl.gz",
634
+ "lines": 26210,
635
+ "weight": 3
636
+ },
637
+ {
638
+ "name": "stackexchange_title_body/worldbuilding.stackexchange.com.jsonl.gz",
639
+ "lines": 26763,
640
+ "weight": 3
641
+ },
642
+ {
643
+ "name": "stackexchange_TitleBody_Answer/chemistry.stackexchange.com.jsonl.gz",
644
+ "lines": 27061,
645
+ "weight": 3
646
+ },
647
+ {
648
+ "name": "stackexchange_Title_Answer/chemistry.stackexchange.com.jsonl.gz",
649
+ "lines": 27061,
650
+ "weight": 3
651
+ },
652
+ {
653
+ "name": "stackexchange_title_body/datascience.stackexchange.com.jsonl.gz",
654
+ "lines": 27397,
655
+ "weight": 3
656
+ },
657
+ {
658
+ "name": "stackexchange_TitleBody_Answer/graphicdesign.stackexchange.com.jsonl.gz",
659
+ "lines": 28083,
660
+ "weight": 3
661
+ },
662
+ {
663
+ "name": "stackexchange_Title_Answer/graphicdesign.stackexchange.com.jsonl.gz",
664
+ "lines": 28083,
665
+ "weight": 3
666
+ },
667
+ {
668
+ "name": "stackexchange_TitleBody_Answer/ux.stackexchange.com.jsonl.gz",
669
+ "lines": 28901,
670
+ "weight": 3
671
+ },
672
+ {
673
+ "name": "stackexchange_Title_Answer/ux.stackexchange.com.jsonl.gz",
674
+ "lines": 28901,
675
+ "weight": 3
676
+ },
677
+ {
678
+ "name": "stackexchange_title_body/ux.stackexchange.com.jsonl.gz",
679
+ "lines": 29403,
680
+ "weight": 3
681
+ },
682
+ {
683
+ "name": "stackexchange_TitleBody_Answer/money.stackexchange.com.jsonl.gz",
684
+ "lines": 29404,
685
+ "weight": 3
686
+ },
687
+ {
688
+ "name": "stackexchange_Title_Answer/money.stackexchange.com.jsonl.gz",
689
+ "lines": 29404,
690
+ "weight": 3
691
+ },
692
+ {
693
+ "name": "stackexchange_title_body/webapps.stackexchange.com.jsonl.gz",
694
+ "lines": 29697,
695
+ "weight": 3
696
+ },
697
+ {
698
+ "name": "stackexchange_TitleBody_Answer/cs.stackexchange.com.jsonl.gz",
699
+ "lines": 30010,
700
+ "weight": 3
701
+ },
702
+ {
703
+ "name": "stackexchange_Title_Answer/cs.stackexchange.com.jsonl.gz",
704
+ "lines": 30010,
705
+ "weight": 3
706
+ },
707
+ {
708
+ "name": "stackexchange_title_body/graphicdesign.stackexchange.com.jsonl.gz",
709
+ "lines": 30233,
710
+ "weight": 3
711
+ },
712
+ {
713
+ "name": "stackexchange_TitleBody_Answer/webmasters.stackexchange.com.jsonl.gz",
714
+ "lines": 30370,
715
+ "weight": 3
716
+ },
717
+ {
718
+ "name": "stackexchange_Title_Answer/webmasters.stackexchange.com.jsonl.gz",
719
+ "lines": 30370,
720
+ "weight": 3
721
+ },
722
+ {
723
+ "name": "stackexchange_title_body/raspberrypi.stackexchange.com.jsonl.gz",
724
+ "lines": 30625,
725
+ "weight": 3
726
+ },
727
+ {
728
+ "name": "stackexchange_title_body/money.stackexchange.com.jsonl.gz",
729
+ "lines": 32021,
730
+ "weight": 3
731
+ },
732
+ {
733
+ "name": "stackexchange_title_body/judaism.stackexchange.com.jsonl.gz",
734
+ "lines": 32028,
735
+ "weight": 3
736
+ },
737
+ {
738
+ "name": "stackexchange_TitleBody_Answer/academia.stackexchange.com.jsonl.gz",
739
+ "lines": 32137,
740
+ "weight": 3
741
+ },
742
+ {
743
+ "name": "stackexchange_Title_Answer/academia.stackexchange.com.jsonl.gz",
744
+ "lines": 32137,
745
+ "weight": 3
746
+ },
747
+ {
748
+ "name": "stackexchange_title_body/ethereum.stackexchange.com.jsonl.gz",
749
+ "lines": 32760,
750
+ "weight": 3
751
+ },
752
+ {
753
+ "name": "stackexchange_title_body/academia.stackexchange.com.jsonl.gz",
754
+ "lines": 34331,
755
+ "weight": 3
756
+ },
757
+ {
758
+ "name": "stackexchange_title_body/chemistry.stackexchange.com.jsonl.gz",
759
+ "lines": 34506,
760
+ "weight": 3
761
+ },
762
+ {
763
+ "name": "stackexchange_title_body/webmasters.stackexchange.com.jsonl.gz",
764
+ "lines": 34559,
765
+ "weight": 3
766
+ },
767
+ {
768
+ "name": "stackexchange_title_body/meta.stackoverflow.com.jsonl.gz",
769
+ "lines": 36456,
770
+ "weight": 3
771
+ },
772
+ {
773
+ "name": "stackexchange_TitleBody_Answer/travel.stackexchange.com.jsonl.gz",
774
+ "lines": 36533,
775
+ "weight": 4
776
+ },
777
+ {
778
+ "name": "stackexchange_Title_Answer/travel.stackexchange.com.jsonl.gz",
779
+ "lines": 36533,
780
+ "weight": 4
781
+ },
782
+ {
783
+ "name": "stackexchange_TitleBody_Answer/android.stackexchange.com.jsonl.gz",
784
+ "lines": 38077,
785
+ "weight": 4
786
+ },
787
+ {
788
+ "name": "stackexchange_Title_Answer/android.stackexchange.com.jsonl.gz",
789
+ "lines": 38077,
790
+ "weight": 4
791
+ },
792
+ {
793
+ "name": "stackexchange_title_body/cs.stackexchange.com.jsonl.gz",
794
+ "lines": 38314,
795
+ "weight": 4
796
+ },
797
+ {
798
+ "name": "stackexchange_TitleBody_Answer/gamedev.stackexchange.com.jsonl.gz",
799
+ "lines": 40154,
800
+ "weight": 4
801
+ },
802
+ {
803
+ "name": "stackexchange_Title_Answer/gamedev.stackexchange.com.jsonl.gz",
804
+ "lines": 40154,
805
+ "weight": 4
806
+ },
807
+ {
808
+ "name": "stackexchange_TitleBody_Answer/rpg.stackexchange.com.jsonl.gz",
809
+ "lines": 40435,
810
+ "weight": 4
811
+ },
812
+ {
813
+ "name": "stackexchange_Title_Answer/rpg.stackexchange.com.jsonl.gz",
814
+ "lines": 40435,
815
+ "weight": 4
816
+ },
817
+ {
818
+ "name": "stackexchange_title_body/travel.stackexchange.com.jsonl.gz",
819
+ "lines": 41227,
820
+ "weight": 4
821
+ },
822
+ {
823
+ "name": "stackexchange_TitleBody_Answer/codereview.stackexchange.com.jsonl.gz",
824
+ "lines": 41748,
825
+ "weight": 4
826
+ },
827
+ {
828
+ "name": "stackexchange_Title_Answer/codereview.stackexchange.com.jsonl.gz",
829
+ "lines": 41748,
830
+ "weight": 4
831
+ },
832
+ {
833
+ "name": "stackexchange_title_body/rpg.stackexchange.com.jsonl.gz",
834
+ "lines": 42303,
835
+ "weight": 4
836
+ },
837
+ {
838
+ "name": "stackexchange_title_body/codereview.stackexchange.com.jsonl.gz",
839
+ "lines": 45765,
840
+ "weight": 4
841
+ },
842
+ {
843
+ "name": "stackexchange_title_body/gamedev.stackexchange.com.jsonl.gz",
844
+ "lines": 46485,
845
+ "weight": 4
846
+ },
847
+ {
848
+ "name": "stackexchange_TitleBody_Answer/softwareengineering.stackexchange.com.jsonl.gz",
849
+ "lines": 51326,
850
+ "weight": 5
851
+ },
852
+ {
853
+ "name": "stackexchange_Title_Answer/softwareengineering.stackexchange.com.jsonl.gz",
854
+ "lines": 51326,
855
+ "weight": 5
856
+ },
857
+ {
858
+ "name": "stackexchange_TitleBody_Answer/security.stackexchange.com.jsonl.gz",
859
+ "lines": 51355,
860
+ "weight": 5
861
+ },
862
+ {
863
+ "name": "stackexchange_Title_Answer/security.stackexchange.com.jsonl.gz",
864
+ "lines": 51355,
865
+ "weight": 5
866
+ },
867
+ {
868
+ "name": "stackexchange_title_body/android.stackexchange.com.jsonl.gz",
869
+ "lines": 51608,
870
+ "weight": 5
871
+ },
872
+ {
873
+ "name": "stackexchange_TitleBody_Answer/diy.stackexchange.com.jsonl.gz",
874
+ "lines": 52896,
875
+ "weight": 5
876
+ },
877
+ {
878
+ "name": "stackexchange_Title_Answer/diy.stackexchange.com.jsonl.gz",
879
+ "lines": 52896,
880
+ "weight": 5
881
+ },
882
+ {
883
+ "name": "stackexchange_title_body/softwareengineering.stackexchange.com.jsonl.gz",
884
+ "lines": 53942,
885
+ "weight": 5
886
+ },
887
+ {
888
+ "name": "stackexchange_TitleBody_Answer/blender.stackexchange.com.jsonl.gz",
889
+ "lines": 54153,
890
+ "weight": 5
891
+ },
892
+ {
893
+ "name": "stackexchange_Title_Answer/blender.stackexchange.com.jsonl.gz",
894
+ "lines": 54153,
895
+ "weight": 5
896
+ },
897
+ {
898
+ "name": "stackexchange_TitleBody_Answer/scifi.stackexchange.com.jsonl.gz",
899
+ "lines": 54805,
900
+ "weight": 5