nreimers committed on
Commit
8c3d6d6
1 Parent(s): 1a75bf9
README.md ADDED
@@ -0,0 +1,57 @@
+ ---
+ language: en
+ datasets:
+ - flax-sentence-embeddings/stackexchange_title_body_jsonl
+ widget:
+ - text: "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."
+
+ license: apache-2.0
+ ---
+
+ # doc2query/stackexchange-title-body-t5-small-v1
+
+ This is a [doc2query](https://arxiv.org/abs/1904.08375) model based on T5 (also known as [docT5query](https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf)).
+
+ It can be used for:
+ - **Document expansion**: Generate 20-40 queries for each paragraph and index the paragraphs together with the generated queries in a standard BM25 index such as Elasticsearch, OpenSearch, or Lucene. The generated queries help close the lexical gap of lexical search, because they contain synonyms. They also re-weight words, giving important words a higher weight even if they appear seldom in a paragraph. In our [BEIR](https://arxiv.org/abs/2104.08663) paper we showed that BM25+docT5query is a powerful search engine. The [BEIR repository](https://github.com/UKPLab/beir) has an example of how to use docT5query with Pyserini.
+ - **Domain Specific Training Data Generation**: It can be used to generate training data for learning an embedding model. On [SBERT.net](https://www.sbert.net/examples/unsupervised_learning/query_generation/README.html) we have an example of how to use the model to generate (query, text) pairs for a given collection of unlabeled texts. These pairs can then be used to train powerful dense embedding models.
+
+ ## Usage
+ ```python
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+ model_name = 'doc2query/stackexchange-title-body-t5-small-v1'
+ tokenizer = T5Tokenizer.from_pretrained(model_name)
+ model = T5ForConditionalGeneration.from_pretrained(model_name)
+
+ text = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."
+
+
+ input_ids = tokenizer.encode(text, max_length=384, truncation=True, return_tensors='pt')
+ outputs = model.generate(
+     input_ids=input_ids,
+     max_length=64,
+     do_sample=True,
+     top_p=0.95,
+     num_return_sequences=5)
+
+ print("Text:")
+ print(text)
+
+ print("\nGenerated Queries:")
+ for i in range(len(outputs)):
+     query = tokenizer.decode(outputs[i], skip_special_tokens=True)
+     print(f'{i + 1}: {query}')
+ ```
+
+ **Note:** Because sampling is enabled (`do_sample=True`), `model.generate()` is non-deterministic: it produces different queries each time you run it.
+
+ ## Training
+ This model was fine-tuned from [google/t5-v1_1-small](https://huggingface.co/google/t5-v1_1-small) for 321k training steps. For the training script, see `train_script.py` in this repository.
+
+ The input text was truncated to 384 word pieces. Output text was generated up to 64 word pieces.
+
+ This model was trained on (title, question_body) pairs from StackExchange.
+
+
+
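The document-expansion use case described in the README can be sketched without loading the model. The README does not say how the generated queries are combined with a paragraph before indexing, so this sketch assumes plain concatenation into one text field. The helper name `expand_document` and the sample queries are hypothetical; the queries are hand-written stand-ins, not actual model output.

```python
def expand_document(paragraph, queries):
    """Append de-duplicated queries to a paragraph (assumed concatenation scheme)."""
    seen = set()
    unique = []
    for query in queries:
        cleaned = query.strip()
        key = cleaned.lower()
        # Skip empty strings and case-insensitive duplicates.
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return " ".join([paragraph.strip()] + unique)


paragraph = "Python is an interpreted, high-level programming language."
# Hypothetical queries standing in for the output of model.generate().
queries = ["what is python", "What is Python", "is python interpreted"]
expanded = expand_document(paragraph, queries)
print(expanded)
```

The expanded string is what would be sent to the BM25 index in place of the raw paragraph. Whether duplicates should be dropped is a design choice: keeping them raises the term frequency of repeated words, which is part of how doc2query re-weights terms.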
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "_name_or_path": "google/t5-v1_1-small",
+   "architectures": [
+     "T5ForConditionalGeneration"
+   ],
+   "d_ff": 1024,
+   "d_kv": 64,
+   "d_model": 512,
+   "decoder_start_token_id": 0,
+   "dropout_rate": 0.1,
+   "eos_token_id": 1,
+   "feed_forward_proj": "gated-gelu",
+   "gradient_checkpointing": false,
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": true,
+   "layer_norm_epsilon": 1e-06,
+   "model_type": "t5",
+   "num_decoder_layers": 8,
+   "num_heads": 6,
+   "num_layers": 8,
+   "output_past": true,
+   "pad_token_id": 0,
+   "relative_attention_num_buckets": 32,
+   "tie_word_embeddings": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.10.2",
+   "use_cache": true,
+   "vocab_size": 32128
+ }
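A quick way to read the size-related fields of this config is to derive a few quantities from them. The sketch below copies four values from `config.json` above into a plain dict (the variable names are illustrative) and computes the attention projection width, which in T5 is `num_heads * d_kv` and need not equal `d_model`, and the size of one embedding matrix.

```python
# Values copied from the config.json shown above.
config = {
    "d_model": 512,
    "d_kv": 64,
    "num_heads": 6,
    "vocab_size": 32128,
}

# Width of the query/key/value projections inside each attention block.
attention_inner_dim = config["num_heads"] * config["d_kv"]

# Number of weights in one (vocab_size x d_model) embedding matrix.
embedding_params = config["vocab_size"] * config["d_model"]

print("attention inner dim:", attention_inner_dim)
print("embedding parameters:", embedding_params)
```

Because `tie_word_embeddings` is `false`, the output projection is stored separately from the input embedding, so a matrix of this size appears twice in the checkpoint.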
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a3550d201ef1a270709f414f0c8120a34b179526a8ffbec815bbaddd8effdc6d
+ size 307934749
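The three lines above are a Git LFS pointer, not the weights themselves. As a minimal sketch, the pointer can be parsed as space-separated key/value lines. The helper name `parse_lfs_pointer` is hypothetical. Dividing the byte count by 4 gives only a rough upper bound on the number of float32 values, since the file also contains serialization overhead.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict, with 'size' as an int."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])
    return fields


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:a3550d201ef1a270709f414f0c8120a34b179526a8ffbec815bbaddd8effdc6d
size 307934749
"""

pointer = parse_lfs_pointer(pointer_text)
size_mib = pointer["size"] / (1024 ** 2)
# Rough upper bound: 4 bytes per float32 value, ignoring file overhead.
approx_float32_values = pointer["size"] // 4

print(f"{size_mib:.1f} MiB, at most about {approx_float32_values:,} float32 values")
```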
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "additional_special_tokens": ["<extra_id_0>", "<extra_id_1>", "<extra_id_2>", "<extra_id_3>", "<extra_id_4>", "<extra_id_5>", "<extra_id_6>", "<extra_id_7>", "<extra_id_8>", "<extra_id_9>", "<extra_id_10>", "<extra_id_11>", "<extra_id_12>", "<extra_id_13>", "<extra_id_14>", "<extra_id_15>", "<extra_id_16>", "<extra_id_17>", "<extra_id_18>", "<extra_id_19>", "<extra_id_20>", "<extra_id_21>", "<extra_id_22>", "<extra_id_23>", "<extra_id_24>", "<extra_id_25>", "<extra_id_26>", "<extra_id_27>", "<extra_id_28>", "<extra_id_29>", "<extra_id_30>", "<extra_id_31>", "<extra_id_32>", "<extra_id_33>", "<extra_id_34>", "<extra_id_35>", "<extra_id_36>", "<extra_id_37>", "<extra_id_38>", "<extra_id_39>", "<extra_id_40>", "<extra_id_41>", "<extra_id_42>", "<extra_id_43>", "<extra_id_44>", "<extra_id_45>", "<extra_id_46>", "<extra_id_47>", "<extra_id_48>", "<extra_id_49>", "<extra_id_50>", "<extra_id_51>", "<extra_id_52>", "<extra_id_53>", "<extra_id_54>", "<extra_id_55>", "<extra_id_56>", "<extra_id_57>", "<extra_id_58>", "<extra_id_59>", "<extra_id_60>", "<extra_id_61>", "<extra_id_62>", "<extra_id_63>", "<extra_id_64>", "<extra_id_65>", "<extra_id_66>", "<extra_id_67>", "<extra_id_68>", "<extra_id_69>", "<extra_id_70>", "<extra_id_71>", "<extra_id_72>", "<extra_id_73>", "<extra_id_74>", "<extra_id_75>", "<extra_id_76>", "<extra_id_77>", "<extra_id_78>", "<extra_id_79>", "<extra_id_80>", "<extra_id_81>", "<extra_id_82>", "<extra_id_83>", "<extra_id_84>", "<extra_id_85>", "<extra_id_86>", "<extra_id_87>", "<extra_id_88>", "<extra_id_89>", "<extra_id_90>", "<extra_id_91>", "<extra_id_92>", "<extra_id_93>", "<extra_id_94>", "<extra_id_95>", "<extra_id_96>", "<extra_id_97>", "<extra_id_98>", "<extra_id_99>"]}
spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d60acb128cf7b7f2536e8f38a5b18a05535c9e14c7a355904270e15b0945ea86
+ size 791656
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "extra_ids": 100, "additional_special_tokens": ["<extra_id_0>", "<extra_id_1>", "<extra_id_2>", "<extra_id_3>", "<extra_id_4>", "<extra_id_5>", "<extra_id_6>", "<extra_id_7>", "<extra_id_8>", "<extra_id_9>", "<extra_id_10>", "<extra_id_11>", "<extra_id_12>", "<extra_id_13>", "<extra_id_14>", "<extra_id_15>", "<extra_id_16>", "<extra_id_17>", "<extra_id_18>", "<extra_id_19>", "<extra_id_20>", "<extra_id_21>", "<extra_id_22>", "<extra_id_23>", "<extra_id_24>", "<extra_id_25>", "<extra_id_26>", "<extra_id_27>", "<extra_id_28>", "<extra_id_29>", "<extra_id_30>", "<extra_id_31>", "<extra_id_32>", "<extra_id_33>", "<extra_id_34>", "<extra_id_35>", "<extra_id_36>", "<extra_id_37>", "<extra_id_38>", "<extra_id_39>", "<extra_id_40>", "<extra_id_41>", "<extra_id_42>", "<extra_id_43>", "<extra_id_44>", "<extra_id_45>", "<extra_id_46>", "<extra_id_47>", "<extra_id_48>", "<extra_id_49>", "<extra_id_50>", "<extra_id_51>", "<extra_id_52>", "<extra_id_53>", "<extra_id_54>", "<extra_id_55>", "<extra_id_56>", "<extra_id_57>", "<extra_id_58>", "<extra_id_59>", "<extra_id_60>", "<extra_id_61>", "<extra_id_62>", "<extra_id_63>", "<extra_id_64>", "<extra_id_65>", "<extra_id_66>", "<extra_id_67>", "<extra_id_68>", "<extra_id_69>", "<extra_id_70>", "<extra_id_71>", "<extra_id_72>", "<extra_id_73>", "<extra_id_74>", "<extra_id_75>", "<extra_id_76>", "<extra_id_77>", "<extra_id_78>", "<extra_id_79>", "<extra_id_80>", "<extra_id_81>", "<extra_id_82>", "<extra_id_83>", "<extra_id_84>", "<extra_id_85>", "<extra_id_86>", "<extra_id_87>", "<extra_id_88>", "<extra_id_89>", "<extra_id_90>", "<extra_id_91>", "<extra_id_92>", "<extra_id_93>", "<extra_id_94>", "<extra_id_95>", "<extra_id_96>", "<extra_id_97>", "<extra_id_98>", "<extra_id_99>"], "model_max_length": 512, "name_or_path": "google/t5-v1_1-small", "special_tokens_map_file": 
"/root/.cache/huggingface/transformers/3ad6f8335c1b1ef8966245899d47dcf735abd134d21fd7d26f621fe45ac01184.c94798918c92ded6aeef2d2f0e666d2cc4145eca1aa6e1336fde07f2e13e2f46", "sp_model_kwargs": {}, "tokenizer_class": "T5Tokenizer"}
train_script.py ADDED
@@ -0,0 +1,210 @@
+ import argparse
+ import logging
+ from torch.utils.data import Dataset, IterableDataset
+ import gzip
+ import json
+ from transformers import Seq2SeqTrainer, AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainingArguments
+ import sys
+ from datetime import datetime
+ import torch
+ import random
+ from shutil import copyfile
+ import os
+ import wandb
+ import re
+
+
+ logging.basicConfig(
+     format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
+     datefmt="%Y-%m-%d %H:%M:%S",
+     handlers=[logging.StreamHandler(sys.stdout)],
+ )
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--model_name", default="google/t5-v1_1-base")
+ parser.add_argument("--train_files", required=True, nargs='+', default=[])
+ parser.add_argument("--epochs", default=1, type=int)
+ parser.add_argument("--batch_size", default=32, type=int)
+ parser.add_argument("--max_source_length", default=320, type=int)
+ parser.add_argument("--max_target_length", default=64, type=int)
+ parser.add_argument("--name", required=True)
+ parser.add_argument("--train_size", default=10*1000*1000, type=int)
+ parser.add_argument("--eval_size", default=10000, type=int)
+ parser.add_argument("--fp16", default=False, action='store_true')
+ args = parser.parse_args()
+
+ wandb.init(project="doc2query", name=f"{args.name}-{args.model_name}")
+
+
+
+
+ class PairDataset:
+     def __init__(self, filepath):
+         self.filepath = filepath
+         self.examples = []
+
+     def __iter__(self):
+         print("open", self.filepath)
+         with gzip.open(self.filepath, 'rt') as fIn:
+             for line in fIn:
+                 example = self.get_example(json.loads(line))
+                 if example is not None:
+                     self.examples.append(example)
+                     yield example
+         # After the first pass over the file, loop forever over the cached examples.
+         while True:
+             random.shuffle(self.examples)
+             for ex in self.examples:
+                 yield ex
+
+
+     def get_example(self, raw_example):
+         return [raw_example[0], raw_example[1]]
+
+
+ class RedditTitleDataset(PairDataset):
+     def get_example(self, raw_example):
+         return [self.clean_title(raw_example['title']), raw_example['body']]
+
+
+     def clean_title(self, text):
+         text = text.replace("&amp;", "&").strip()
+         if text.startswith("["):
+             text = re.sub(r"^\[[a-zA-Z0-9]+\]", "", text).strip()
+
+         if text.endswith("]"):
+             text = re.sub(r"\[[a-zA-Z0-9\.]+\]$", "", text).strip()
+
+         if text.startswith("/r"):
+             text = re.sub(r"^/[a-zA-Z0-9/]+[;,: \-]+", "", text).strip()
+
+         return text
+
+
+ class StackExchangeTitleBodyDataset(PairDataset):
+     def get_example(self, raw_example):
+         return raw_example['texts']
+
+
+ class MultiDataset(IterableDataset):
+     def __init__(self, filepaths, num_samples):
+         self.num_samples = num_samples
+         self.datasets = []
+         self.data_iterators = []
+
+         for filepath in filepaths:
+             if 'reddit_title_text' in filepath:
+                 dataset = RedditTitleDataset(filepath)
+             elif 'stackexchange_archive/jsonl' in filepath:  # elif, so a Reddit match is not overwritten by the else branch
+                 dataset = StackExchangeTitleBodyDataset(filepath)
+             else:
+                 dataset = PairDataset(filepath)
+             self.datasets.append(dataset)
+             self.data_iterators.append(iter(dataset))
+
+     def __len__(self):
+         return self.num_samples
+
+     def __iter__(self):
+         while True:
+             for dataset in self.data_iterators:
+                 yield next(dataset)
+
+             random.shuffle(self.data_iterators)
+
+     def delete_examples_cache(self):
+         for dataset in self.datasets:
+             dataset.examples = []
+
+
+
+ def main():
+     ############ Model
+     model = AutoModelForSeq2SeqLM.from_pretrained(args.model_name)
+     tokenizer = AutoTokenizer.from_pretrained(args.model_name)
+
+     save_steps = 1000
+
+     output_dir = 'output/'+args.name+'-'+args.model_name.replace("/", "-")+'-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
+     print("Output dir:", output_dir)
+
+     # Copy this script into the output directory and append the command line used
+     os.makedirs(output_dir, exist_ok=True)
+
+     train_script_path = os.path.join(output_dir, 'train_script.py')
+     copyfile(__file__, train_script_path)
+     with open(train_script_path, 'a') as fOut:
+         fOut.write("\n\n# Script was called via:\n#python " + " ".join(sys.argv))
+
+     ####
+
+     training_args = Seq2SeqTrainingArguments(
+         output_dir=output_dir,
+         fp16=args.fp16,
+         fp16_backend="amp",
+         per_device_train_batch_size=args.batch_size,
+         evaluation_strategy="steps",
+         save_steps=save_steps,
+         logging_steps=100,
+         eval_steps=save_steps, #logging_steps,
+         warmup_steps=1000,
+         save_total_limit=1,
+         num_train_epochs=args.epochs,
+         report_to="wandb",
+     )
+
+     ############ Arguments
+
+     ############ Load datasets
+
+
+     train_dataset = MultiDataset(args.train_files, args.train_size)
+     train_dataset_iter = iter(train_dataset)
+     eval_dataset = [next(train_dataset_iter) for _ in range(args.eval_size)]
+     train_dataset.delete_examples_cache()   # Make sure dev data is not re-used for training
+     print("Target:", eval_dataset[0][0])
+     print("Input:", eval_dataset[0][1])
+
+     print("Train dataset len:", len(train_dataset))
+
+
+     def data_collator(examples):
+         targets = [row[0] for row in examples]
+         inputs = [row[1] for row in examples]
+         label_pad_token_id = -100
+
+         model_inputs = tokenizer(inputs, max_length=args.max_source_length, padding=True, truncation=True, return_tensors='pt', pad_to_multiple_of=8 if training_args.fp16 else None)
+
+         # Setup the tokenizer for targets
+         with tokenizer.as_target_tokenizer():
+             labels = tokenizer(targets, max_length=args.max_target_length, padding=True, truncation=True, pad_to_multiple_of=8 if training_args.fp16 else None)
+
+         # replace all tokenizer.pad_token_id in the labels by -100 to ignore padding in the loss.
+         labels["input_ids"] = [
+             [(l if l != tokenizer.pad_token_id else label_pad_token_id) for l in label] for label in labels["input_ids"]
+         ]
+
+
+         model_inputs["labels"] = torch.tensor(labels["input_ids"])
+         return model_inputs
+
+     ## Define the trainer
+     trainer = Seq2SeqTrainer(
+         model=model,
+         args=training_args,
+         train_dataset=train_dataset,
+         eval_dataset=eval_dataset,
+         tokenizer=tokenizer,
+         data_collator=data_collator
+     )
+
+     ### Train and save the model
+     train_result = trainer.train()
+     trainer.save_model()
+
+
+ if __name__ == "__main__":
+     main()
+
+ # Script was called via:
+ #python train_hf_trainer.py --model_name google/t5-v1_1-small --train_files /home/stackexchange_archive/jsonl/academia.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/android.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/anime.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/apple.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/arduino.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/askubuntu.com.jsonl.gz /home/stackexchange_archive/jsonl/astronomy.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/aviation.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/bicycles.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/biology.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/bitcoin.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/blender.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/boardgames.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/chemistry.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/christianity.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/civicrm.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/codereview.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/cooking.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/craftcms.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/crypto.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/cs.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/cstheory.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/datascience.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/dba.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/diy.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/drupal.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/dsp.stackexchange.com.jsonl.gz 
/home/stackexchange_archive/jsonl/economics.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/electronics.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/ell.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/emacs.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/engineering.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/english.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/ethereum.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/expressionengine.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/french.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/gamedev.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/gaming.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/gardening.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/german.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/gis.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/graphicdesign.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/hinduism.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/history.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/islam.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/japanese.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/judaism.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/law.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/magento.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/math.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/mathematica.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/mathoverflow.net.jsonl.gz /home/stackexchange_archive/jsonl/mechanics.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/meta.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/meta.stackoverflow.com.jsonl.gz /home/stackexchange_archive/jsonl/money.stackexchange.com.jsonl.gz 
/home/stackexchange_archive/jsonl/movies.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/music.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/networkengineering.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/philosophy.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/photo.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/physics.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/politics.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/puzzling.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/quant.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/raspberrypi.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/rpg.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/rus.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/salesforce.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/scifi.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/security.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/serverfault.com.jsonl.gz /home/stackexchange_archive/jsonl/sharepoint.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/skeptics.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/small_stackexchanges.jsonl.gz /home/stackexchange_archive/jsonl/softwareengineering.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/softwarerecs.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/space.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/stackoverflow.com-Posts.jsonl.gz /home/stackexchange_archive/jsonl/stats.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/superuser.com.jsonl.gz /home/stackexchange_archive/jsonl/tex.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/travel.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/unix.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/ux.stackexchange.com.jsonl.gz 
/home/stackexchange_archive/jsonl/vi.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/webapps.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/webmasters.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/wordpress.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/workplace.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/worldbuilding.stackexchange.com.jsonl.gz /home/stackexchange_archive/jsonl/writers.stackexchange.com.jsonl.gz --name stackexchange_title_text_all --train_size 100000000 --max_source_length 384
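The label handling in `data_collator` above can be illustrated in isolation. This is a minimal sketch in plain Python, with no tokenizer or model involved. The function name `mask_padding` and the token ids are made up for illustration. Only `pad_token_id = 0` comes from `config.json`, and `-100` is the value the script uses so that padded positions are ignored by the loss.

```python
def mask_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding ids in each label sequence with the ignore index."""
    return [
        [(tok if tok != pad_token_id else ignore_index) for tok in seq]
        for seq in label_ids
    ]


# Hypothetical padded target batch; 0 is the pad id, 1 the end-of-sequence id.
batch = [
    [71, 822, 1, 0, 0],
    [363, 19, 8, 1, 0],
]
masked = mask_padding(batch, pad_token_id=0)
print(masked)
```

The end-of-sequence token is left untouched, so the model is still trained to stop generating. Only positions that exist purely to make the batch rectangular are masked.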
trainer_state.json ADDED
The diff for this file is too large to render. See raw diff
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7135db800ae55b7878c857de6c5c8d1a3edb3c6a8da5abad8c1b20f943ac54ed
+ size 2927