cakiki committed on
Commit
c25e744
1 Parent(s): fdcc753

Add model card

Files changed (2)
  1. README.md +70 -0
  2. training.md +146 -0
README.md ADDED
@@ -0,0 +1,70 @@
---
language:
- de
thumbnail:
tags:
-
-
-
license:
datasets:
- german-nlp-group/german_common_crawl
metrics:
-
-
---

# GPT-2 GERMAN

## Model description

TODO

## Intended uses & limitations

#### How to use

```python
# You can include sample code which will be formatted
```
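
A minimal sketch of how such a checkpoint is typically loaded with 🤗 Transformers is shown below; the Hub id `<your-username>/gpt2-german` is a hypothetical placeholder, and the snippet assumes PyTorch weights are available in the repository:

```python
from transformers import pipeline

# Hypothetical Hub id -- replace with this model's actual repository name once published.
generator = pipeline("text-generation", model="<your-username>/gpt2-german")

# Generate a short German continuation from a prompt.
print(generator("Der Sinn des Lebens ist", max_length=30, num_return_sequences=1))
```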

#### Limitations and bias

Provide examples of latent issues and potential remediations.

## Training data

The model is trained on the [`german_common_crawl`](https://huggingface.co/datasets/german-nlp-group/german_common_crawl) dataset. A sample record looks like this:

```python
{'url': 'http://my-shop.ru/shop/books/545473.html',
 'date_download': '2016-10-20T19:38:58Z',
 'digest': 'sha1:F62EMGYLZDIKF4UL5JZYU47KWGGUBT7T',
 'length': 1155,
 'nlines': 4,
 'source_domain': 'my-shop.ru',
 'title': 'Grammatikalische Liebeslieder. Methodische Vorschläge',
 'raw_content': 'Grammatikalische Liebeslieder. [....]',
 'cc_segment': 'crawl-data/CC-MAIN-2016-44/segments/1476988717783.68/wet/CC-MAIN-20161020183837-00354-ip-10-171-6-4.ec2.internal.warc.wet.gz',
 'original_nlines': 99,
 'original_length': 2672,
 'language': 'de',
 'language_score': 1.0,
 'perplexity': 283.0,
 'bucket': 'head'}
```
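
For illustration, records like the one above can be loaded with 🤗 Datasets; this is a minimal sketch, and the exact configuration names and splits should be checked on the dataset card:

```python
from datasets import load_dataset

# Sketch only: the dataset may require choosing a specific configuration;
# consult the dataset card for the available configs and splits.
dataset = load_dataset("german-nlp-group/german_common_crawl", split="train")

# Inspect the raw text of the first record.
print(dataset[0]["raw_content"][:200])
```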

## Training procedure

TODO (see `training.md`)

## Eval results

### BibTeX entry and citation info

```bibtex
@inproceedings{...,
  year={2021}
}
```
training.md ADDED
@@ -0,0 +1,146 @@
<!---
Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Language model training examples

The following example showcases how to train a language model from scratch
using the JAX/Flax backend.

JAX/Flax allows you to trace pure functions and compile them into efficient, fused accelerator code on both GPU and TPU.
Models written in JAX/Flax are **immutable** and updated in a purely functional
way, which enables simple and efficient model parallelism.
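
For illustration, a minimal sketch of this idea: `jax.jit` traces a pure function and compiles it with XLA for whichever accelerator is available:

```python
import jax
import jax.numpy as jnp

@jax.jit  # trace the pure function and compile it with XLA
def scaled_dot(x, y):
    return jnp.dot(x, y) / jnp.sqrt(x.shape[-1])

x = jnp.ones((4, 8))
y = jnp.ones((8, 4))
print(scaled_dot(x, y))  # first call compiles; subsequent calls reuse the compiled kernel
```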

## Causal language modeling

In the following, we demonstrate how to train an auto-regressive causal transformer model
in JAX/Flax.
More specifically, we pretrain a randomly initialized 124M-parameter [**`gpt2`**](https://huggingface.co/gpt2) model in Norwegian on a single TPUv3-8 pod.

The example script uses the 🤗 Datasets library. You can easily customize it to your needs if you need extra processing on your datasets.

Let's start by creating a model repository to save the trained model and logs.
Here we call the model `"norwegian-gpt2"`, but you can change the model name as you like.

You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that
you are logged in) or via the command line:

```bash
huggingface-cli repo create norwegian-gpt2
```

Next we clone the model repository to add the tokenizer and model files.

```bash
git clone https://huggingface.co/<your-username>/norwegian-gpt2
```

To ensure that all tensorboard traces will be uploaded correctly, we need to
track them. You can run the following command inside your model repo to do so.

```bash
cd norwegian-gpt2
git lfs track "*tfevents*"
```

Great, we have set up our model repository. During training, we will automatically
push the training logs and model weights to the repo.

Next, let's add a symbolic link to `run_clm_flax.py`.

```bash
export MODEL_DIR="./norwegian-gpt2"
ln -s ~/transformers/examples/flax/language-modeling/run_clm_flax.py run_clm_flax.py
```

### Train tokenizer

In the first step, we train a tokenizer to efficiently process the text input for the model. Similar to how it is shown in [How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train), we use a **`ByteLevelBPETokenizer`**.
The tokenizer is trained on the complete Norwegian dataset of OSCAR
and consequently saved in `${MODEL_DIR}`.
This can take up to 10 minutes depending on your hardware ☕.

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# Load the Norwegian OSCAR split used for tokenizer training
dataset = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")

# Instantiate a byte-level BPE tokenizer
tokenizer = ByteLevelBPETokenizer()

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i: i + batch_size]["text"]

# Train the tokenizer on batches of raw text
tokenizer.train_from_iterator(batch_iterator(), vocab_size=50265, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save the tokenizer to disk so the training script can load it from ${MODEL_DIR}
tokenizer.save(f"{model_dir}/tokenizer.json")
```
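
As a quick sanity check, the saved `tokenizer.json` can be loaded back and applied to a sample sentence (a minimal sketch, reusing `model_dir` from above):

```python
from tokenizers import Tokenizer

# Reload the trained tokenizer from disk and inspect the byte-level BPE tokens.
tok = Tokenizer.from_file(f"{model_dir}/tokenizer.json")
print(tok.encode("Dette er en setning.").tokens)
```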

### Create configuration

Next, we create the model's configuration file. This is as simple
as loading and storing [**`gpt2`**](https://huggingface.co/gpt2)
in the local model folder:

```python
from transformers import GPT2Config

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# Start from the 124M gpt2 configuration with dropout disabled
config = GPT2Config.from_pretrained("gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0)
config.save_pretrained(model_dir)
```

### Train model

Next we can run the example script to pretrain the model:

```bash
./run_clm_flax.py \
    --output_dir="${MODEL_DIR}" \
    --model_type="gpt2" \
    --config_name="${MODEL_DIR}" \
    --tokenizer_name="${MODEL_DIR}" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_no" \
    --do_train --do_eval \
    --block_size="512" \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-3" --warmup_steps="1000" \
    --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
    --overwrite_output_dir \
    --num_train_epochs="20" \
    --push_to_hub
```

Training should converge at a loss and perplexity
of 3.24 and 25.72 respectively after 20 epochs on a single TPUv3-8.
This should take less than 21 hours.
Training statistics can be accessed on [tensorboard.dev](https://tensorboard.dev/experiment/2zEhLwJ0Qp2FAkI3WVH9qA).
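
Assuming training has finished and the weights were saved to or pushed from `${MODEL_DIR}`, a minimal sketch of loading the Flax checkpoint for generation could look like this (the prompt and sampling parameters are purely illustrative):

```python
from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}, or the Hub id the model was pushed to

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = FlaxGPT2LMHeadModel.from_pretrained(model_dir)

# Sample a short Norwegian continuation from a prompt.
inputs = tokenizer("Jeg heter", return_tensors="np")
outputs = model.generate(inputs["input_ids"], max_length=30, do_sample=True, top_k=50)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```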