yhavinga committed
Commit: 95795c3
Parent: dd3b617

Update README

Files changed (2):
  1. README.md +7 -4
  2. run_gpt.sh +3 -2
README.md CHANGED
@@ -14,10 +14,11 @@ datasets:
  ---
  # GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱
 
- Dataset:
+ Datasets:
 
- * [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned)
- * dataset config: full (33B tokens)
+ * [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned), dataset config: full (33B tokens)
+ * A recreation of the TBC but for the Dutch language (see e.g.
+   https://github.com/sgraaf/Replicate-Toronto-BookCorpus)
 
  Tokenizer:
 
@@ -26,12 +27,14 @@ Tokenizer:
 
  Training details:
 
- * Trained for 280k steps (30 dec 2021)
+ * Trained for 320k steps (30 dec 2021)
  * Block size: 512
  * Optimizer: adam, lr 8e-4, beta1 0.9, beta2 0.98
  * Warmup steps: 5000
  * Weight decay: 0.01
 
+ Further fine-tuned on a Dutch book corpus.
+
  Work in progress. Dec 2021-Jan2022
 
  * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
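For reference, here is a minimal optax sketch of the optimizer described in the README's training details (adam, lr 8e-4, beta1 0.9, beta2 0.98, 5000 warmup steps, weight decay 0.01). The use of `optax.adamw` and the linear-decay shape after warmup are assumptions about how run_clm_flax.py wires these values together, not taken from this commit; note also that run_gpt.sh passes a different learning rate (0.0024).

```python
import optax

warmup_steps = 5_000        # "Warmup steps: 5000"
total_steps = 320_000       # "Trained for 320k steps"
peak_lr = 8e-4              # README lists lr 8e-4; run_gpt.sh passes 0.0024

# Linear warmup to the peak rate, then linear decay to zero over the
# remaining steps (the decay shape is an assumption, not stated in the README).
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                              transition_steps=warmup_steps),
        optax.linear_schedule(init_value=peak_lr, end_value=0.0,
                              transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

optimizer = optax.adamw(
    learning_rate=schedule,
    b1=0.9,                 # "beta1 0.9"
    b2=0.98,                # "beta2 0.98"
    weight_decay=0.01,      # "Weight decay: 0.01"
)
```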
run_gpt.sh CHANGED
@@ -15,6 +15,7 @@ python run_clm_flax.py \
  --output_dir="${MODEL_PATH}" \
  --model_type="gpt2" \
  --config_name="${MODEL_PATH}" \
+ --model_name_or_path="${MODEL_PATH}" \
  --tokenizer_name="${MODEL_PATH}" \
  --preprocessing_num_workers="96" \
  --do_train --do_eval \
@@ -26,9 +27,9 @@ python run_clm_flax.py \
  --learning_rate="0.0024" --warmup_steps="5000" \
  --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
  --overwrite_output_dir \
- --num_train_epochs="1" \
+ --num_train_epochs="4" \
  --logging_steps="500" \
- --save_steps="40000" \
+ --save_steps="10001" \
  --eval_steps="2500"
 
  # \
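The added `--model_name_or_path="${MODEL_PATH}"` makes the script continue from the weights already in the output directory rather than initializing from config alone. As a usage note, a small sketch of loading the checkpoint that run_gpt.sh writes to `${MODEL_PATH}` with the Flax GPT-2 classes from transformers; the local path and the Dutch prompt are placeholders, not values taken from this repository.

```python
from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

model_path = "./gpt2-medium-dutch"  # placeholder for whatever ${MODEL_PATH} points to

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = FlaxGPT2LMHeadModel.from_pretrained(model_path)

# Encode a short Dutch prompt and sample a continuation.
inputs = tokenizer("Het weer is vandaag", return_tensors="np")
outputs = model.generate(
    inputs["input_ids"],
    max_length=30,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```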