Yeb Havinga committed on
Commit
af023ba
1 Parent(s): d71958a
Files changed (1)
  1. index.html +18 -5
index.html CHANGED
@@ -16,7 +16,9 @@
 <p>TL;DR, Look below for <a href="#model-list">the list of pre-trained Dutch and Dutch+English models</a>.</p>

 <p md-src-pos="28..495"><span md-src-pos="28..64">A few months ago, I was given access to Google's TPU Research Cloud (TRC). My goal was to train several Dutch and Dutch+English T5 models, limited to model sizes that can run on a single GPU.
- The T5 model architecture is a text Seq2Seq encoder/decoder model architecture. Since it encodes all inputs and outputs as text, it can be fine-tuned on a wide range of tasks.</span></p>
+ T5 is a text-to-text transfer transformer, a neural network model with
+ natural language text as input and output.
+ It can be fine-tuned on a wide range of tasks.</span></p>
  <ul md-src-pos="497..2062">
  <li md-src-pos="497..751"><strong md-src-pos="499..624"><a target="_blank" href="https://arxiv.org/abs/1910.10683.pdf" md-src-pos="501..622">Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer</a></strong> by <em md-src-pos="628..750">Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu</em>.</li>
  <li md-src-pos="752..1482"><strong md-src-pos="754..859"><a target="_blank" href="https://arxiv.org/abs/2110.08207" md-src-pos="756..857">Multitask Prompted Training Enables Zero-Shot Task Generalization</a></strong> by <em md-src-pos="863..1481">Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf, Alexander M. Rush</em>.</li>
@@ -29,11 +31,19 @@ The T5 model architecture is a text Seq2Seq encoder/decoder model architecture.
  <li md-src-pos="2203..2305"><a target="_blank" href="https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104" md-src-pos="2205..2305">https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104</a></li>
  <li md-src-pos="2306..2407"><a target="_blank" href="https://github.com/huggingface/transformers/tree/main/examples/research_projects/jax-projects#talks" md-src-pos="2308..2407">https://github.com/huggingface/transformers/tree/main/examples/research_projects/jax-projects#talks</a></li>
 </ul>
-
+ <p>
+ This project is a continuation of the work I performed together with
+ Dat Nguyen during the <a target="_blank" href="https://github.com/huggingface/transformers/tree/main/examples/research_projects/jax-projects#talks">Flax/JAX Community
+ Week</a> to create a T5 model pre-trained from scratch on Dutch.
  <h2 md-src-pos="18893..18908">Pre-training</h2>
  <h3 md-src-pos="18910..18925">mC4 dataset</h3>
 <p>
- A few weeks before the <a target="_blank" href="https://github.com/huggingface/transformers/tree/main/examples/research_projects/jax-projects#talks">Flax/JAX Community Week</a> started, the multilingual C4 (mC4) TensorFlow dataset was prepared and <a target="_blank" href="https://huggingface.co/datasets/allenai/c4">released</a> by AllenNLP. This dataset was created by the original T5 authors and is composed of text files in many languages. We cleaned Dutch mC4 with <a target="_blank" href="https://gitlab.com/yhavinga/c4nlpreproc">code adapted</a> from the C4 TensorFlow dataset, and used the resulting text files in the pre-training scripts. We also verified that Dutch C4 was deduplicated.</p>
+ The <a target="_blank"
+ href="https://huggingface.co/datasets/allenai/c4">multilingual C4 (mC4)
+ dataset</a> was created by the original T5 authors.
+ It was prepared and released by AllenNLP
+ on the Huggingface Dataset hub.
+ Our team cleaned Dutch mC4 with <a target="_blank" href="https://gitlab.com/yhavinga/c4nlpreproc">code adapted</a> from the C4 TensorFlow dataset, and used the resulting text files in the pre-training scripts. We also verified that Dutch C4 was deduplicated.</p>
  <p>
 To be able to easily reuse this dataset for more pre-training sessions with Huggingface's scripts, a Huggingface dataset was created: <a target="_blank" href="https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned" md-src-pos="19449..19522">mc4_nl_cleaned</a>. For Dutch and English training, a couple of additional configs were added to the generation script. These configs produce interleaved Dutch and English texts with a 1:1 ratio. For instance, the <a target="_blank" href="https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned/viewer/micro_en_nl/train" md-src-pos="19690..19792">micro_en_nl config</a> mixes Dutch with English samples.
  The cleaned English C4 dataset is about 5 times larger (in compressed bytes) than the Dutch part. 1:1 interleaving with Dutch discards about 80% of English C4.
@@ -41,11 +51,14 @@ A few weeks before the <a target="_blank" href="https://github.com/huggingface/t
  </p>
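 <p>A minimal sketch of how one of these interleaved configs could be loaded with the Hugging Face <code>datasets</code> library (assuming the library is installed; <code>micro_en_nl</code> is the config linked above):</p>
 <pre><code>from datasets import load_dataset

# Load the micro Dutch+English config of the cleaned mC4 dataset from the hub.
# The other *_en_nl configs produced by the generation script load the same way.
dataset = load_dataset("yhavinga/mc4_nl_cleaned", "micro_en_nl", split="train")

# Each example has a "text" field; Dutch and English documents are
# interleaved with a 1:1 ratio.
for example in dataset.select(range(4)):
    print(example["text"][:120])
</code></pre>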
 
  <h3 md-src-pos="20163..20243">Unsupervised Training Objective</h3>
- <p md-src-pos="2409..2753"><span md-src-pos="2409..2463">The Dutch and Dutch+English T5 models are pre-trained using the masked language modeling (MLM) objective.
+ <p md-src-pos="2409..2753"><span md-src-pos="2409..2463">The Dutch and Dutch+English T5 models are pre-trained
+ with the masked language modeling (MLM) "span corruption" objective.
  During pre-training, 15% of the tokens are masked and each span of masked tokens is replaced by a sentinel token.</span>
  </p>
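 <p>As a hypothetical illustration (the sentence and the span choices are made up), this is roughly what a span-corruption input/target pair looks like with T5 sentinel tokens:</p>
 <pre><code># Toy illustration of the T5 span-corruption objective, not actual training code.
original = "De kat zat op de mat in de zon"

# Suppose the spans "zat op" and "in de" are selected for masking.
# Each masked span is replaced by a unique sentinel token in the encoder input;
# the decoder target lists each sentinel followed by the tokens it replaced,
# terminated by a final sentinel.
corrupted_input = "De kat <extra_id_0> de mat <extra_id_1> zon"
target = "<extra_id_0> zat op <extra_id_1> in de <extra_id_2>"
</code></pre>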
  <h3 md-src-pos="20163..20243">Why are some models trained for multiple epochs on a smaller config?</h3>
- <p>When I was using an old version of the Flax mlm pretraining script, I noticed that the per-batch training speed seemed slower at the beginning of epochs when a larger dataset config was used. Also, on large configs, batch shuffling would fail with a TPU out-of-memory error. For these reasons, I started experimenting with training for more epochs on smaller configs.
+ <p>When I was using an old version of the <a target="_blank"
+ href="https://github.com/huggingface/transformers/blob/7e44226fc75aa1e5f8928c6445f1979343ea782f/examples/flax/language-modeling/run_t5_mlm_flax.py">Flax
+ T5 MLM pretraining script</a>, I noticed that the per-batch training speed seemed slower at the beginning of epochs when a larger dataset config was used. Also, on large configs, batch shuffling would fail with a TPU out-of-memory error. For these reasons, I started experimenting with training for more epochs on smaller configs.
  </p>
 <p><span md-src-pos="20616..20634">This repetition should not hurt downstream performance.</span> <span md-src-pos="20635..20717">In the original T5 paper, downstream performance after training on 2</span><sup><span md-src-pos="20722..20724">35</span></sup> <span md-src-pos="20731..20749">tokens was compared with training for</span> <span md-src-pos="20750..20784">multiple epochs on a smaller subset.</span> <span md-src-pos="20785..20800">64 repeats of 2</span><sup><span md-src-pos="20805..20807">29</span></sup> <span md-src-pos="20814..20871">tokens (the same 2<sup>35</sup> tokens in total) did not result in degraded downstream performance.</span> <span md-src-pos="20872..20881">The model</span> <code md-src-pos="20882..20925">yhavinga/t5-v1_1-base-dutch-english-cased</code> <span md-src-pos="20926..20943">is trained on the</span> <code md-src-pos="20944..20951">small</code> <span md-src-pos="20952..20973">config for 10 epochs.</span> </p>
  <p><span>