---
language:
- nl
- en
- multilingual
license: apache-2.0
tags:
- dutch
- english
- t5
- t5x
- ul2
- seq2seq
- translation
datasets:
- yhavinga/mc4_nl_cleaned
- yhavinga/nedd_wiki_news
pipeline_tag: translation
widget:
- text: >-
    Redistricting and West Virginia’s shrinking population forced the state’s
    Republican Legislature to pit Mr. McKinley, a six-term Republican with a
    pragmatic bent, against Mr. Mooney, who has served four terms marked more
    by conservative rhetoric than legislative achievements.
- text: >-
    It is a painful and tragic spectacle that rises before me: I have drawn
    back the curtain from the rottenness of man. This word, in my mouth, is at
    least free from one suspicion: that it involves a moral accusation against
    humanity.
- text: >-
    Young Wehling was hunched in his chair, his head in his hand. He was so
    rumpled, so still and colorless as to be virtually invisible. His
    camouflage was perfect, since the waiting room had a disorderly and
    demoralized air, too. Chairs and ashtrays had been moved away from the
    walls. The floor was paved with spattered dropcloths.
---

# ul2-large-en-nl for English to Dutch translation

A T5 model fine-tuned for English to Dutch translation. The underlying model was pretrained on Dutch using a UL2 (Mixture-of-Denoisers) objective.
The T5 model was introduced in
[this paper](https://arxiv.org/abs/1910.10683)
and first released at [this page](https://github.com/google-research/text-to-text-transfer-transformer).
The UL2 objective was introduced in
[this paper](https://arxiv.org/abs/2205.05131)
and first released at [this page](https://github.com/google-research/google-research/tree/master/ul2).

## Model description

T5 is an encoder-decoder model that treats all NLP problems in a text-to-text format.

`ul2-large-en-nl` T5 is a transformers model fine-tuned on parallel sentence and paragraph pairs
sampled from books.

During pretraining, this model used the [T5 v1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) improvements over the original T5 model:
- GEGLU activation in the feed-forward hidden layer, rather than ReLU; see [here](https://arxiv.org/abs/2002.05202)
- Dropout was turned off during pre-training. Dropout should be re-enabled during fine-tuning
- Pre-trained on a self-supervised objective only, without mixing in the downstream tasks
- No parameter sharing between the embedding and classifier layers

### UL2 pretraining objective

This model was pretrained with UL2's Mixture-of-Denoisers (MoD) objective, which combines diverse pre-training
paradigms. UL2 frames different objective functions for training language models as denoising tasks, where
the model has to recover missing sub-sequences of a given input. During pre-training it uses a mixture-of-denoisers
that samples from a varied set of such objectives, each with a different configuration. UL2 is trained using a mixture of
three denoising tasks:

1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
2. X-denoising (or extreme span corruption); and
3. S-denoising (or sequential PrefixLM).

During pre-training, the denoising tasks are sampled according to user-specified ratios.
UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with a specific pre-training
denoising task. During pre-training, a paradigm token is inserted into the input
(`[NLU]` for R-denoising, `[NLG]` for X-denoising, or `[S2S]` for S-denoising) to indicate the denoising task at hand.
During fine-tuning, the same token should be inserted to get the best performance on the corresponding downstream task.

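As a rough illustration of mode switching, the sketch below prepends a paradigm token to an input string. The function name `add_paradigm_token`, the single-letter keys, and the choice to place the token at the very start followed by a space are assumptions made for this example; the text above only states that a token is inserted into the input. Whether this fine-tuned translation model expects such a token is not stated here.

```python
# Paradigm tokens as listed in the text above. The single-letter keys are
# illustrative names chosen for this sketch.
PARADIGM_TOKENS = {
    "R": "[NLU]",  # regular span corruption
    "X": "[NLG]",  # extreme span corruption
    "S": "[S2S]",  # sequential PrefixLM
}


def add_paradigm_token(text: str, denoiser: str) -> str:
    """Prepend the paradigm token for the given denoiser to the input text.

    Assumption: the token goes first, separated from the text by one space.
    """
    if denoiser not in PARADIGM_TOKENS:
        raise ValueError(f"unknown denoiser {denoiser!r}; expected one of R, X, S")
    return f"{PARADIGM_TOKENS[denoiser]} {text}"


print(add_paradigm_token("Hallo wereld", "S"))
```

Keeping the mapping in one dictionary makes it easy to use the same token during fine-tuning that was used for the matching denoiser during pre-training.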
## Intended uses & limitations

This model was fine-tuned on parallel sentence and paragraph pairs and can be used
for machine translation.

### How to use

Here is how to use this model in PyTorch:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model_name = "yhavinga/ul2-large-en-nl"

# Use the first GPU when available, otherwise fall back to the CPU.
device_num = 0 if torch.cuda.is_available() else -1
device = "cpu" if device_num < 0 else f"cuda:{device_num}"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# Beam search settings; max_length matches the fine-tuning sequence length.
params = {"max_length": 370, "num_beams": 4, "early_stopping": True}
translator = pipeline("translation", tokenizer=tokenizer, model=model, device=device_num)

text = (
    "Young Wehling was hunched in his chair, his head in his hand. "
    "He was so rumpled, so still and colorless as to be virtually invisible."
)
print(translator(text, **params)[0]["translation_text"])
```

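The example above caps generation at `max_length=370` tokens. For longer documents, one possible approach is to split the text into groups of sentences and translate each group separately. The helper below is a minimal sketch of that idea, not part of this model or of the `transformers` library: `chunk_sentences` and its `count_tokens` parameter are hypothetical names, the sentence splitter is a simple regular expression, and the default counter only counts whitespace-separated words, which underestimates the number of SentencePiece tokens.

```python
import re
from typing import Callable, List


def chunk_sentences(
    text: str,
    max_tokens: int = 370,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack consecutive sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens is kept as its own chunk.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_sentences("A b c. D e f. G h i.", max_tokens=6))
```

For a budget that reflects the real tokenizer, a counter such as `lambda s: len(tokenizer(s).input_ids)` could be passed as `count_tokens`, using the `tokenizer` object from the example above.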
### Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral.
Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.

## Training data

The `ul2-large-en-nl` T5 model was pre-trained on a combination of several datasets:
the `full` config of the "mc4_nl_cleaned" dataset (a cleaned version of Common Crawl's web
crawl corpus), Dutch books, the Dutch subset of Wikipedia (2022-03-20), and a subset of "mc4_nl_cleaned"
containing only texts from Dutch and Belgian newspapers. This last dataset is oversampled to bias the model
towards descriptions of events in the Netherlands and Belgium.

After pre-training, the model was
fine-tuned on a translation dataset containing 13 million sentence and paragraph pairs
sampled from books.

## Training procedure

### Preprocessing

The `ul2-large-en-nl` T5 model uses a SentencePiece unigram tokenizer with a vocabulary of 32,000 tokens.
The tokenizer includes the special tokens `<pad>`, `</s>`, `<unk>`, known from the original T5 paper,
`[NLU]`, `[NLG]` and `[S2S]` for the MoD pre-training, and `<n>` for newline.
During pre-training with the UL2 objective, input and output sequences consist of 512 consecutive tokens.
The tokenizer does not lowercase texts and is therefore case-sensitive; it distinguishes
between `dutch` and `Dutch`.
Additionally, 100+28 extra tokens were added for pre-training tasks, resulting in a total of 32,128 tokens.

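The vocabulary arithmetic above can be checked in a few lines. This is only a worked calculation using the numbers quoted in the text; the constant and function names are made up for the example.

```python
# Numbers quoted in the preprocessing description above.
BASE_VOCAB_SIZE = 32_000            # SentencePiece unigram vocabulary
EXTRA_PRETRAINING_TOKENS = 100 + 28  # extra tokens added for pre-training tasks


def total_vocab_size(
    base: int = BASE_VOCAB_SIZE, extra: int = EXTRA_PRETRAINING_TOKENS
) -> int:
    """Return the size of the full vocabulary, including the extra tokens."""
    return base + extra


print(total_vocab_size())  # 32128
```

The result matches the `vocab_size` of 32128 reported for the models in this series.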
### Fine-tuning

This model was fine-tuned on a dataset containing 13M sentence and paragraph translation pairs sampled from books.

* Pre-trained model used as starting point: yhavinga/ul2-large-dutch
* Number of fine-tuning steps: 77600
* Batch size: 512 (gradient accumulation steps: 16)
* Sequence length: 370 tokens
* Model dtype: bfloat16
* z_loss: 0.0001
* Optimizer: adamw_hf (beta1: 0.9, beta2: 0.9969, eps: 1e-08)
* Dropout rate: 0.01
* Learning rate: 0.0009 with linear decay to 0 and warmup for 500 steps
* Label smoothing factor: 0.11
* BLEU score: 45.1

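The learning rate settings listed above can be sketched as a simple function of the training step. This is an illustration under assumptions, not the training code: it assumes the warmup rises linearly from 0 to the peak rate over the first 500 steps, and that the linear decay reaches 0 exactly at the final step (77600). The function name `learning_rate` is chosen for this example.

```python
def learning_rate(
    step: int,
    peak_lr: float = 0.0009,
    warmup_steps: int = 500,
    total_steps: int = 77_600,
) -> float:
    """Linear warmup to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        # Assumed: warmup starts at 0 and rises linearly.
        return peak_lr * step / warmup_steps
    # Linear decay over the remaining steps; clamp at 0 past the end.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 39_050, 77_600):
    print(step, learning_rate(step))
```

With these assumptions the rate is half the peak at step 250 during warmup, and again half the peak at step 39050, midway through the decay phase.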
### Model list

Models in this series:

|                      | ul2-base-en-nl   | ul2-base-nl36-en-nl   | ul2-large-en-nl   |
|:---------------------|:-----------------|:----------------------|:------------------|
| model_type           | t5               | t5                    | t5                |
| _pipeline_tag        | translation      | translation           | translation       |
| d_model              | 768              | 768                   | 1024              |
| d_ff                 | 2048             | 3072                  | 2816              |
| num_heads            | 12               | 12                    | 16                |
| d_kv                 | 64               | 64                    | 64                |
| num_layers           | 12               | 36                    | 24                |
| num_decoder_layers   | 12               | 36                    | 24                |
| feed_forward_proj    | gated-silu       | gated-silu            | gated-silu        |
| dense_act_fn         | silu             | silu                  | silu              |
| vocab_size           | 32128            | 32128                 | 32128             |
| tie_word_embeddings  | 0                | 0                     | 0                 |
| torch_dtype          | float32          | float32               | float32           |
| _gin_batch_size      | 128              | 64                    | 64                |
| _gin_z_loss          | 0.0001           | 0.0001                | 0.0001            |
| _gin_t5_config_dtype | 'bfloat16'       | 'bfloat16'            | 'bfloat16'        |

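To relate the configuration values in the table to model size, the sketch below estimates a parameter count from them. It is a back-of-the-envelope calculation under assumptions, not an official figure: it assumes the usual T5 v1.1 layout (bias-free projections, a gated feed-forward block with three weight matrices, one input embedding shared by encoder and decoder, and a separate output layer because `tie_word_embeddings` is 0), and it ignores layer norms and relative position biases. The names `T5Config` and `estimate_parameters` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class T5Config:
    """Subset of the configuration values from the model list table."""
    d_model: int
    d_ff: int
    num_heads: int
    d_kv: int
    num_layers: int
    num_decoder_layers: int
    vocab_size: int = 32128


def estimate_parameters(cfg: T5Config) -> int:
    """Rough weight count, ignoring layer norms and relative position biases."""
    inner = cfg.num_heads * cfg.d_kv
    attention = 4 * cfg.d_model * inner        # q, k, v and output projections
    feed_forward = 3 * cfg.d_model * cfg.d_ff  # two gated input projections + output
    encoder = cfg.num_layers * (attention + feed_forward)
    # Each decoder layer has self-attention and cross-attention.
    decoder = cfg.num_decoder_layers * (2 * attention + feed_forward)
    # Untied input embedding and output layer.
    embeddings = 2 * cfg.vocab_size * cfg.d_model
    return encoder + decoder + embeddings


configs = {
    "ul2-base-en-nl": T5Config(768, 2048, 12, 64, 12, 12),
    "ul2-base-nl36-en-nl": T5Config(768, 3072, 12, 64, 36, 36),
    "ul2-large-en-nl": T5Config(1024, 2816, 16, 64, 24, 24),
}
for name, cfg in configs.items():
    print(f"{name}: ~{estimate_parameters(cfg) / 1e6:.0f}M parameters")
```

Under these assumptions the estimate for `ul2-large-en-nl` comes out at about 783 million weights; the actual checkpoint size may differ slightly because of the terms that are left out.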
## Evaluation results

See the evaluation section in the interactive [Pre-training Dutch T5 Models](https://huggingface.co/spaces/yhavinga/pre-training-dutch-t5-models) blog.

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/).
Thanks to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for releasing their code for the UL2 objective and associated task definitions.
Thanks to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for helping me get started with the t5x framework.

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)