This repo contains the training script, model, and logs of a training session of t5-v1_1-base on the unshuffled_deduplicated_nl subset of the OSCAR dataset. The code is a copy of the code in https://huggingface.co/patrickvonplaten/t5-base-norwegian, with the following changes:

  1. "dropout_rate" is set to 0.0
  2. The tokenizer from flax-community/t5-base-dutch is used.
  3. the following code is added after loading the data in run_t5_mlm.py clean_text is the clean function adapted from the tensorflow c4_utils for Dutch bad words. (see https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4_utils.py)
    logger.info("Cleaning datasets")
    logger.info(f"Num examples train = {len(datasets['train'])}")
    logger.info(f"Num examples valid = {len(datasets['validation'])}")
    dataset_v0 = datasets
    def f(obj):
        obj["text"] = clean_text(obj["text"])
        return obj
    
    dataset_v1 = dataset_v0.map(
        f,
        batched=False,
        num_proc=96,
    )
    datasets = dataset_v1.filter(
        lambda obj: obj['text'] is not None,
        num_proc=96,
    )
    logger.info(f"Num examples train = {len(datasets['train'])}")
    logger.info(f"Num examples valid = {len(datasets['validation'])}")

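Changes 1 and 2 amount to roughly the following model setup. This is a sketch based on the standard Flax T5 MLM example; the base model identifier and seed are assumptions, not lines taken from this repo's script.

    from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration, T5Config

    # Change 2: reuse the tokenizer trained for flax-community/t5-base-dutch.
    tokenizer = AutoTokenizer.from_pretrained("flax-community/t5-base-dutch")

    # Change 1: disable dropout for this pre-training run.
    config = T5Config.from_pretrained("google/t5-v1_1-base", dropout_rate=0.0)
    model = FlaxT5ForConditionalGeneration(config, seed=42)
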

Original Dutch rows in OSCAR:

  • Num examples train = 19771542
  • Num examples validation = 1040607

After cleaning (roughly 44% of the rows were dropped):

  • Num examples train = 11128924
  • Num examples validation = 589258

The model trained with batch size 24 up to step 25150, at which point the run crashed with the error shown in the log below.

    Step... (24950 | Loss: 2.456451416015625, Learning Rate: 0.0018013372318819165)
    Step... (25000 | Loss: 2.4048867225646973, Learning Rate: 0.001798725686967373)
    Evaluating ...: 100%|██████████| 2/2 [00:03<00:00,  1.50s/it]
    Step... (25000 | Loss: 2.5623574256896973, Acc: 0.5242598652839661)
    Step... (25050 | Loss: 2.424556255340576, Learning Rate: 0.0017961141420528293)
    Step... (25100 | Loss: 2.4024124145507812, Learning Rate: 0.0017935027135536075)
    Step... (25150 | Loss: 2.4446678161621094, Learning Rate: 0.0017908912850543857)
    Training...:  42%|████▏     | 25191/59439 [12:43:06<17:17:28,  1.82s/it]
    Epoch ... (1/1):   0%|          | 0/1 [12:44:11<?, ?it/s]
    Traceback (most recent call last):
      File "./run_t5_mlm_flax.py", line 750, in <module>
        model_inputs = data_collator(samples)
      File "./run_t5_mlm_flax.py", line 262, in __call__
        batch["input_ids"] = self.filter_input_ids(input_ids, input_ids_sentinel)
      File "./run_t5_mlm_flax.py", line 305, in filter_input_ids
        input_ids = input_ids_full[input_ids_full > 0].reshape((batch_size, -1))
    ValueError: cannot reshape array of size 98111 into shape (192,newaxis)
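
The ValueError shows why the run stopped: filter_input_ids flattens the batch with a boolean mask and then assumes every row kept the same number of positive token ids, but 98111 is not divisible by 192 (it is one short of 192 × 511), so the reshape fails. A minimal numpy illustration of this failure mode; the shapes and values below are made up for the example.

    import numpy as np

    # Illustration only: if one row keeps a different number of positive ids,
    # the masked, flattened array can no longer be reshaped to (batch_size, -1).
    batch_size, seq_len = 4, 6
    input_ids_full = np.ones((batch_size, seq_len), dtype=np.int32)
    input_ids_full[0, -1] = 0  # one row ends up with one fewer positive id

    flat = input_ids_full[input_ids_full > 0]  # size 23, not a multiple of 4
    flat.reshape((batch_size, -1))             # ValueError, as in the traceback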