language:
  - dutch
tags:
  - seq2seq
  - text-generation
datasets:
  - mc4

t5-base-dutch

This model was created during the Flax/JAX Community Week, organized by Hugging Face, with TPU usage sponsored by Google. Want to give it a try? Head over to Hugging Face Spaces here.

See also the fine-tuned t5-base-dutch-demo model, which is based on this model.

Dataset

This model was trained on a cleaned version of the Dutch part of mC4 (multilingual C4). See the clean directory for the cleaning script. The cleaning applies the following filters:

  • Documents that contain words from a selection of the Dutch and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words are removed
  • Sentences with fewer than 3 words are removed
  • Sentences containing a word of more than 1000 characters are removed
  • Documents with fewer than 5 sentences are removed
  • Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
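
The filters above can be sketched roughly as follows. This is a minimal illustration, not the actual script from the clean directory: the function names, the naive sentence splitter, the case-insensitive matching, the placeholder bad-word set, and the order in which the filters run (sentence filters first, then the sentence count) are all assumptions.

```python
import re

# Hypothetical stand-in for the bad-words selection; the real list is not reproduced here.
BAD_WORDS = {"badword"}

# Phrases quoted from the list above.
BANNED_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)

MIN_WORDS_PER_SENTENCE = 3
MAX_WORD_LENGTH = 1000
MIN_SENTENCES_PER_DOC = 5


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean_document(text, bad_words=BAD_WORDS):
    """Return the cleaned text, or None if the whole document is dropped."""
    lowered = text.lower()
    # Drop documents that contain any banned phrase (case-insensitive, an assumption).
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return None
    # Drop documents that contain any word from the bad-words selection.
    if set(re.findall(r"\w+", lowered)) & set(bad_words):
        return None
    kept = []
    for sentence in split_sentences(text):
        words = sentence.split()
        if len(words) < MIN_WORDS_PER_SENTENCE:
            continue  # sentence too short
        if any(len(word) > MAX_WORD_LENGTH for word in words):
            continue  # sentence contains an implausibly long "word"
        kept.append(sentence)
    # Drop documents left with too few sentences.
    if len(kept) < MIN_SENTENCES_PER_DOC:
        return None
    return " ".join(kept)
```

With this sketch, `clean_document` returns `None` for a document that should be discarded and the remaining sentences joined by spaces otherwise.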

Training

The model was trained for 63,000 steps with a batch size of 128, ending with an evaluation loss of 1.79 and an accuracy of 0.64.
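
As a rough sense of scale, the step count and batch size give the number of training sequences processed. The calculation below assumes every step consumed one full batch of 128 sequences with no gradient accumulation, which the text above does not state explicitly.

```python
steps = 63_000
batch_size = 128  # sequences per step (assumed to be the global batch size)

# Total sequences processed over the whole run under the assumption above.
sequences_seen = steps * batch_size
print(f"{sequences_seen:,}")  # 8,064,000
```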