yhavinga committed on
Commit 1586c50
Parent: 29fb4ca

Create README.md

Files changed (1): README.md (+32, -0)
README.md ADDED
@@ -0,0 +1,32 @@
---
language:
- dutch
tags:
- seq2seq
- text-generation
datasets:
- mc4
---

# t5-base-dutch

> This model was created during the
[Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109), organized by [HuggingFace](https://huggingface.co/), with TPU usage sponsored by Google.
Want to give it a try? Head over to Hugging Face Spaces [here](https://huggingface.co/spaces/flax-community/netherformer).

See also the fine-tuned [t5-base-dutch-demo](https://huggingface.co/flax-community/t5-base-dutch-demo) model, which is based on this model.

## Dataset

This model was trained on a cleaned version of the Dutch part of the multilingual C4 (mC4) dataset.
See the `clean` directory for the cleaning script; an illustrative sketch of the filters below follows the list.

* Documents containing words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
* Sentences with fewer than 3 words are removed
* Sentences with a word of more than 1000 characters are removed
* Documents with fewer than 5 sentences are removed
* Documents containing "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
"use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed
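
The sketch below only illustrates the filters above in plain Python; it is not the actual `clean` script. The function name `clean_document`, the placeholder word list, the naive sentence splitter, and the order of the checks are all assumptions.

```python
import re
from typing import Optional

# Illustrative placeholders: the real script loads the Dutch and English
# LDNOOBW word lists and may apply the filters in a different order.
BAD_WORDS = {"voorbeeldwoord", "exampleword"}
UNWANTED_PHRASES = [
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
]


def clean_document(text: str) -> Optional[str]:
    """Return a cleaned document, or None if the document should be dropped."""
    lowered = text.lower()

    # Drop documents containing bad words or boilerplate phrases.
    if BAD_WORDS & set(lowered.split()):
        return None
    if any(phrase in lowered for phrase in UNWANTED_PHRASES):
        return None

    # Keep only sentences with at least 3 words and no word over 1000 characters.
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = sentence.split()
        if len(words) < 3 or any(len(w) > 1000 for w in words):
            continue
        kept.append(sentence)

    # Drop documents that end up with fewer than 5 sentences.
    if len(kept) < 5:
        return None
    return " ".join(kept)
```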

## Training

The model was trained for 63,000 steps with a batch size of 128, ending with an evaluation loss of 1.79 and an evaluation accuracy of 0.64.
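
For reference, a minimal sketch of loading the checkpoint with the `transformers` library is shown below. The repository id `flax-community/t5-base-dutch` and the example prompt are assumptions; since the checkpoint is only pretrained with the span-corruption objective, it is mainly intended as a starting point for fine-tuning, as was done for t5-base-dutch-demo.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Assumed repository id; adjust if the checkpoint is published elsewhere.
model_name = "flax-community/t5-base-dutch"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# If the repository only contains Flax weights, add from_flax=True here.
model = T5ForConditionalGeneration.from_pretrained(model_name)

# The raw pretrained model has only seen the span-corruption objective,
# so this generation is a sanity check rather than a useful downstream output.
inputs = tokenizer("Dit is een voorbeeldzin in het Nederlands.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```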