stefan-it committed on
Commit 56900d6
1 Parent(s): 9c1fed4

readme: add initial version

Files changed (1)
  1. README.md +86 -0
README.md ADDED
 
---
language: de

widget:
- text: "Heute ist sehr schönes Wetter in"

license: mit
---

# German GPT-2 model

In this repository we release (yet another) GPT-2 model that was trained on various German texts.

The model is meant to be an entry point for fine-tuning on other texts, and it is definitely not as good or "dangerous" as the English GPT-3 model. We do not plan extensive PR or staged releases for this model 😉

**Note**: The model was initially released under an anonymous alias (`anonymous-german-nlp/german-gpt2`), so we now "de-anonymize" it ;)

More details about GPT-2 can be found in the great [Hugging Face](https://huggingface.co/transformers/model_doc/gpt2.html) documentation.

# Changelog

15.11.2020: Initial release.

# Training corpora

We use pretty much the same corpora as were used for training the DBMDZ BERT model; they can be found in [this repository](https://github.com/dbmdz/berts).

Thanks to the awesome Hugging Face team, it is possible to create a byte-level BPE vocab with their [Tokenizers](https://github.com/huggingface/tokenizers) library. With it we created a 52K byte-level BPE vocab based on the training corpora.

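As a rough sketch of how such a vocab can be built with the Tokenizers library (the corpus file names, output directory, and `min_frequency` value below are illustrative assumptions, not our exact configuration):

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Hypothetical corpus files; the real training corpora are described above.
corpus_files = ["german_corpus_part_1.txt", "german_corpus_part_2.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=52_000,                 # the 52K vocab mentioned above
    min_frequency=2,                   # illustrative choice
    special_tokens=["<|endoftext|>"],  # GPT-2's end-of-text token
)

# Writes vocab.json and merges.txt into the target directory.
os.makedirs("german-gpt2-tokenizer", exist_ok=True)
tokenizer.save_model("german-gpt2-tokenizer")
```
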
After creating the vocab, we trained the GPT-2 model for German on one TPU over the complete training corpus for three epochs.

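The exact training code is not documented here, but a from-scratch GPT-2 language-modeling run with the Transformers `Trainer` could look roughly like the following sketch; the paths and block size are illustrative assumptions, and the actual run used a TPU rather than this plain setup:

```python
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    TextDataset,
    Trainer,
    TrainingArguments,
)

# Load the 52K byte-level BPE vocab created above (hypothetical path).
tokenizer = GPT2TokenizerFast.from_pretrained("german-gpt2-tokenizer")

# A fresh GPT-2 model sized to the new vocab, trained from scratch.
model = GPT2LMHeadModel(GPT2Config(vocab_size=tokenizer.vocab_size))

# Hypothetical single-file corpus; block_size is an illustrative choice.
train_dataset = TextDataset(tokenizer=tokenizer, file_path="corpus.txt", block_size=512)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="german-gpt2", num_train_epochs=3),
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
```
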
# Using the model

The model itself can be used in this way:

```python
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")

model = AutoModelWithLMHead.from_pretrained("dbmdz/german-gpt2")
```

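With the model and tokenizer loaded, text can also be generated directly via `model.generate`; the prompt and sampling parameters below are illustrative, not settings from this repository. (In newer Transformers versions, `AutoModelForCausalLM` replaces the now-deprecated `AutoModelWithLMHead`.)

```python
# Continues from the snippet above; all sampling settings are illustrative.
input_ids = tokenizer.encode("Heute ist sehr schönes Wetter in", return_tensors="pt")

output = model.generate(
    input_ids,
    max_length=50,   # total length of prompt plus generated tokens
    do_sample=True,  # sample instead of greedy decoding
    top_k=50,
    top_p=0.95,
)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
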
However, text generation is a bit more interesting, so here's an example that shows how to use the great Transformers *Pipelines* for generating text:

```python
from transformers import pipeline

pipe = pipeline('text-generation', model="dbmdz/german-gpt2",
                tokenizer="dbmdz/german-gpt2")

text = pipe("Der Sinn des Lebens ist es", max_length=800)[0]["generated_text"]

print(text)
```

This could output this beautiful text:

```
Der Sinn des Lebens ist es, im Geist zu verweilen, aber nicht in der Welt zu sein, sondern ganz im Geist zu leben.
Die Menschen beginnen, sich nicht nach der Natur und nach der Welt zu richten, sondern nach der Seele,'
```

# License

All models are licensed under [MIT](LICENSE).

# Huggingface model hub

All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).

# Contact (Bugs, Feedback, Contribution and more)

For questions about our GPT-2 models, just open an issue
[here](https://github.com/stefan-it/german-gpt/issues/new) 🤗

# Acknowledgments

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Thanks for providing access to the TFRC ❤️

Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
it is possible to download this model from their S3 storage 🤗