kkirchheim committed on
Commit
f8b7775
·
verified ·
1 Parent(s): 99068d2

Update README.md

  1. README.md +162 -1
README.md CHANGED
@@ -3,4 +3,165 @@ license: mit
3
  language:
4
  - de
5
  library_name: transformers
6
- ---
3
  language:
4
  - de
5
  library_name: transformers
6
+ ---
7
+ # German GPT-2 Medium (355M Parameters)
8
+
9
+ This model is a German variant of GPT-2 with approximately 355 million parameters and a context window of 2048 tokens.
10
+ It was pre-trained on approximately 300 GB of German text and is intended as a base model for German text generation tasks.
11
+
12
+ - **Model Architecture:** GPT-2 medium architecture adapted for the German language with an extended context window.
13
+ - **Languages Supported:** German
14
+ - **Intended Use Cases:** The model is designed for downstream tasks involving text generation in German, such as language modeling and text completion.
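
As a rough plausibility check on the stated size, the parameter count of a GPT-2 style decoder can be computed from its shape. The sketch below assumes the standard GPT-2 medium layout (24 layers, hidden size 1024, tied input/output embeddings) and, as a placeholder, the original GPT-2 vocabulary size of 50257; the German tokenizer's actual vocabulary size may differ, and `gpt2_param_count` is a hypothetical helper, not part of any library.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_ctx):
    """Count parameters of a GPT-2 style decoder with tied embeddings."""
    token_emb = vocab_size * d_model          # token embedding (shared with LM head)
    pos_emb = n_ctx * d_model                 # learned position embedding
    # Attention: fused QKV projection plus output projection, both with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: expand to 4 * d_model and project back, both with bias.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    layer_norms = 2 * (2 * d_model)           # two LayerNorms (weight + bias) per block
    final_ln = 2 * d_model                    # final LayerNorm
    return token_emb + pos_emb + n_layer * (attn + mlp + layer_norms) + final_ln


# Assumed shape: GPT-2 medium with the 2048-token context stated above.
# vocab_size=50257 is a placeholder taken from the original GPT-2.
estimate = gpt2_param_count(n_layer=24, d_model=1024, vocab_size=50257, n_ctx=2048)
print(f"{estimate / 1e6:.1f}M parameters")
```

Under these assumptions, doubling the context window from 1024 to 2048 tokens only adds about one million position-embedding parameters, so the model stays close to the usual GPT-2 medium size.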
15
+
16
+ ## Model Details
17
+
18
+ - **Version:** 1.0 (Initial and likely final release)
19
+ - **Model Type:** Pre-trained language model
20
+ - **Tokenizer:** Utilizes the tokenizer from [stefan-it/german-gpt2-larger](https://huggingface.co/stefan-it/german-gpt2-larger), with the `pad_token` set to the `eos_token`.
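
The `pad_token` setting mentioned above can be sketched as follows. This is a minimal illustration, not the author's code: `ensure_pad_token` is a hypothetical helper, and the stand-in object with the `<|endoftext|>` string is only a placeholder so the snippet runs without downloading anything.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Reuse the end-of-sequence token for padding when no pad token is set."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in for a tokenizer; the eos_token string here is an assumed placeholder.
demo = SimpleNamespace(pad_token=None, eos_token="<|endoftext|>")
ensure_pad_token(demo)
print(demo.pad_token)
```

The same helper should work on a tokenizer loaded with `AutoTokenizer.from_pretrained`, since Hugging Face tokenizers expose `pad_token` and `eos_token` attributes. Padding with the end-of-sequence token is a common choice for GPT-2 style models, which have no dedicated padding token.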
21
+
22
+
23
+ ## Training Data
24
+
25
+ The model is trained on a large-scale German text corpus derived from Common Crawl data, filtered to include high-quality text such as newspapers and government websites.
26
+
27
+ - **Name:** German Colossal Clean Common Crawl (GC4) Corpus (filtered version)
28
+   - This model is trained on all of the **HEAD** files from the GC4 corpus.
29
+ - **Size:** Approximately 300 GB of text data.
30
+ - **Knowledge Cutoff:** 2020
31
+
32
+
33
+ ## Intended Use
34
+
35
+ - Base model for fine-tuning on German text generation tasks.
36
+ - Research in natural language processing for the German language.
37
+
38
+ **Users:** NLP researchers, developers, and practitioners focusing on German language applications.
39
+
40
+
41
+ ## Limitations
42
+
43
+ - **Biases:** The model may reflect biases present in the training data, including stereotypes or offensive content.
44
+ - **No Content Filtering:** The model lacks guardrails and may generate inappropriate or harmful text.
45
+ - **Outdated Information:** Contains knowledge up to 2020; does not include events after this date.
46
+
47
+
48
+ ## Ethical Considerations and Bias
49
+
50
+ **Disclaimer:**
51
+
52
+ The presented and trained language model is for **research purposes only**. The GC4 corpus used for training contains texts crawled from the internet, so this GPT-2 model should be considered highly biased and may encode stereotypical associations along gender, race, ethnicity, and disability status. Before using the released checkpoints, it is highly recommended to read:
53
+
54
+ - **"On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?"** by Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell.
55
+
56
+ **Potential Risks:**
57
+
58
+ - Generation of biased or offensive content.
59
+ - Misrepresentation of factual information.
60
+ - Propagation of stereotypes.
61
+
62
+ **Mitigation Strategies:**
63
+
64
+ - Implement bias mitigation techniques and content filtering appropriate for your use case.
65
+ - Use the model in controlled settings with human oversight.
66
+ - Perform thorough evaluation before deployment.
67
+
68
+
69
+ ## How to Use
70
+
71
+ ```python
72
+ from transformers import AutoModelForCausalLM, AutoTokenizer
73
+
74
+ model_name = "kkirchheim/german-gpt2-medium"  # Hub identifier; replace with a local path if needed
75
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
76
+ model = AutoModelForCausalLM.from_pretrained(model_name)
77
+ ```
78
+
79
+ **Code Example:**
80
+
81
+ ```python
82
+ from transformers import pipeline
83
+
84
+ model_name = "kkirchheim/german-gpt2-medium" # Replace with your model's identifier
85
+
86
+ generator = pipeline('text-generation', model=model_name)  # the tokenizer is loaded from the same identifier
87
+
88
+ prompt = "Das Leben ist schön, weil"
89
+ outputs = generator(prompt, max_length=50, num_return_sequences=1)
90
+
91
+ print(outputs[0]['generated_text'])
92
+ ```
93
+
94
+ ## Acknowledgments
95
+
96
+ - **Funding:** Chair of Software and Systems Engineering at the Otto-von-Guericke University of Magdeburg.
97
+ - **Contributors:** Konstantin Kirchheim
98
+ - **Third-Party Resources:**
99
+   - Tokenizer and initial model architecture from [stefan-it/german-gpt2-larger](https://huggingface.co/stefan-it/german-gpt2-larger).
100
+
101
+
102
+ ---
103
+
104
+ **Disclaimer:** This model is provided for **research purposes only** and comes with no warranties. The authors are not responsible for any output generated by the model. Users should exercise caution and are responsible for compliance with applicable laws and regulations.
105
+
106
+
107
+ **Changelog:**
108
+
109
+ - **Version 1.0:** Initial release.
110
+
111
+
112
+ **Future Work:**
113
+
114
+ - No planned updates or future releases at this time.
115
+
116
+
117
+ **Notes:**
118
+
119
+ - **Ethical Use:** Users should ensure that the model is used ethically and responsibly, considering potential impacts on individuals and society.
120
+ - **Legal Considerations:** Users should perform their own due diligence regarding the permissible use of the model and its outputs.
121
+
122
+
123
+ **FAQ:**
124
+
125
+ *Q: Can I use this model for commercial purposes?*
126
+
127
+ A: The model is intended for research purposes only. Commercial use is not advised without proper legal consultation.
128
+
129
+ *Q: How do I address potential biases in the model's outputs?*
130
+
131
+ A: Implement bias mitigation strategies and content filtering appropriate for your use case. Always review the model's outputs critically.
132
+
133
+
134
+
135
+ ## Training Procedure
136
+
137
+ **Hyperparameters:**
138
+
139
+ - **Optimizer:** AdamW (Torch implementation)
140
+ - **Learning Rate:** `6e-4`
141
+ - **Batch Size:**
142
+   - **Per Device Train Batch Size:** 12
143
+   - **Gradient Accumulation Steps:** 12
144
+   - **Effective Batch Size:** 144 (12 * 12)
145
+ - **Number of Epochs:** 1 (single pass over the dataset)
146
+ - **Warmup Steps:** 1,000
147
+ - **Weight Decay:** 0.1
148
+ - **Mixed Precision Training:** Enabled (`fp16=True`)
149
+ - **Gradient Checkpointing:** Enabled
150
+ - **Evaluation Strategy:**
151
+   - **Evaluation Steps:** Every 100 steps
152
+   - **Per Device Evaluation Batch Size:** 12
153
+ - **Logging:**
154
+   - **Logging Steps:** Every 10 steps
155
+   - **Logging to:** TensorBoard
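
The hyperparameters above can be collected into a single configuration. The sketch below assumes training used the Hugging Face `Trainer`, which names such as `adamw_torch` suggest but the card does not confirm; the keys follow `transformers.TrainingArguments` field names, and the dictionary itself (`training_config`) is a hypothetical illustration rather than the author's actual script.

```python
# Hyperparameters from the list above, keyed by TrainingArguments field names.
training_config = {
    "optim": "adamw_torch",
    "learning_rate": 6e-4,
    "per_device_train_batch_size": 12,
    "gradient_accumulation_steps": 12,
    "num_train_epochs": 1,
    "warmup_steps": 1000,
    "weight_decay": 0.1,
    "fp16": True,                      # mixed precision; needs a GPU
    "gradient_checkpointing": True,
    "eval_steps": 100,
    "per_device_eval_batch_size": 12,
    "logging_steps": 10,
    "report_to": "tensorboard",
}

for name, value in training_config.items():
    print(f"{name}: {value}")
```

On a machine with a GPU, such a dictionary could be unpacked into `TrainingArguments` together with an output directory of your choice. The name of the argument that enables step-based evaluation has changed between `transformers` releases, so it is left out here.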
156
+
157
+ **Training Hardware:**
158
+
159
+ - **Compute Resources:** 4 NVIDIA A100 GPUs with 40 GB memory each.
160
+ - **Training Duration:** Approximately 1.5 months.
161
+ - **Parallelism:** Distributed Data Parallel (DDP)
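
The batch size figures can be related to the hardware with a little arithmetic. The sketch below assumes that all four GPUs act as DDP workers, each processing the per-device batch, and that every sequence fills the full 2048-token context; neither assumption is stated in the card, and `batch_summary` is a hypothetical helper.

```python
def batch_summary(per_device_batch, grad_accum_steps, num_gpus, context_len):
    """Return sequences and tokens processed per optimizer step."""
    per_device_effective = per_device_batch * grad_accum_steps
    global_batch = per_device_effective * num_gpus
    tokens_per_step = global_batch * context_len
    return per_device_effective, global_batch, tokens_per_step


# Values from the card; num_gpus and full-length sequences are assumptions.
per_device, global_batch, tokens = batch_summary(
    per_device_batch=12, grad_accum_steps=12, num_gpus=4, context_len=2048
)
print(f"per device: {per_device} sequences per optimizer step")
print(f"global:     {global_batch} sequences per optimizer step")
print(f"tokens:     {tokens:,} per optimizer step (upper bound)")
```

Under these assumptions, the effective batch size of 144 listed above describes a single device, and the global batch across four GPUs would be four times larger.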
162
+
163
+ **Additional Training Details:**
164
+
165
+ - **Resume from Checkpoint:** Enabled (`resume_from_checkpoint=True`)
166
+ - **Optimizer Settings:**
167
+   - Used `adamw_torch` optimizer.