lchaloupsky committed
Commit 5c6c2e1
1 Parent(s): ab25b75

Update README.md

Files changed (1)
  1. README.md +121 -2
README.md CHANGED
@@ -6,8 +6,127 @@ datasets:
  ---
 
  # Czech small GPT-2 model trained on the OSCAR dataset
- This model was trained as a part of the [master thesis](https://dspace.cuni.cz/handle/20.500.11956/176356?locale-attribute=en) on the Czech part of the [OSCAR](https://huggingface.co/datasets/oscar) dataset.
- More information will be added later.
+ This model was trained as a part of the [master thesis](https://dspace.cuni.cz/handle/20.500.11956/176356?locale-attribute=en) on the Czech part of the [OSCAR](https://huggingface.co/datasets/oscar) dataset.
+
+ ## Introduction
+ Czech-GPT2-OSCAR (Czech GPT-2 small) is a state-of-the-art language model for Czech based on the GPT-2 small model. Unlike the original GPT-2 small model, this model is trained to predict only 512 tokens instead of 1024, as it serves as a basis for the [Czech-GPT2-Medical](https://huggingface.co/lchaloupsky/czech-gpt2-medical) model.
+
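+ The context window stored in the published configuration can be checked directly. The short snippet below is only an illustration (it is not part of the original usage examples); it uses the standard `AutoConfig` API, and `n_positions` reports the architectural maximum, which may differ from the 512-token training length.
+ ```python
+ from transformers import AutoConfig
+
+ # Download only the configuration of the checkpoint (no weights needed)
+ config = AutoConfig.from_pretrained("lchaloupsky/czech-gpt2-oscar")
+
+ # Maximum sequence length the GPT-2 architecture accepts;
+ # the fine-tuning itself used sequences of at most 512 tokens
+ print(config.n_positions)
+ ```
+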
+ The model was trained on the Czech part of the [OSCAR](https://huggingface.co/datasets/oscar) dataset using transfer learning and fine-tuning techniques in about a week on a single NVIDIA A100 SXM4 40GB GPU, with a total of 21 GB of training data.
+
+ This model was trained as a part of the master thesis as a proof of concept that it is possible to obtain a state-of-the-art Czech language model with smaller resources than the original one and in a significantly shorter time, and mainly as a basis for the [Czech-GPT2-Medical](https://huggingface.co/lchaloupsky/czech-gpt2-medical) model. There was no Czech GPT-2 model available at the time the master thesis began.
+
+ It was fine-tuned from the English pre-trained GPT-2 small using the Hugging Face libraries (Transformers and Tokenizers) wrapped into the fastai v2 deep learning framework. All the fastai v2 fine-tuning techniques were used.
+ The solution is based on the [Faster than training from scratch — Fine-tuning the English GPT-2 in any language with Hugging Face and fastai v2 (practical case with Portuguese)](https://medium.com/@pierre_guillou/faster-than-training-from-scratch-fine-tuning-the-english-gpt-2-in-any-language-with-hugging-f2ec05c98787) article.
+
+ The trained model is now available on Hugging Face under [czech-gpt2-oscar](https://huggingface.co/lchaloupsky/czech-gpt2-oscar/). If you need more information, feel free to ask in the discussions.
+
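+ For a quick try-out, the checkpoint can also be used through the high-level `pipeline` API. This is just a convenience sketch; the detailed PyTorch and TensorFlow examples are given below.
+ ```python
+ from transformers import pipeline
+
+ # Text-generation pipeline built on top of the published checkpoint
+ generator = pipeline("text-generation", model="lchaloupsky/czech-gpt2-oscar")
+
+ # Top-k sampling, mirroring the settings used in the examples below
+ print(generator("Praha je krásné město", max_length=30, do_sample=True, top_k=40)[0]["generated_text"])
+ ```
+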
+ ## Training/Evaluation
+ For more information on training the model or its evaluation, please have a look at the [thesis](https://dspace.cuni.cz/handle/20.500.11956/176356?locale-attribute=en) itself.
+
+ ## GPT-2 Model description
+ *Note: information copied/pasted from [Model: gpt2 >> Model description](https://huggingface.co/gpt2#model-description)*
+
+ GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.
+
+ More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right. The model uses internally a mask-mechanism to make sure the predictions for the token `i` only uses the inputs from `1` to `i` but not the future tokens.
+
+ This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. The model is best at what it was pretrained for however, which is generating texts from a prompt.
+
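+ To make the shifted-by-one objective concrete, the loss reported by the model can be reproduced by hand. The following minimal PyTorch sketch is an illustration only and is not part of the copied model-card text.
+ ```python
+ import torch.nn.functional as F
+ from transformers import GPT2Tokenizer, GPT2LMHeadModel
+
+ tokenizer = GPT2Tokenizer.from_pretrained("lchaloupsky/czech-gpt2-oscar")
+ model = GPT2LMHeadModel.from_pretrained("lchaloupsky/czech-gpt2-oscar")
+
+ ids = tokenizer("Praha je krásné město", return_tensors="pt").input_ids
+ logits = model(ids).logits  # shape: (1, sequence_length, vocab_size)
+
+ # targets are the inputs shifted one position to the right
+ shift_logits = logits[:, :-1, :]
+ shift_labels = ids[:, 1:]
+ loss = F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
+                        shift_labels.reshape(-1))
+ print(loss)  # matches the loss returned by model(ids, labels=ids)
+ ```
+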
+ ## How to use Czech-GPT2-OSCAR with HuggingFace (PyTorch)
+ *The following code uses PyTorch. To use TensorFlow, check the corresponding paragraph below.*
+
+ ### Load Czech-GPT2-OSCAR and its sub-word tokenizer (Byte-level BPE)
+ ```python
+ from transformers import GPT2Tokenizer, GPT2LMHeadModel
+ import torch
+
+ tokenizer = GPT2Tokenizer.from_pretrained("lchaloupsky/czech-gpt2-oscar")
+ model = GPT2LMHeadModel.from_pretrained("lchaloupsky/czech-gpt2-oscar")
+
+ # Set the maximum sequence length to 1024
+ tokenizer.model_max_length = 1024
+ # For older versions of the 'transformers' library use this instead
+ # tokenizer.max_len = 1024
+
+ model.eval() # disable dropout (or leave in train mode to finetune)
+ ```
+
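+ The tokenizer is a byte-level BPE tokenizer, so Czech text (including diacritics) is split into sub-word units without unknown tokens. A small illustration (not part of the original examples), reusing the `tokenizer` loaded above:
+ ```python
+ # inspect how a Czech sentence is split into sub-word tokens
+ tokens = tokenizer.tokenize("Praha je krásné město")
+ ids = tokenizer.convert_tokens_to_ids(tokens)
+ print(tokens)
+ print(ids)
+ print(tokenizer.decode(ids)) # reconstructs the original text
+ ```
+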
+ ### Generate one word
+ ```python
+ # input sequence
+ text = "Praha je krásné město"
+ inputs = tokenizer(text, return_tensors="pt")
+
+ # model output
+ outputs = model(**inputs, labels=inputs["input_ids"])
+ loss, logits = outputs[:2]
+ predicted_index = torch.argmax(logits[0, -1, :]).item()
+ predicted_text = tokenizer.decode([predicted_index])
+
+ # results
+ print('input text:', text)
+ print('predicted text:', predicted_text)
+ ```
+
+ ### Generate one full sequence
+ ```python
+ # input sequence
+ text = "Praha je krásné město"
+ inputs = tokenizer(text, return_tensors="pt") # tokenizer.encode(text, return_tensors="pt") directly for input_ids
+
+ # model output using Top-k sampling text generation method
+ sample_outputs = model.generate(inputs.input_ids,
+                                 pad_token_id=50256,
+                                 do_sample=True,
+                                 max_length=50, # put the token number you want
+                                 top_k=40,
+                                 num_return_sequences=1)
+
+ # generated sequence
+ for i, sample_output in enumerate(sample_outputs):
+     print("{}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist()))) # tokenizer.decode(sample_output, skip_special_tokens=True)
+ ```
+
+ ## How to use Czech-GPT2-OSCAR with HuggingFace (TensorFlow)
+ *The following code uses TensorFlow. To use PyTorch, check the corresponding paragraph above.*
+
+ ### Load Czech-GPT2-OSCAR and its sub-word tokenizer (Byte-level BPE)
+ ```python
+ from transformers import GPT2Tokenizer, TFGPT2LMHeadModel
+ import tensorflow as tf
+
+ tokenizer = GPT2Tokenizer.from_pretrained("lchaloupsky/czech-gpt2-oscar")
+ model = TFGPT2LMHeadModel.from_pretrained("lchaloupsky/czech-gpt2-oscar")
+
+ # Set the maximum sequence length to 1024
+ tokenizer.model_max_length = 1024
+ # For older versions of the 'transformers' library use this instead
+ # tokenizer.max_len = 1024
+
+ # Note: TensorFlow models have no eval() method; dropout is already
+ # disabled at inference time (pass training=True only when fine-tuning)
+ ```
+
+ ### Generate one full sequence
+ ```python
+ # input sequence
+ text = "Praha je krásné město"
+ input_ids = tokenizer.encode(text, return_tensors="tf")
+
+ # model output using Top-k sampling text generation method
+ outputs = model.generate(input_ids, eos_token_id=50256, pad_token_id=50256,
+                          do_sample=True,
+                          max_length=40,
+                          top_k=40)
+ print(tokenizer.decode(outputs[0])) # tokenizer.decode(outputs[0], skip_special_tokens=True)
+ ```
+
+ ## Limitations and bias
+ The training data used for this model come from the Czech part of the OSCAR dataset. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their model card:
+
+ > Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true. Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.
+
+ ## Author
+ Czech GPT-2 OSCAR was trained and evaluated by [Lukáš Chaloupský](https://cz.linkedin.com/in/luk%C3%A1%C5%A1-chaloupsk%C3%BD-0016b8226?original_referer=https%3A%2F%2Fwww.google.com%2F) thanks to the computing power of the GPU cluster (NVIDIA A100 SXM4 40GB) of [IT4I](https://www.it4i.cz/) (VSB - Technical University of Ostrava).
 
  ## Citation
  ```