rmihaylov committed 5db5a5d (parent: a0a9adf): Create README.md

Files changed (1): README.md ADDED (+85, -0)

---
inference: false
language:
- bg
license: mit
datasets:
- oscar
- chitanka
- wikipedia
tags:
- torch
---

# GPT-2

Pretrained model on the Bulgarian language using a causal language modeling (CLM) objective. It was introduced in
[this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
and first released at [this page](https://openai.com/blog/better-language-models/).

## Model description

This is the **MEDIUM** version.

The training data is Bulgarian text from [OSCAR](https://oscar-corpus.com/post/oscar-2019/), [Chitanka](https://chitanka.info/) and [Wikipedia](https://bg.wikipedia.org/).

## Intended uses & limitations

You can use the raw model for:
- text generation
- auto-complete
- spelling correction

Or fine-tune it on a downstream task.

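As a rough illustration, continued causal-LM training on your own Bulgarian text with the standard `Trainer` API might look like the sketch below. Everything beyond the model id is our own assumption (loading through `AutoModelForCausalLM`, the placeholder corpus file, the block size and the hyperparameters); nothing here is prescribed by this repository.

```python
# Hypothetical fine-tuning sketch, not taken from this repository.
# Assumes the checkpoint loads as a causal-LM-head model via
# AutoModelForCausalLM; corpus file and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, default_data_collator)

model_id = "rmihaylov/gpt2-medium-bg"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Plain-text corpus, one passage per line (placeholder file name).
raw = load_dataset("text", data_files={"train": "my_bulgarian_corpus.txt"})
block_size = 512

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(batch):
    # Concatenate all token ids, then cut into fixed-size blocks so no
    # padding is needed; for CLM the labels are the inputs themselves.
    ids = sum(batch["input_ids"], [])
    total = (len(ids) // block_size) * block_size
    blocks = [ids[i:i + block_size] for i in range(0, total, block_size)]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
lm_data = tokenized.map(
    group_texts, batched=True,
    remove_columns=tokenized["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-medium-bg-finetuned",
        per_device_train_batch_size=2,
        num_train_epochs=1),
    train_dataset=lm_data["train"],
    data_collator=default_data_collator,
)
trainer.train()
```

Concatenating the corpus and cutting it into fixed-size blocks (the same idea as the `run_clm` example script in Transformers) sidesteps padding entirely, so the tokenizer's padding configuration does not matter here.
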
### How to use

Here is how to use this model in PyTorch:

```python
>>> from transformers import AutoModel, AutoTokenizer
>>>
>>> model_id = "rmihaylov/gpt2-medium-bg"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
>>>
>>> input_ids = tokenizer.encode(
...     "Здравей,",
...     add_special_tokens=False,
...     return_tensors='pt')
>>>
>>> output_ids = model.generate(
...     input_ids,
...     do_sample=True,
...     max_length=50,
...     top_p=0.92,
...     pad_token_id=2,
...     top_k=0)
>>>
>>> output = tokenizer.decode(output_ids[0])
>>>
>>> # Strip the tokenizer's special tokens and restore spaces and newlines.
>>> output = output.replace('<|endoftext|>', '\n\n\n')
>>> output = output.replace('<|unknown|>', '')
>>> output = output.replace('▁', ' ')
>>> output = output.replace('<|n|>', '\n')
>>>
>>> print(output)

Здравей, господин Фиш. — Добс забеляза как пребледня Ривера.
— Не си тръгвайте още. Имам да ви задам няколко въпроса.
— Благодаря, благодаря. — Фиш не изчака да му покаже, че е забелязал жеста й
```

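If you generate repeatedly, it can be convenient to wrap the snippet above into a small helper that reuses the already-loaded `model` and `tokenizer`. The function name and default arguments below are our own choices; the calls and the token cleanup are exactly the ones shown above.

```python
# Convenience wrapper around the generation snippet above; the function
# name and defaults are our own choices, not part of the model card.
def generate_bg(prompt, model, tokenizer, max_length=50, top_p=0.92):
    input_ids = tokenizer.encode(
        prompt, add_special_tokens=False, return_tensors='pt')
    output_ids = model.generate(
        input_ids,
        do_sample=True,
        max_length=max_length,
        top_p=top_p,
        pad_token_id=2,
        top_k=0)
    text = tokenizer.decode(output_ids[0])
    # Same cleanup as above: strip special tokens, restore spaces and newlines.
    for old, new in [('<|endoftext|>', '\n\n\n'), ('<|unknown|>', ''),
                     ('▁', ' '), ('<|n|>', '\n')]:
        text = text.replace(old, new)
    return text

print(generate_bg("Здравей,", model, tokenizer))
```
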
### Limitations and bias

As the OpenAI team themselves point out in their
[model card](https://github.com/openai/gpt-2/blob/master/model_card.md#out-of-scope-use-cases):

> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases
> that require the generated text to be true.
>
> Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do
> not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a
> study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race,
> and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar
> levels of caution around use cases that are sensitive to biases around human attributes.