---
datasets: 
  - wikipedia
language: 
  - lt
license: apache-2.0
tags: 
  - "text-generation"
widget:
  - text: "Lietuva yra viena "
---
## Model description

![LT](LT.png)

A Lithuanian GPT-2 model trained on a Lithuanian Wikipedia corpus, based on the GPT-2 small architecture.

This is only the first version of the model; over time it will be improved with a more extensive dataset and better data preparation.

## Training data
This model was pre-trained on 180 MB of Lithuanian Wikipedia text. The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE).
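
To see how the byte-level BPE splits Lithuanian text, a minimal inspection sketch (the example sentence is illustrative, not drawn from the corpus) could look like this:

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeividasM/gpt2_lithuanian_small")

# Inspect how a Lithuanian sentence is split into byte-level BPE subwords
sentence = "Lietuva yra viena"
print(tokenizer.tokenize(sentence))   # subword pieces
print(tokenizer.encode(sentence))     # corresponding token ids
```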

## Training
The model was trained on the Wikipedia corpus for 40 hours using an NVIDIA Tesla P100 GPU.

## How to use

### Load model

```
from transformers import AutoTokenizer, TFAutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("DeividasM/gpt2_lithuanian_small")
model = TFAutoModelWithLMHead.from_pretrained("DeividasM/gpt2_lithuanian_small")

# Set the maximum sequence length to 1024 tokens
tokenizer.model_max_length = 1024
```
### Generate text

```
text = "tekstas "
inputs = tokenizer.encode(text, return_tensors="tf")

# Sample up to 40 tokens with top-k sampling
outputs = model.generate(inputs,
                         eos_token_id=50256,
                         pad_token_id=50256,
                         do_sample=True,
                         max_length=40,
                         top_k=40)

print(tokenizer.decode(outputs[0]))
```
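
Alternatively, the model can be used through the high-level `text-generation` pipeline. This is a minimal sketch, assuming the TensorFlow weights loaded above; the sampling parameters mirror the explicit `generate()` call.

```
from transformers import pipeline

# Build a text-generation pipeline around the published checkpoint
# (framework="tf" assumes the TensorFlow weights used above).
generator = pipeline(
    "text-generation",
    model="DeividasM/gpt2_lithuanian_small",
    framework="tf",
)

# Same sampling settings as the explicit generate() call above
result = generator("tekstas ", max_length=40, do_sample=True, top_k=40)
print(result[0]["generated_text"])
```
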
## Limitations and bias
The training data used for this model comes from Lithuanian Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their model card:


"Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true. Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans > unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes."


## Author

Lithuanian GPT-2 small was trained and evaluated by [Deividas Mataciunas](https://www.linkedin.com/in/deividasmataciunas/).