---
license: cc-by-sa-4.0
datasets:
- wikipedia
- cc100
language:
- ja
pipeline_tag: text-generation
tags:
- gpt
- japanese
- language model
widget:
- text: 今日はいい天気なので、
---
# japanese-gpt2-medium-unidic
This is a medium-sized Japanese GPT-2 model that uses a BERT-like tokenizer.

A reversed version of this model is published [here](https://huggingface.co/okazaki-lab/japanese-reversed-gpt2-medium-unidic/).

# How to use
The model depends on [PyTorch](https://pytorch.org/), [fugashi](https://github.com/polm/fugashi) with [unidic-lite](https://github.com/polm/unidic-lite), and [Hugging Face Transformers](https://github.com/huggingface/transformers).

```sh
pip install torch torchvision torchaudio
pip install fugashi[unidic-lite]
pip install transformers
```

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')
model = AutoModelForCausalLM.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')

text = '今日はいい天気なので、'

bos = tokenizer.convert_tokens_to_ids(['[BOS]'])  # [32768]
input_ids = bos + tokenizer.encode(text)[1:-1]  # drop the [CLS] and [SEP] tokens added by the BERT-style tokenizer
input_ids = torch.tensor(input_ids).unsqueeze(0)
output = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=30,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.0,
    num_return_sequences=1,
    pad_token_id=0,
    eos_token_id=32769,
)[0]

print(tokenizer.decode(output))
```

# Model architecture
A Transformer-based language model with:
- Layers: 24
- Attention heads: 16
- Hidden state dimension: 1024

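These hyperparameters can be read back from the downloaded checkpoint. A minimal check, assuming the standard Hugging Face GPT-2 config attribute names (`n_layer`, `n_head`, `n_embd`):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')
config = model.config

print(config.n_layer)  # 24 layers
print(config.n_head)   # 16 attention heads
print(config.n_embd)   # 1024-dimensional hidden states
```
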
# Training
We used a [codebase](https://github.com/rinnakk/japanese-pretrained-models) provided by rinna Co., Ltd. for training.

The model was trained on Japanese CC-100 and Japanese Wikipedia (2022/01/31).
We employed 8 A100 GPUs for 17 days.
The perplexity on the validation set is 9.80.

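Because the model is a standard causal language model, a per-text perplexity can be computed from the language-modeling loss. A rough sketch, reusing the [BOS] handling from the generation example above (the 9.80 figure refers to the held-out validation set, so numbers on ad-hoc text will differ; the sample sentence below is arbitrary):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')
model = AutoModelForCausalLM.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')
model.eval()

text = '今日はいい天気なので、散歩に出かけました。'  # arbitrary sample sentence
bos = tokenizer.convert_tokens_to_ids(['[BOS]'])
input_ids = torch.tensor(bos + tokenizer.encode(text)[1:-1]).unsqueeze(0)

with torch.no_grad():
    # With labels=input_ids the model returns the mean next-token
    # cross-entropy; exp(loss) is the perplexity of this sample.
    loss = model(input_ids, labels=input_ids).loss

print(torch.exp(loss).item())
```
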
# Tokenization
Our tokenizer is based on [the one](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) provided by Tohoku NLP Group.
Texts are first segmented by MeCab and then split into subwords with WordPiece.

The vocabulary size is 32771 (32768 original tokens + 2 special tokens + 1 unused token).

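A small inspection sketch; the [BOS] id (32768) matches the generation example above, and the total vocabulary size should match the count just given:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')

print(len(tokenizer))                               # 32771 tokens in total
print(tokenizer.tokenize('今日はいい天気なので、'))   # MeCab segmentation followed by WordPiece subwords
print(tokenizer.convert_tokens_to_ids(['[BOS]']))   # [32768]
```
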
# License
[Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

Copyright (c) 2021, Tohoku University

Copyright (c) 2023, Tokyo Institute of Technology