---
license: cc-by-sa-4.0
---
# japanese-gpt2-medium-unidic
This is a medium-sized Japanese GPT-2 model that uses a BERT-like tokenizer.

# How to use
The model depends on [PyTorch](https://pytorch.org/), [fugashi](https://github.com/polm/fugashi) with [unidic-lite](https://github.com/polm/unidic-lite), and [Hugging Face Transformers](https://github.com/huggingface/transformers).

```sh
pip install torch torchvision torchaudio
pip install fugashi[unidic-lite]
pip install transformers
```

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# 'original' is the directory of this repository; replace it with the published
# model ID or a local path to your copy of the model files.
tokenizer = AutoTokenizer.from_pretrained('original')
model = AutoModelForCausalLM.from_pretrained('original')

text = '今日はいい天気なので、'  # "The weather is nice today, so ..."

# Prepend [BOS] manually and drop the [CLS]/[SEP] tokens that the
# BERT-style tokenizer adds around the encoded text.
bos = tokenizer.convert_tokens_to_ids(['[BOS]'])  # [32768]
input_ids = bos + tokenizer.encode(text)[1:-1]
input_ids = torch.tensor(input_ids).unsqueeze(0)

output = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=30,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.0,
    num_return_sequences=1,
    pad_token_id=0,
    eos_token_id=32769,
)[0]

print(tokenizer.decode(output))
```

# Model architecture
Transformer-based language model (the hyperparameters can be checked against the model config, as sketched after this list):
- Layers: 24
- Heads: 16
- Dimensions of hidden states: 1024

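The following is a minimal sketch for reading these values off the Hugging Face config; it assumes the model uses the standard `GPT2Config` field names (`n_layer`, `n_head`, `n_embd`) and the same `'original'` path as in the usage example.

```python
from transformers import AutoConfig

# Load only the configuration (no model weights are loaded here).
# 'original' is the local path used above; substitute the model ID if needed.
config = AutoConfig.from_pretrained('original')
print(config.n_layer)  # expected: 24
print(config.n_head)   # expected: 16
print(config.n_embd)   # expected: 1024
```
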
# Training
We used the [codebase](https://github.com/rinnakk/japanese-pretrained-models) provided by rinna Co., Ltd. for training.

The model was trained on Japanese CC-100 and Japanese Wikipedia (2022/01/31).
Training ran for 17 days on 8 A100 GPUs.
The perplexity on the validation set is 9.80.

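Perplexity here is the exponential of the average next-token cross-entropy. The snippet below is only a rough sketch of how such a number can be estimated on a held-out text, not the actual evaluation script; it reuses the `'original'` path and the [BOS] handling from the usage example, and the sample sentence is a stand-in for the validation data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('original')
model = AutoModelForCausalLM.from_pretrained('original')
model.eval()

# Any held-out Japanese text; the validation set itself is not distributed here.
text = '吾輩は猫である。名前はまだ無い。'
bos = tokenizer.convert_tokens_to_ids(['[BOS]'])
input_ids = torch.tensor(bos + tokenizer.encode(text)[1:-1]).unsqueeze(0)

with torch.no_grad():
    # The model shifts the labels internally, so `loss` is the mean
    # next-token cross-entropy over this sequence.
    loss = model(input_ids, labels=input_ids).loss

print(torch.exp(loss))  # perplexity of this single text
```
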
# Tokenization
Our tokenizer is based on [the one](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) provided by the Tohoku NLP Group.
Texts are first tokenized with MeCab and then split into subwords with WordPiece.

The vocabulary size is 32771 (32768 original tokens + 2 special tokens + 1 unused token).

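A small sketch for inspecting the tokenizer, again assuming the `'original'` path from above. The [BOS] ID and the vocabulary size are the ones stated in this card; treating ID 32769 as the end-of-sequence token follows the `eos_token_id` used in the generation example.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('original')

print(len(tokenizer))                               # expected: 32771
print(tokenizer.tokenize('今日はいい天気なので、'))    # MeCab + WordPiece subwords
print(tokenizer.convert_tokens_to_ids(['[BOS]']))   # expected: [32768]
```
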
# License
[Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

Copyright (c) 2021, Tohoku University

Copyright (c) 2023, Tokyo Institute of Technology