---
license: cc-by-sa-4.0
datasets:
- wikipedia
- cc100
language:
- ja
pipeline_tag: text-generation
tags:
- gpt
- japanese
- language model
- reversed gpt-2
inference: false
---

# japanese-reversed-gpt2-medium-unidic

This is a medium-sized Japanese **reversed** GPT-2 model that uses a BERT-like tokenizer. Unlike most language models, it generates sentences from right to left. The non-reversed version is published [here](https://huggingface.co/okazaki-lab/japanese-gpt2-medium-unidic/).

# How to use

The model depends on [PyTorch](https://pytorch.org/), [fugashi](https://github.com/polm/fugashi) with [unidic-lite](https://github.com/polm/unidic-lite), and [Hugging Face Transformers](https://github.com/huggingface/transformers).

```sh
pip install torch torchvision torchaudio
pip install fugashi[unidic-lite]
pip install transformers
```

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-reversed-gpt2-medium-unidic')
model = AutoModelForCausalLM.from_pretrained('okazaki-lab/japanese-reversed-gpt2-medium-unidic')

# Suffix to be completed; the reversed model generates the text that precedes it.
text = 'ので、散歩に行きました。'

bos = tokenizer.convert_tokens_to_ids(['[BOS]'])  # [32768]
# Remove the [CLS] and [SEP] added by the BERT tokenizer, then reverse the token order.
input_ids = bos + tokenizer.encode(text)[1:-1][::-1]
input_ids = torch.tensor(input_ids).unsqueeze(0)

output = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=30,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.0,
    num_return_sequences=1,
    pad_token_id=0,
    eos_token_id=32769,
)[0].flip(0)  # flip back to left-to-right order before decoding

print(tokenizer.decode(output))
```

# Model architecture

Transformer-based language model

- Layers: 24
- Heads: 16
- Dimensions of hidden states: 1024

# Training

We used a [codebase](https://github.com/rinnakk/japanese-pretrained-models) provided by rinna Co., Ltd. for training. The model was trained on Japanese CC-100 and Japanese Wikipedia (as of 2022/01/31) using 8 A100 GPUs for 17 days. The perplexity on the validation set is 9.79.

# Tokenization

Our tokenizer is based on [the one](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) provided by Tohoku NLP Group. Texts are first segmented by MeCab and then split into subwords with WordPiece. The vocabulary size is 32768 original tokens + 2 special tokens + 1 unused token = 32771.

# License

[Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

Copyright (c) 2021, Tohoku University
Copyright (c) 2023, Tokyo Institute of Technology
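
# Appendix: inspecting the tokenizer

As a supplement to the Tokenization section, the sketch below shows how the MeCab + WordPiece tokenizer splits a sentence and how to look up the `[BOS]` id that the usage example hard-codes. It is only an illustrative snippet built from standard `transformers` tokenizer calls, not part of the official usage example; the sample sentence is taken from the example above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-reversed-gpt2-medium-unidic')

# MeCab word segmentation followed by WordPiece subword splitting.
text = '散歩に行きました。'
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)

# Look up the special token used in the generation example instead of hard-coding it.
print(tokenizer.convert_tokens_to_ids(['[BOS]']))  # expected: [32768]

# Total vocabulary size; expected to match the 32771 reported above.
print(len(tokenizer))
```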