murawaki's picture
Update README.md
126748d
metadata
language: ja
license: cc-by-sa-4.0
library_name: transformers
tags:
  - gpt2
datasets:
  - wikipedia
  - cc100
  - oscar
widget:
  - text: 昨日私は京都で

Model Card for Japanese character-level GPT-2 Small

Model description

This is a Japanese character-level GPT-2 Small (90M parameters) language model pre-trained on Japanese Wikipedia, the Japanese portion of CC-100, and the Japanese portion of OSCAR.

How to use

You can use this model directly with a pipeline for text generation.

>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='ku-nlp/gpt2-small-japanese-char')
>>> set_seed(5)
>>> generator("昨日私は京都で", max_length=30, do_sample=True, num_return_sequences=5)

[{'generated_text': '昨日私は京都で仕事していたんですけど、ある日突然京都にいる私'},
 {'generated_text': '昨日私は京都で就職し、母と一緒に奈良県の商工会議所に行ってき'},
 {'generated_text': '昨日私は京都ではありませんが、自分の住んでる事について色々と'},
 {'generated_text': '昨日私は京都では地図を見ることしかしない、京福電車のホームで'},
 {'generated_text': '昨日私は京都でこみちに住み始めた時からある不思議な現象で、そ'}]

You can also use this model to get the features of a given text.

Vocabulary

A character-level vocabulary of size 6K is used. To be precise, rare characters may be split into bytes because byte-level byte-pair encoding (BPE) is used. The BPE tokenizer was trained on a small subset of the training data. Since the data were converted into a one-character-per-line format, merge operations never go beyond character boundaries.

Note that the tokenizer maps U+0020 to [UNK] because preprocessing eliminated whitespace characters (U+0020) from training data.

Training data

We used the following corpora for pre-training:

  • Japanese Wikipedia (as of 20221020, 3.2GB, 27M sentences, 1.3M documents)
  • Japanese portion of CC-100 (85GB, 619M sentences, 66M documents)
  • Japanese portion of OSCAR (54GB, 326M sentences, 25M documents)

Note that we filtered out documents annotated with "header", "footer", or "noisy" tags in OSCAR. Also note that Japanese Wikipedia was duplicated 10 times to make the total size of the corpus comparable to that of CC-100 and OSCAR. As a result, the total size of the training data is 171GB.

Training procedure

The training took about 3 months (with two interruptions) with a single NVIDIA A100 80GB GPU.

The following hyperparameters were used during pre-training:

  • learning_rate: 2e-4
  • per_device_train_batch_size: 36
  • gradient_accumulation_steps: 32
  • optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-06
  • weight_decay: 0.01
  • lr_scheduler_type: linear
  • max_grad_norm: 1.0
  • max_steps: 500,000 (but terminated at *** steps)
  • warmup_steps: 10,000

The eval loss was 1.60 while the eval accuracy was 0.635. The evaluation set consists of 5,000 randomly sampled documents from each of the training corpora.