---
language: ja
thumbnail: https://github.com/ycat3/japanese-pretrained-models/blob/master/jweb.png
tags:
  - ja
  - japanese
  - gpt2
  - text-generation
  - lm
  - nlp
  - rust
  - rust-bert
license: mit
datasets:
  - cc100
  - wikipedia
  - AozoraBunko
widget:
  - text: 夏目漱石は、
---

# japanese-soseki-gpt2-1b


This repository provides a 1.3B-parameter finetuned Japanese GPT-2 model. The model was finetuned by jweb, starting from a Japanese GPT-2 model pretrained by rinna Co., Ltd. Both PyTorch (pytorch_model.bin) and Rust (rust_model.ot) model files are provided.

## How to use the model

NOTE: Use `T5Tokenizer` to initialize the tokenizer.

```python

import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model = AutoModelForCausalLM.from_pretrained("jweb/japanese-soseki-gpt2-1b")

if torch.cuda.is_available():
    model = model.to("cuda")

text = "夏目漱石は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_length=128,
        min_length=40,
        do_sample=True,
        repetition_penalty=1.6,
        early_stopping=True,
        num_beams=5,
        temperature=1.0,
        top_k=500,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
# sample output: 夏目漱石は、明治時代を代表する文豪です。夏目漱石の代表作は「吾輩は猫である」や「坊っちゃん」、「草枕」「三四郎」、それに「虞美人草(ぐびじんそう)」などたくさんあります。
# (English: "Natsume Soseki is a literary giant representative of the Meiji era. His major works include 'I Am a Cat', 'Botchan', 'Kusamakura', and 'Sanshiro', as well as 'Gubijinso' (The Poppy), among many others.")
```

```rust

use rust_bert::gpt2::GPT2Generator;
use rust_bert::pipelines::common::{ModelType, TokenizerOption};
use rust_bert::pipelines::generation_utils::{GenerateConfig, LanguageGenerator};
use rust_bert::resources::{RemoteResource, ResourceProvider};
use tch::Device;

fn main() -> anyhow::Result<()> {
    // Fetch the model weights, config, and sentencepiece model from the Hugging Face Hub
    let model_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/rust_model.ot".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/model".into(),
    });
    let config_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/config.json".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/config".into(),
    });
    let vocab_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/spiece.model".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/vocab".into(),
    });
    let vocab_resource_token = vocab_resource.clone();
    // The sentencepiece tokenizer needs no merges file, so the vocab resource
    // is passed again as a placeholder
    let merges_resource = vocab_resource.clone();
    let generate_config = GenerateConfig {
        model_resource,
        config_resource,
        vocab_resource,
        merges_resource, // not used        
        device: Device::Cpu,
        repetition_penalty: 1.6,
        min_length: 40,
        max_length: 128,
        do_sample: true,
        early_stopping: true,
        num_beams: 5,
        temperature: 1.0,
        top_k: 500,
        top_p: 0.95,
        ..Default::default()
    };
    // Build the sentencepiece tokenizer (T5-style) from the downloaded spiece.model
    let tokenizer = TokenizerOption::from_file(
        ModelType::T5,
        vocab_resource_token.get_local_path().unwrap().to_str().unwrap(),
        None,
        true,
        None,
        None,
    )?;
    let mut gpt2_model = GPT2Generator::new_with_tokenizer(generate_config, tokenizer.into())?;
    // Move generation to the GPU when one is available
    gpt2_model.set_device(Device::cuda_if_available());
    let input_text = "夏目漱石は、";
    let t1 = std::time::Instant::now();
    let output = gpt2_model.generate(Some(&[input_text]), None);
    println!("{}", output[0].text);
    println!("Elapsed Time(ms): {}", t1.elapsed().as_millis());
    Ok(())
}
// sample output: 夏目漱石は、明治から大正にかけて活躍した日本の小説家です。彼は「吾輩は猫である」や「坊っちゃん」、「草枕」「三四郎」、あるいは「虞美人草」などの小説で知られていますが、「明暗」のような小説も書いていました。
// (English: "Natsume Soseki was a Japanese novelist active from the Meiji through the Taisho era. He is known for novels such as 'I Am a Cat', 'Botchan', 'Kusamakura', and 'Sanshiro', as well as 'Gubijinso'; he also wrote novels such as 'Meian' (Light and Darkness).")
```

## Model architecture

A 24-layer, 2048-hidden-size transformer-based language model.
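
These hyperparameters can be confirmed directly from the hosted `config.json`; a minimal sketch, assuming the standard `GPT2Config` field names:

```python
from transformers import AutoConfig

# Read the architecture hyperparameters from the hosted config.json
config = AutoConfig.from_pretrained("jweb/japanese-soseki-gpt2-1b")
print(config.n_layer)  # number of transformer layers, expected: 24
print(config.n_embd)   # hidden size, expected: 2048
```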

## Training

The base model was trained on Japanese C4, Japanese CC-100, and Japanese Wikipedia to optimize a traditional language-modelling objective. It reaches around 14 perplexity on a validation set drawn from the same data.
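
Perplexity here is the exponential of the average per-token cross-entropy. A minimal sketch of measuring it on text of your choosing (the sample sentence below is only illustrative, not the validation set mentioned above):

```python
import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model = AutoModelForCausalLM.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model.eval()

text = "吾輩は猫である。名前はまだ無い。"
input_ids = tokenizer.encode(text, return_tensors="pt")
with torch.no_grad():
    # With labels == input_ids, the model returns the mean cross-entropy loss
    loss = model(input_ids, labels=input_ids).loss
print(torch.exp(loss).item())  # perplexity = exp(mean cross-entropy)
```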

## Finetuning

The model was finetuned on Aozora Bunko texts, in particular the books of Natsume Soseki.
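
For reference, a causal-LM finetuning run of this kind can be sketched with the `transformers` `Trainer`. This is a hypothetical reconstruction, not the actual training script: the base model ID (`rinna/japanese-gpt-1b`), the corpus file `soseki.txt`, and all hyperparameters are assumptions.

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, T5Tokenizer, Trainer,
                          TrainingArguments)

class BlockDataset(Dataset):
    """Split one plain-text corpus into fixed-length token blocks."""
    def __init__(self, path, tokenizer, block_size=512):
        with open(path, encoding="utf-8") as f:
            ids = tokenizer.encode(f.read())
        self.blocks = [ids[i:i + block_size]
                       for i in range(0, len(ids) - block_size + 1, block_size)]

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, i):
        x = torch.tensor(self.blocks[i])
        return {"input_ids": x, "labels": x}  # causal LM: labels are the inputs

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt-1b")  # assumed base model
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-1b")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=BlockDataset("soseki.txt", tokenizer),  # hypothetical corpus file
)
trainer.train()
```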

## Tokenization

The model uses a sentencepiece-based tokenizer. The vocabulary was first trained on a selected subset from the training data using the official sentencepiece training script, and then augmented with emojis and symbols.
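
A quick way to see the sentencepiece segmentation in action (the printed pieces are illustrative; the exact splits depend on the trained vocabulary):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")
print(tokenizer.vocab_size)               # size of the sentencepiece vocabulary
print(tokenizer.tokenize("夏目漱石は、"))   # subword pieces, e.g. ['▁', '夏目', '漱石', 'は', '、']
```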

## License

This model is released under the MIT license.