---
language: ja
thumbnail: https://github.com/ycat3/japanese-pretrained-models/blob/master/jweb.png
tags:
- ja
- japanese
- gpt2
- text-generation
- lm
- nlp
- rust
- rust-bert

license: mit

datasets:
- cc100
- wikipedia
- AozoraBunko

widget:
- text: "夏目漱石は、"
---

# japanese-soseki-gpt2-1b

![jweb-icon](./jweb.png)

This repository provides a 1.3B-parameter finetuned Japanese GPT-2 model.
The model was finetuned by [jweb](https://jweb.asia/) from a pretrained Japanese GPT-2 model released by [rinna Co., Ltd.](https://corp.rinna.co.jp/).
Both PyTorch (pytorch_model.bin) and Rust (rust_model.ot) model files are provided.

# How to use the model

*NOTE:* Use `T5Tokenizer` to initialize the tokenizer.

~~~~python
import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model = AutoModelForCausalLM.from_pretrained("jweb/japanese-soseki-gpt2-1b")

if torch.cuda.is_available():
    model = model.to("cuda")

text = "夏目漱石は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_length=128,
        min_length=40,
        do_sample=True,
        repetition_penalty=1.6,
        early_stopping=True,
        num_beams=5,
        temperature=1.0,
        top_k=500,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)  
# sample output: 夏目漱石は、明治時代を代表する文豪です。夏目漱石の代表作は「吾輩は猫である」や「坊っちゃん」、「草枕」「三四郎」、それに「虞美人草(ぐびじんそう)」などたくさんあります。
~~~~
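
The settings above combine beam search (`num_beams=5`) with top-k/top-p sampling and a repetition penalty of 1.6, so the output varies between runs; lowering `num_beams` or `max_length` is an easy way to speed up generation on CPU.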

~~~~rust
use rust_bert::gpt2::GPT2Generator;
use rust_bert::pipelines::common::{ModelType, TokenizerOption};
use rust_bert::pipelines::generation_utils::{GenerateConfig, LanguageGenerator};
use rust_bert::resources::{RemoteResource, ResourceProvider};
use tch::Device;

fn main() -> anyhow::Result<()> {
    let model_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/rust_model.ot".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/model".into(),
    });
    let config_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/config.json".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/config".into(),
    });
    let vocab_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/spiece.model".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/vocab".into(),
    });
    let vocab_resource_token = vocab_resource.clone();
    let merges_resource = vocab_resource.clone();
    let generate_config = GenerateConfig {
        model_resource,
        config_resource,
        vocab_resource,
        merges_resource, // not used by the sentencepiece-based tokenizer
        device: Device::Cpu,
        repetition_penalty: 1.6,
        min_length: 40,
        max_length: 128,
        do_sample: true,
        early_stopping: true,
        num_beams: 5,
        temperature: 1.0,
        top_k: 500,
        top_p: 0.95,
        ..Default::default()
    };
    let tokenizer = TokenizerOption::from_file(
        ModelType::T5,
        vocab_resource_token.get_local_path().unwrap().to_str().unwrap(),
        None,
        true,
        None,
        None,
    )?;
    let mut gpt2_model = GPT2Generator::new_with_tokenizer(generate_config, tokenizer.into())?;
    gpt2_model.set_device(Device::cuda_if_available());
    let input_text = "夏目漱石は、";
    let t1 = std::time::Instant::now();
    let output = gpt2_model.generate(Some(&[input_text]), None);
    println!("{}", output[0].text);
    println!("Elapsed Time(ms):{}",t1.elapsed().as_millis()); 
    Ok(())
}
// sample output: 夏目漱石は、明治から大正にかけて活躍した日本の小説家です。彼は「吾輩は猫である」や「坊っちゃん」、「草枕」「三四郎」、あるいは「虞美人草」などの小説で知られていますが、「明暗」のような小説も書いていました。
~~~~
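
The Rust example assumes the `rust-bert`, `tch`, and `anyhow` crates are declared in `Cargo.toml`; the `rust-bert` and `tch` versions may need to match the libtorch release used to export `rust_model.ot`.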

# Model architecture
A 24-layer, 2048-hidden-size transformer-based language model.
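
To verify these dimensions locally, you can inspect the published configuration without downloading the weights. This is a minimal sketch that assumes the repository's `config.json` is a standard GPT-2 configuration exposing `n_layer` and `n_embd`:

~~~~python
from transformers import AutoConfig

# Load only the configuration file (no model weights are downloaded).
config = AutoConfig.from_pretrained("jweb/japanese-soseki-gpt2-1b")

# For a GPT-2 style configuration these fields hold the layer count and hidden size.
print(config.n_layer)  # expected: 24
print(config.n_embd)   # expected: 2048
~~~~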

# Training
The model was trained on [Japanese C4](https://huggingface.co/datasets/allenai/c4), [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz) and [Japanese Wikipedia](https://dumps.wikimedia.org/other/cirrussearch) to optimize a traditional language modelling objective. It reaches around 14 perplexity on a held-out validation set sampled from the same data.
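
For reference, perplexity can be measured as the exponential of the model's average cross-entropy loss on held-out text. The sketch below only illustrates that calculation; the sample sentence is a placeholder, not the validation set behind the reported figure.

~~~~python
import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model = AutoModelForCausalLM.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model.eval()

# Placeholder text; substitute your own held-out validation data.
text = "吾輩は猫である。名前はまだ無い。"
input_ids = tokenizer.encode(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the average cross-entropy loss.
    loss = model(input_ids, labels=input_ids).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")
~~~~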
# Finetuning
The model was finetuned on [Aozorabunko](https://github.com/aozorabunko/aozorabunko), in particular the works of Natsume Soseki.
# Tokenization
The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer. The vocabulary was first trained on a selected subset from the training data using the official sentencepiece training script, and then augmented with emojis and symbols.
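
To see how the sentencepiece vocabulary splits Japanese text, you can tokenize a sample sentence directly (a minimal sketch reusing the `T5Tokenizer` loading shown above):

~~~~python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")

# Split a Japanese sentence into sentencepiece subword units.
tokens = tokenizer.tokenize("夏目漱石は、明治時代を代表する文豪です。")
print(tokens)

# Map the subword units to vocabulary ids and decode them back to text.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
print(tokenizer.decode(ids))
~~~~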
# License
[The MIT license](https://opensource.org/licenses/MIT)