---
language: ja
thumbnail: https://github.com/ycat3/japanese-pretrained-models/blob/master/jweb.png
tags:
- ja
- japanese
- gpt2
- text-generation
- lm
- nlp
- rust
- rust-bert
license: mit
datasets:
- cc100
- wikipedia
- AozoraBunko
widget:
- text: "夏目漱石は、"
---
# japanese-soseki-gpt2-1b
![jweb-icon](./jweb.png)
This repository provides a 1.3B-parameter finetuned Japanese GPT2 model.
The model was finetuned by [jweb](https://jweb.asia/) from a base model trained by [rinna Co., Ltd.](https://corp.rinna.co.jp/).
Both PyTorch (`pytorch_model.bin`) and Rust (`rust_model.ot`) weights are provided.
# How to use the model
*NOTE:* Use `T5Tokenizer` to initialize the tokenizer.
~~~~python
import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model = AutoModelForCausalLM.from_pretrained("jweb/japanese-soseki-gpt2-1b")

# Move the model to GPU when one is available.
if torch.cuda.is_available():
    model = model.to("cuda")

text = "夏目漱石は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_length=128,
        min_length=40,
        do_sample=True,
        repetition_penalty=1.6,
        early_stopping=True,
        num_beams=5,
        temperature=1.0,
        top_k=500,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
# sample output: 夏目漱石は、明治時代を代表する文豪です。夏目漱石の代表作は「吾輩は猫である」や「坊っちゃん」、「草枕」「三四郎」、それに「虞美人草(ぐびじんそう)」などたくさんあります。
# (roughly: "Natsume Soseki is a literary master representative of the Meiji era. His best-known works include I Am a Cat, Botchan, Kusamakura, Sanshiro, and Gubijinso (The Poppy), among many others.")
~~~~
~~~~rust
use rust_bert::gpt2::GPT2Generator;
use rust_bert::pipelines::common::{ModelType, TokenizerOption};
use rust_bert::pipelines::generation_utils::{GenerateConfig, LanguageGenerator};
use rust_bert::resources::{RemoteResource, ResourceProvider};
use tch::Device;

fn main() -> anyhow::Result<()> {
    let model_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/rust_model.ot".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/model".into(),
    });
    let config_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/config.json".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/config".into(),
    });
    let vocab_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/spiece.model".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/vocab".into(),
    });
    let vocab_resource_token = vocab_resource.clone();
    let merges_resource = vocab_resource.clone();
    let generate_config = GenerateConfig {
        model_resource,
        config_resource,
        vocab_resource,
        merges_resource, // not used by the sentencepiece tokenizer
        device: Device::Cpu,
        repetition_penalty: 1.6,
        min_length: 40,
        max_length: 128,
        do_sample: true,
        early_stopping: true,
        num_beams: 5,
        temperature: 1.0,
        top_k: 500,
        top_p: 0.95,
        ..Default::default()
    };
    // The model shares its sentencepiece tokenizer format with T5, so the
    // tokenizer is loaded from the spiece.model file fetched above.
    let tokenizer = TokenizerOption::from_file(
        ModelType::T5,
        vocab_resource_token.get_local_path().unwrap().to_str().unwrap(),
        None,
        true,
        None,
        None,
    )?;
    let mut gpt2_model = GPT2Generator::new_with_tokenizer(generate_config, tokenizer.into())?;
    gpt2_model.set_device(Device::cuda_if_available());

    let input_text = "夏目漱石は、";
    let t1 = std::time::Instant::now();
    let output = gpt2_model.generate(Some(&[input_text]), None);
    println!("{}", output[0].text);
    println!("Elapsed Time(ms): {}", t1.elapsed().as_millis());
    Ok(())
}
// sample output: 夏目漱石は、明治から大正にかけて活躍した日本の小説家です。彼は「吾輩は猫である」や「坊っちゃん」、「草枕」「三四郎」、あるいは「虞美人草」などの小説で知られていますが、「明暗」のような小説も書いていました。
// (roughly: "Natsume Soseki is a Japanese novelist active from the Meiji into the Taisho era. He is known for novels such as I Am a Cat, Botchan, Kusamakura, Sanshiro, and Gubijinso, and also wrote novels like Meian (Light and Darkness).")
~~~~
# Model architecture
A 24-layer, 2048-hidden-size transformer-based language model.
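These dimensions can be read off the published `config.json`; a minimal sketch, assuming a GPT2-style configuration whose depth and width live in the `n_layer` and `n_embd` fields:
~~~~python
from transformers import AutoConfig

# Fetch only the configuration file from the Hub.
config = AutoConfig.from_pretrained("jweb/japanese-soseki-gpt2-1b")

# Field names assume a GPT2-style config; per the description above,
# expected values are 24 layers and a 2048-dimensional hidden state.
print(config.n_layer)  # transformer depth
print(config.n_embd)   # hidden size
~~~~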
# Training
The model was trained on [Japanese C4](https://huggingface.co/datasets/allenai/c4), [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz) and [Japanese Wikipedia](https://dumps.wikimedia.org/other/cirrussearch) to optimize a traditional language modelling objective. It reaches a perplexity of around 14 on a validation set held out from the same data.
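Perplexity is the exponential of the mean language-modelling loss, so a comparable number can be estimated as sketched below; the validation text here is a placeholder, since the actual validation set is not published:
~~~~python
import math

import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model = AutoModelForCausalLM.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model.eval()

# Placeholder text standing in for the held-out validation set.
text = "吾輩は猫である。名前はまだ無い。"
token_ids = tokenizer.encode(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # next-token cross-entropy; perplexity is its exponential.
    loss = model(token_ids, labels=token_ids).loss
print(math.exp(loss.item()))
~~~~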
# Finetuning
The model was finetuned on [Aozorabunko](https://github.com/aozorabunko/aozorabunko), in particular books by Natsume Soseki; a sketch of such a run follows.
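A run of this shape can be reproduced with the standard `transformers` Trainer. In the minimal sketch below, the base checkpoint name, the `soseki.txt` corpus file, and all hyperparameters are illustrative assumptions, not the settings actually used:
~~~~python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          T5Tokenizer, Trainer, TrainingArguments)

# Assumed base checkpoint; the card only states the base was trained by rinna.
base = "rinna/japanese-gpt-1b"
tokenizer = T5Tokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# "soseki.txt" is a hypothetical plain-text dump of Soseki books from Aozora Bunko.
dataset = load_dataset("text", data_files={"train": "soseki.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="japanese-soseki-gpt2-1b",
    per_device_train_batch_size=1,   # illustrative; a 1.3B model needs small batches
    gradient_accumulation_steps=16,  # illustrative
    num_train_epochs=3,              # illustrative
    fp16=True,
)
Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
~~~~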
# Tokenization
The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer. The vocabulary was first trained on a selected subset from the training data using the official sentencepiece training script, and then augmented with emojis and symbols.
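The resulting segmentation can be inspected directly through the tokenizer shipped with the model; a small sketch:
~~~~python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")

# Subword pieces produced by the sentencepiece model for a Japanese sentence.
print(tokenizer.tokenize("夏目漱石は、明治時代を代表する文豪です。"))
# Size of the sentencepiece vocabulary.
print(tokenizer.vocab_size)
~~~~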
# License
[The MIT license](https://opensource.org/licenses/MIT)