---
language: ja
thumbnail: https://github.com/ycat3/japanese-pretrained-models/blob/master/jweb.png
tags:
- ja
- japanese
- gpt2
- text-generation
- lm
- nlp
- rust
- rust-bert
license: mit
datasets:
- cc100
- wikipedia
- AozoraBunko
widget:
- text: "夏目漱石は、"
---
# japanese-soseki-gpt2-1b

![jweb-icon](./jweb.png)

This repository provides a 1.3B-parameter finetuned Japanese GPT-2 model.

The model was finetuned by [jweb](https://jweb.asia/), starting from a Japanese GPT-2 model pretrained by [rinna Co., Ltd.](https://corp.rinna.co.jp/).

Both PyTorch (pytorch_model.bin) and Rust (rust_model.ot) model files are provided.
# How to use the model

*NOTE:* Use `T5Tokenizer` to instantiate the tokenizer.

~~~~python
import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

# The model ships a sentencepiece vocabulary, so the tokenizer is loaded with T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model = AutoModelForCausalLM.from_pretrained("jweb/japanese-soseki-gpt2-1b")

if torch.cuda.is_available():
    model = model.to("cuda")

text = "夏目漱石は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    # Beam search combined with sampling; adjust max_length/top_k/top_p to taste
    output_ids = model.generate(
        token_ids.to(model.device),
        max_length=128,
        min_length=40,
        do_sample=True,
        repetition_penalty=1.6,
        early_stopping=True,
        num_beams=5,
        temperature=1.0,
        top_k=500,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
# sample output: 夏目漱石は、明治時代を代表する文豪です。夏目漱石の代表作は「吾輩は猫である」や「坊っちゃん」、「草枕」「三四郎」、それに「虞美人草(ぐびじんそう)」などたくさんあります。
~~~~

~~~~rust
use rust_bert::gpt2::GPT2Generator;
use rust_bert::pipelines::common::{ModelType, TokenizerOption};
use rust_bert::pipelines::generation_utils::{GenerateConfig, LanguageGenerator};
use rust_bert::resources::{RemoteResource, ResourceProvider};
use tch::Device;

fn main() -> anyhow::Result<()> {
    // Fetch the model weights, config and sentencepiece vocabulary from the Hugging Face Hub
    let model_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/rust_model.ot".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/model".into(),
    });
    let config_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/config.json".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/config".into(),
    });
    let vocab_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/spiece.model".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/vocab".into(),
    });
    let vocab_resource_token = vocab_resource.clone();
    let merges_resource = vocab_resource.clone();
    let generate_config = GenerateConfig {
        model_resource,
        config_resource,
        vocab_resource,
        merges_resource, // not used
        device: Device::Cpu,
        repetition_penalty: 1.6,
        min_length: 40,
        max_length: 128,
        do_sample: true,
        early_stopping: true,
        num_beams: 5,
        temperature: 1.0,
        top_k: 500,
        top_p: 0.95,
        ..Default::default()
    };
    // The model uses a sentencepiece vocabulary, so a T5-style tokenizer is loaded
    let tokenizer = TokenizerOption::from_file(
        ModelType::T5,
        vocab_resource_token.get_local_path().unwrap().to_str().unwrap(),
        None,
        true,
        None,
        None,
    )?;
    let mut gpt2_model = GPT2Generator::new_with_tokenizer(generate_config, tokenizer.into())?;
    gpt2_model.set_device(Device::cuda_if_available());
    let input_text = "夏目漱石は、";
    let t1 = std::time::Instant::now();
    let output = gpt2_model.generate(Some(&[input_text]), None);
    println!("{}", output[0].text);
    println!("Elapsed Time(ms):{}", t1.elapsed().as_millis());
    Ok(())
}
// sample output: 夏目漱石は、明治から大正にかけて活躍した日本の小説家です。彼は「吾輩は猫である」や「坊っちゃん」、「草枕」「三四郎」、あるいは「虞美人草」などの小説で知られていますが、「明暗」のような小説も書いていました。
~~~~
# Model architecture

A 24-layer, 2048-hidden-size transformer-based language model.
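
These dimensions can be checked against the published config.json; the sketch below assumes the standard `GPT2Config` field names (`n_layer`, `n_embd`, `n_head`) that `transformers` uses for GPT-2 checkpoints.

~~~~python
from transformers import AutoConfig

# Inspect the published configuration; field names follow the standard GPT2Config schema.
config = AutoConfig.from_pretrained("jweb/japanese-soseki-gpt2-1b")
print(config.n_layer)     # number of transformer layers (24 according to this card)
print(config.n_embd)      # hidden size (2048 according to this card)
print(config.n_head)      # number of attention heads
print(config.vocab_size)  # size of the sentencepiece vocabulary
~~~~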
# Training

The model was trained on [Japanese C4](https://huggingface.co/datasets/allenai/c4), [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz) and [Japanese Wikipedia](https://dumps.wikimedia.org/other/cirrussearch) to optimize a traditional language modelling objective. It reaches around 14 perplexity on a chosen validation set from the same data.
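
The evaluation script behind the quoted perplexity is not included in this card. As a rough illustration, perplexity for a causal LM is the exponential of the average token-level cross-entropy, which can be computed like this:

~~~~python
import math
import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model = AutoModelForCausalLM.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model.eval()

# Any held-out Japanese text; this sentence is only an illustration.
text = "吾輩は猫である。名前はまだ無い。"
input_ids = tokenizer.encode(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy over predicted tokens.
    loss = model(input_ids, labels=input_ids).loss

print(f"perplexity: {math.exp(loss.item()):.1f}")
~~~~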
# Finetuning

The model was finetuned on [Aozorabunko](https://github.com/aozorabunko/aozorabunko), in particular on the books of Natsume Soseki.
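
The exact finetuning recipe is not listed in this card. For reference, a generic causal-LM finetuning run with the `transformers` `Trainer` looks roughly like the sketch below; the base checkpoint, corpus file name and all hyperparameters here are illustrative assumptions, not the settings actually used.

~~~~python
import torch
from torch.utils.data import Dataset
from transformers import (T5Tokenizer, AutoModelForCausalLM, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

# Assumed starting point: rinna's pretrained 1.3B Japanese GPT-2 checkpoint.
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt-1b")
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-1b")

class BlockDataset(Dataset):
    """Cuts one long token stream into fixed-length training blocks."""
    def __init__(self, path, block_size=512):
        ids = tokenizer.encode(open(path, encoding="utf-8").read())
        self.blocks = [ids[i:i + block_size]
                       for i in range(0, len(ids) - block_size, block_size)]
    def __len__(self):
        return len(self.blocks)
    def __getitem__(self, i):
        return {"input_ids": torch.tensor(self.blocks[i])}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="japanese-soseki-gpt2-1b",
                           num_train_epochs=3,            # illustrative value
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           fp16=True),
    train_dataset=BlockDataset("soseki_aozora.txt"),       # hypothetical text dump
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
~~~~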
# Tokenization

The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer. The vocabulary was first trained on a selected subset from the training data using the official sentencepiece training script, and then augmented with emojis and symbols.
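
The exact sentencepiece training command and the symbol augmentation step are not reproduced in this card. For illustration, a unigram sentencepiece model of this kind can be trained with the official Python bindings roughly as follows (the file name, vocabulary size and options are placeholder assumptions), and the tokenizer actually shipped with the model can be inspected directly:

~~~~python
import sentencepiece as spm
from transformers import T5Tokenizer

# Illustrative training call; the real corpus file, vocabulary size and
# options used for this model are not published in this card.
spm.SentencePieceTrainer.train(
    input="training_subset.txt",   # hypothetical subset of the training data
    model_prefix="spiece",
    model_type="unigram",
    vocab_size=32000,              # placeholder value
    character_coverage=0.9995,
)

# Inspect the released tokenizer.
tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")
print(len(tokenizer))
print(tokenizer.tokenize("夏目漱石は、"))
~~~~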
# License

[The MIT license](https://opensource.org/licenses/MIT)