|
---
language: ja
tags:
- ja
- japanese
- bart
- lm
- nlp
license: mit
---
|
|
|
# bart-base-japanese-news (base-sized model)
|
This repository provides a Japanese BART model. The model was trained by [Stockmark Inc.](https://stockmark.co.jp) |
|
|
|
|
|
## Model description |
|
|
|
BART is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. |
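
As a rough, toy illustration of this pre-training idea (this is not the actual training code), one of BART's noising functions, text infilling, replaces a contiguous span of tokens with a single mask token; the model then learns to reproduce the uncorrupted text:

```python
import random

def infill_noise(tokens, mask_token="<mask>"):
    # Toy noising function: replace one random contiguous span with a single
    # mask token. The reconstruction target is the original token sequence.
    start = random.randrange(len(tokens))
    length = random.randint(1, len(tokens) - start)
    return tokens[:start] + [mask_token] + tokens[start + length:]

original = ["今日", "は", "良い", "天気", "です", "。"]
corrupted = infill_noise(original)
print(corrupted)  # e.g. ['今日', 'は', '<mask>', 'です', '。']
```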
|
|
|
BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering). |
|
|
|
## Intended uses & limitations |
|
|
|
You can use the raw model for text infilling. However, the model is mostly meant to be fine-tuned on a supervised dataset. |
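
As a minimal sketch of what supervised fine-tuning looks like with this checkpoint (the source/target texts, learning rate, and single training step below are illustrative placeholders, not a recommended recipe):

```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Hypothetical supervised pair (e.g. article -> summary); use your own dataset in practice.
source = "長い記事の本文。"
target = "短い要約。"

inputs = tokenizer([source], max_length=512, return_tensors="pt", truncation=True)
labels = tokenizer([target], max_length=128, return_tensors="pt", truncation=True)["input_ids"]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
outputs = model(**inputs, labels=labels)  # labels give the standard seq2seq cross-entropy loss
outputs.loss.backward()
optimizer.step()
```

In practice you would iterate this over a full dataset, for example with `Seq2SeqTrainer` from `transformers`.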
|
|
|
# How to use the model |
|
|
|
*NOTE:* Use `trust_remote_code=True` to instantiate the tokenizer.
|
|
|
## Simple use |
|
|
|
```python
from transformers import AutoTokenizer, BartModel

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartModel.from_pretrained(model_name)

inputs = tokenizer("今日は良い天気です。", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
```
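
Here `outputs.last_hidden_state` contains the final hidden states of the decoder, with shape `(batch_size, sequence_length, hidden_size)`. For generation tasks, load the model with `BartForConditionalGeneration` instead, as in the examples below.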
|
|
|
## Sentence permutation
|
```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

if torch.cuda.is_available():
    model = model.to("cuda")

# correct sentence order: "明日は大雨です。電車は止まる可能性があります。ですから、自宅から働きます。"
text = "電車は止まる可能性があります。ですから、自宅から働きます。明日は大雨です。"

inputs = tokenizer([text], max_length=128, return_tensors="pt", truncation=True)
text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, max_length=128)
output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
# sample output: 明日は大雨です。電車は止まる可能性があります。ですから、自宅から働きます。
```
|
## Mask filling
|
```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

if torch.cuda.is_available():
    model = model.to("cuda")

text = "今日の天気は<mask>のため、傘が必要でしょう。"

inputs = tokenizer([text], max_length=128, return_tensors="pt", truncation=True)
text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, max_length=128)
output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
# sample output: 今日の天気は、雨のため、傘が必要でしょう。
```
|
|
|
## Text generation |
|
|
|
*NOTE:* You can use the raw model for text generation. However, the model is mostly meant to be fine-tuned on a supervised dataset. |
|
|
|
```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

if torch.cuda.is_available():
    model = model.to("cuda")

text = "自然言語処理(しぜんげんごしょり、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。「計算言語学」(computational linguistics)との類似もあるが、自然言語処理は工学的な視点からの言語処理をさすのに対して、計算言語学は言語学的視点を重視する手法をさす事が多い。"

inputs = tokenizer([text], max_length=512, return_tensors="pt", truncation=True)
text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, min_length=0, max_length=40)
output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
# sample output: 自然言語処理(しぜんげんごしょり、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、言語学の一分野である。
```
|
|
|
# Training |
|
The model was trained on Japanese news articles.
|
|
|
# Tokenization |
|
The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer. The vocabulary was first trained on a selected subset of the training data using the official sentencepiece training script.
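
For reference, a sentencepiece vocabulary of this kind can be trained roughly as follows. The input file, vocabulary size, and other options below are illustrative assumptions, not the exact settings used for this model:

```python
import sentencepiece as spm

# Train a unigram sentencepiece model on a text file with one sentence per line.
# All settings below are placeholders for illustration only.
spm.SentencePieceTrainer.train(
    input="train_subset.txt",
    model_prefix="sp_ja_news",
    vocab_size=32000,
    character_coverage=0.9995,  # a common choice for Japanese text
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="sp_ja_news.model")
print(sp.encode("今日は良い天気です。", out_type=str))
```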
|
|
|
# Licenses |
|
The pretrained models are distributed under the terms of the [MIT License](https://opensource.org/licenses/mit-license.php). |
|
|
|
# Acknowledgement |
|
This comparison study was supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).
|
|