---
language: ja
tags:
- ja
- japanese
- bart
- lm
- nlp
license: mit
---

# bart-base-japanese-news (base-sized model)
This repository provides a Japanese BART model. The model was trained by [Stockmark Inc.](https://stockmark.co.jp).

An introductory article on the model can be found at the following URL.

[https://tech.stockmark.co.jp/blog/bart-japanese-base-news/](https://tech.stockmark.co.jp/blog/bart-japanese-base-news/)

## Model description

BART is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.

BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering).
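
For intuition, the pre-training objective can be pictured as a denoising seq2seq task: the encoder reads a corrupted sentence and the decoder is trained to reproduce the original. The snippet below is a minimal illustration of such an (input, target) pair using this model's `<mask>` token and the standard `labels` argument; the sentences are made up for illustration, and this is not the training code used by Stockmark (actual pre-training combines several corruption strategies).

```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

# A corrupted sentence (one span replaced by <mask>) and its original form.
corrupted = "明日は<mask>のため、電車が止まるかもしれません。"
original = "明日は大雨のため、電車が止まるかもしれません。"

inputs = tokenizer([corrupted], return_tensors="pt")
labels = tokenizer([original], return_tensors="pt")["input_ids"]

# The model learns by minimizing the loss for reconstructing the original text
# from the corrupted input.
outputs = model(**inputs, labels=labels)
print(outputs.loss)  # reconstruction (cross-entropy) loss
```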

## Intended uses & limitations

You can use the raw model for text infilling. However, the model is mostly meant to be fine-tuned on a supervised dataset.
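
As a rough sketch of what supervised fine-tuning might look like, the example below performs a single training step on a toy summarization-style pair. The source/target texts, optimizer, and learning rate are placeholders chosen for illustration, not the setup used by Stockmark; a real run would iterate over a proper dataset with batching and evaluation.

```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Toy source/target pair; replace with a real supervised dataset.
source = "昨日、東京で大規模な花火大会が開催され、多くの人が訪れました。"
target = "東京で花火大会が開催された。"

inputs = tokenizer([source], return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer([target], return_tensors="pt", truncation=True, max_length=128)["input_ids"]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
outputs = model(**inputs, labels=labels)  # seq2seq cross-entropy loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```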

# How to use the model

*NOTE:* Since we are using a custom tokenizer, please use `trust_remote_code=True` to initialize the tokenizer.

## Simple use

```python
from transformers import AutoTokenizer, BartModel

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartModel.from_pretrained(model_name)

inputs = tokenizer("今日は良い天気です。", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
```

## Sentence Permutation
```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

if torch.cuda.is_available():
    model = model.to("cuda")

# the correctly ordered text is "明日は大雨です。電車は止まる可能性があります。ですから、自宅から働きます。"
text = "電車は止まる可能性があります。ですから、自宅から働きます。明日は大雨です。"

inputs = tokenizer([text], max_length=128, return_tensors="pt", truncation=True)
text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, max_length=128)
output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
# sample output: 明日は大雨です。電車は止まる可能性があります。ですから、自宅から働きます。
```
## Mask filling
```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

if torch.cuda.is_available():
    model = model.to("cuda")

text = "今日の天気は<mask>のため、傘が必要でしょう。"

inputs = tokenizer([text], max_length=128, return_tensors="pt", truncation=True)
text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, max_length=128)
output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
# sample output: 今日の天気は、雨のため、傘が必要でしょう。
```

## Text generation

*NOTE:* You can use the raw model for text generation. However, the model is mostly meant to be fine-tuned on a supervised dataset.

```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

if torch.cuda.is_available():
    model = model.to("cuda")

text = "自然言語処理(しぜんげんごしょり、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。「計算言語学」(computational linguistics)との類似もあるが、自然言語処理は工学的な視点からの言語処理をさすのに対して、計算言語学は言語学的視点を重視する手法をさす事が多い。"

inputs = tokenizer([text], max_length=512, return_tensors="pt", truncation=True)
text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, min_length=0, max_length=40)
output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
# sample output: 自然言語処理(しぜんげんごしょり、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、言語学の一分野である。
```

# Training
The model was trained on Japanese news articles.

# Tokenization
The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer. The vocabulary was first trained on a selected subset from the training data using the official sentencepiece training script.
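
As a rough illustration of that step, training a sentencepiece vocabulary looks like the following. This uses the library's Python trainer rather than the command-line script, and the corpus file name, vocabulary size, and other options are placeholders, not the exact settings used for this model.

```python
import sentencepiece as spm

# Train a sentencepiece model on a plain-text corpus (one sentence per line).
# The file name, model prefix, and vocab_size are illustrative placeholders.
spm.SentencePieceTrainer.train(
    input="japanese_news_subset.txt",
    model_prefix="bart_ja_news_sp",
    vocab_size=32000,
    character_coverage=0.9995,
)

# The resulting .model file can then back the custom BART tokenizer.
sp = spm.SentencePieceProcessor(model_file="bart_ja_news_sp.model")
print(sp.encode("今日は良い天気です。", out_type=str))
```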

# Licenses
The pretrained models are distributed under the terms of the [MIT License](https://opensource.org/licenses/mit-license.php).

*NOTE:* Only `tokenization_bart_japanese_news.py` is distributed under the Apache License, Version 2.0.

# Contact
If you have any questions, please contact us using [our contact form](https://stockmark.co.jp/contact).

# Acknowledgement
This comparison study was supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).