---
tags:
- translation
- japanese

language:
- ja
- en

license: mit

widget:
- text: "今日もご安全に"

---
## mbart-ja-en
This model is based on [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) and fine-tuned on the [JESC dataset](https://nlp.stanford.edu/projects/jesc/index_ja.html).

## How to use
```py
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("ken11/mbart-ja-en")
model = MBartForConditionalGeneration.from_pretrained("ken11/mbart-ja-en")

# Tokenize the Japanese input
inputs = tokenizer("こんにちは", return_tensors="pt")

# Force English output by starting decoding with the "en_XX" language code token
translated_tokens = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
    early_stopping=True,
    max_length=48,
)
pred = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(pred)
```

## Training Data
I used the [JESC dataset](https://nlp.stanford.edu/projects/jesc/index_ja.html) for training.  
Thank you for publishing such a large dataset.  

## Tokenizer
The tokenizer uses a [SentencePiece](https://github.com/google/sentencepiece) model trained on the JESC dataset.
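
If you want to see how the SentencePiece model segments Japanese input, a minimal sketch looks like this (the exact subword pieces depend on the trained vocabulary):

```py
from transformers import MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("ken11/mbart-ja-en")

# Show the SentencePiece subword pieces for a Japanese sentence
print(tokenizer.tokenize("今日もご安全に"))
```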

## Note
Evaluated on the [JEC Basic Sentence Data of Kyoto University](https://nlp.ist.i.kyoto-u.ac.jp/EN/?JEC+Basic+Sentence+Data#i0163896), the model achieves a SacreBLEU score of `18.18`.
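
For reference, a corpus-level SacreBLEU score can be computed along these lines (a minimal sketch using the `sacrebleu` package; the `predictions` and `references` lists below are placeholders, not the actual evaluation data):

```py
import sacrebleu

# Hypothetical model outputs and their reference translations
predictions = ["Hello.", "Stay safe today as well."]
references = [["Hello.", "Please stay safe today too."]]  # one list per reference stream

# Corpus-level BLEU over all sentence pairs
score = sacrebleu.corpus_bleu(predictions, references)
print(score.score)
```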

## License
[The MIT license](https://opensource.org/licenses/MIT)