---
tags:
- translation
- japanese
language:
- ja
- en
license: mit
widget:
- text: "今日もご安全に"
---
## mbart-ja-en
This model is based on [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) and fine-tuned on the [JESC dataset](https://nlp.stanford.edu/projects/jesc/index_ja.html).
## How to use
```py
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("ken11/mbart-ja-en")
model = MBartForConditionalGeneration.from_pretrained("ken11/mbart-ja-en")

# Tokenize the Japanese source sentence
inputs = tokenizer("こんにちは", return_tensors="pt")

# Force the decoder to generate English by starting from the en_XX language code
translated_tokens = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
    early_stopping=True,
    max_length=48,
)
pred = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(pred)
```
## Training Data
I used the [JESC dataset](https://nlp.stanford.edu/projects/jesc/index_ja.html) for training.
Thank you for publishing such a large dataset.
## Tokenizer
The tokenizer is a [SentencePiece](https://github.com/google/sentencepiece) model trained on the JESC dataset.
## Note
The sacreBLEU score on the [JEC Basic Sentence Data of Kyoto University](https://nlp.ist.i.kyoto-u.ac.jp/EN/?JEC+Basic+Sentence+Data#i0163896) was `18.18`.
## License
[The MIT license](https://opensource.org/licenses/MIT)