---
tags:
- translation
- japanese
language:
- ja
- en
license: mit
widget:
- text: 今日もご安全に
---
# mbart-ja-en
This model is based on facebook/mbart-large-cc25 and fine-tuned on the JESC dataset.
## How to use
```py
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("ken11/mbart-ja-en")
model = MBartForConditionalGeneration.from_pretrained("ken11/mbart-ja-en")

# Tokenize the Japanese input and generate the English translation
inputs = tokenizer("こんにちは", return_tensors="pt")
translated_tokens = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
    early_stopping=True,
    max_length=48,
)
pred = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(pred)
```
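To translate several sentences at once, the same call works on a padded batch. The following is a minimal sketch that reuses the generation settings from the example above; the input sentences are placeholders.

```py
# Translate a small batch of Japanese sentences (placeholder inputs)
sentences = ["今日もご安全に", "ありがとうございました"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
translated_tokens = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
    early_stopping=True,
    max_length=48,
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True))
```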
## Training Data
I used the JESC dataset for training.
Thank you for publishing such a large dataset.
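As a rough illustration of how such a parallel corpus can be turned into translation pairs, the sketch below assumes JESC is distributed as a tab-separated file with one English/Japanese pair per line; the file path and column order are assumptions, not a description of the actual preprocessing.

```py
# Hypothetical loader for a tab-separated parallel corpus (en \t ja per line)
def load_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                en, ja = parts
                pairs.append({"ja": ja, "en": en})
    return pairs

train_pairs = load_pairs("jesc/train")  # path is an assumption
print(train_pairs[0])
```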
## Tokenizer
The tokenizer uses a SentencePiece model trained on the JESC dataset.
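For reference, training a SentencePiece model on a text corpus looks roughly like the sketch below; the input file name, vocabulary size, and other options are illustrative assumptions, not the exact settings used for this tokenizer.

```py
import sentencepiece as spm

# Train a SentencePiece model on a plain-text corpus (one sentence per line);
# all parameters here are assumptions
spm.SentencePieceTrainer.train(
    input="jesc_corpus.txt",
    model_prefix="jesc_sp",
    vocab_size=32000,
)

# Load the trained model and tokenize a sample sentence
sp = spm.SentencePieceProcessor(model_file="jesc_sp.model")
print(sp.encode("今日もご安全に", out_type=str))
```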
## Note
The sacrebleu score evaluated on the JEC Basic Sentence Data from Kyoto University was 18.18.
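A score like this can be computed with the sacrebleu library roughly as follows; the hypothesis and reference file names are assumptions.

```py
import sacrebleu

# Load model translations and reference translations (file names are assumptions)
with open("hypotheses.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("references.en", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)
```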