---
license: cc-by-sa-4.0
datasets:
- Mitsua/wikidata-parallel-descriptions-en-ja
language:
- ja
- en
metrics:
- bleu
- chrf
library_name: transformers
pipeline_tag: translation
---

# ElanMT

[**ElanMT-BT-en-ja**](https://huggingface.co/Mitsua/elan-mt-bt-en-ja) is an English to Japanese translation model developed by the [ELAN MITSUA Project](https://elanmitsua.com/en/) / Abstract Engine.

- [**ElanMT-base-en-ja**](https://huggingface.co/Mitsua/elan-mt-base-en-ja) and [**ElanMT-base-ja-en**](https://huggingface.co/Mitsua/elan-mt-base-ja-en) are trained from scratch, exclusively on openly licensed corpora such as CC0, CC BY, and CC BY-SA.
- This model is a fine-tuned checkpoint of **ElanMT-base-en-ja**, trained exclusively on openly licensed data and on Wikipedia data back-translated with **ElanMT-base-ja-en**.
- Web-crawled and machine-translated corpora are **not** used at any point in the training procedure for the **ElanMT** models.

Despite the relatively low-resource training, thanks to back-translation and [a newly built CC0 corpus](https://huggingface.co/datasets/Mitsua/wikidata-parallel-descriptions-en-ja), the model achieves performance comparable to currently available open translation models.

## Model Details

This is a translation model based on the [Marian MT](https://marian-nmt.github.io/) 6-layer encoder-decoder transformer architecture with a SentencePiece tokenizer.

- **Developed by**: [ELAN MITSUA Project](https://elanmitsua.com/en/) / Abstract Engine
- **Model type**: Translation
- **Source Language**: English
- **Target Language**: Japanese
- **License**: [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
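
The architecture details above can be checked directly from the published configuration. The snippet below is a minimal sketch using the generic `transformers` config API; the printed attribute names are those of the standard Marian configuration.

```python
from transformers import AutoConfig

# Inspect the Marian configuration of the published checkpoint.
cfg = AutoConfig.from_pretrained('Mitsua/elan-mt-bt-en-ja')
print(cfg.model_type, cfg.encoder_layers, cfg.decoder_layers, cfg.vocab_size)
```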

## Usage

1. Install the Python packages

`pip install transformers accelerate sentencepiece`

* This model is verified on `transformers==4.40.2`

2. Run

```python
from transformers import pipeline
translator = pipeline('translation', model='Mitsua/elan-mt-bt-en-ja')
translator('Hello. I am an AI.')
```

3. For longer text with multiple sentences, using [pySBD](https://github.com/nipunsadvilkar/pySBD) is recommended.

`pip install transformers accelerate sentencepiece pysbd`

```python
import pysbd
seg_en = pysbd.Segmenter(language="en", clean=False)
txt = 'Hello. I am an AI. How are you doing?'
# `translator` is the pipeline created in step 2.
print(translator(seg_en.segment(txt)))
```

This idea comes from the [FuguMT](https://huggingface.co/staka/fugumt-en-ja) repo.
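
For finer control over decoding (for example beam width or output length), the checkpoint can also be used through the standard `transformers` sequence-to-sequence API. The snippet below is a minimal sketch; the generation settings shown are illustrative defaults, not values recommended by the project.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Mitsua/elan-mt-bt-en-ja')
model = AutoModelForSeq2SeqLM.from_pretrained('Mitsua/elan-mt-bt-en-ja')

inputs = tokenizer('Hello. I am an AI.', return_tensors='pt')
# num_beams and max_new_tokens are illustrative, not project-recommended values.
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```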

## Training Data

We heavily referred to the [FuguMT author's blog post](https://staka.jp/wordpress/?p=413) for dataset collection.

- [Mitsua/wikidata-parallel-descriptions-en-ja](https://huggingface.co/datasets/Mitsua/wikidata-parallel-descriptions-en-ja) (CC0 1.0)
  - We newly built this 1.5M-line Wikidata parallel corpus to augment the training data (see the loading sketch below). This greatly improved word-level vocabulary coverage.
- [The Kyoto Free Translation Task (KFTT)](https://www.phontron.com/kftt/) (CC BY-SA 3.0)
  - Graham Neubig, "The Kyoto Free Translation Task," http://www.phontron.com/kftt, 2011.
- [Tatoeba](https://tatoeba.org/en/downloads) (CC BY 2.0 FR / CC0 1.0)
  - https://tatoeba.org/
- [wikipedia-interlanguage-titles](https://github.com/bhaddow/wikipedia-interlanguage-titles) (The MIT License / CC BY-SA 4.0)
  - We built parallel titles based on the 2024-05-06 Wikipedia dump.
- [WikiMatrix](https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix) (CC BY-SA 4.0)
  - Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Francisco Guzmán, "WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia"
- [MDN Web Docs](https://github.com/mdn/translated-content) (The MIT License / CC0 1.0 / CC BY-SA 2.5)
  - https://github.com/mdn/translated-content
- [Wikimedia contenttranslation dump](https://dumps.wikimedia.org/other/contenttranslation/) (CC BY-SA 4.0)
  - The 2024-05-10 dump is used.

*Even if a dataset itself is CC-licensed, we did not use it if the corpus it contains is based on web crawling, on unauthorized use of copyrighted works, or on the machine translation output of other translation models.
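
As an illustration, the newly built Wikidata corpus can be loaded with the `datasets` library. This is a minimal sketch; the split name and record layout are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Split name and field layout are assumptions; see the dataset card for details.
ds = load_dataset('Mitsua/wikidata-parallel-descriptions-en-ja', split='train')
print(ds[0])  # inspect one English-Japanese description pair
```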

## Training Procedure

We heavily referred to "[Beating Edinburgh's WMT2017 system for en-de with Marian's Transformer model](https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-transformer)" for the training process and hyperparameter tuning.

1. Train a SentencePiece tokenizer with a 32k vocabulary on 4M lines of openly licensed corpus.
2. Train a `ja-en` back-translation model on 4M lines of openly licensed corpus for 6 epochs. = **ElanMT-base-ja-en**
3. Train an `en-ja` base translation model on 4M lines of openly licensed corpus for 6 epochs. = **ElanMT-base-en-ja**
4. Translate 20M lines of `ja` Wikipedia into `en` using the back-translation model (a minimal sketch of this step follows the list).
5. Train 4 `en-ja` models, each fine-tuned from the **ElanMT-base-en-ja** checkpoint, on 24M lines of training data augmented with the back-translated data, for 6 epochs.
6. Merge the 4 trained models that produce the best validation scores on the FLORES+ dev split.
7. Fine-tune the merged model on a 1M-line high-quality corpus subset for 5 epochs.
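
A minimal sketch of the back-translation step (step 4), assuming the `ja-en` model is run through the standard `transformers` pipeline; the batch size and placeholder sentences are illustrative, not the project's actual setup.

```python
from transformers import pipeline

# Translate monolingual Japanese sentences to English with ElanMT-base-ja-en
# to build synthetic (en, ja) training pairs.
bt = pipeline('translation', model='Mitsua/elan-mt-base-ja-en')

ja_sentences = ['これは逆翻訳の例文です。']  # placeholder monolingual data
synthetic_pairs = [
    (out['translation_text'], ja)  # (synthetic en source, original ja target)
    for out, ja in zip(bt(ja_sentences, batch_size=32), ja_sentences)
]
```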

## Evaluation

### Dataset

- The [FLORES+](https://github.com/openlanguagedata/flores) (CC BY-SA 4.0) devtest split is used for evaluation.
- [NTREX](https://github.com/MicrosoftTranslator/NTREX) (CC BY-SA 4.0)

### Result

| **Model** | **Params** | **FLORES+ BLEU** | **FLORES+ chrF** | **NTREX BLEU** | **NTREX chrF** |
|:---|---:|---:|---:|---:|---:|
| [**ElanMT-BT**](https://huggingface.co/Mitsua/elan-mt-bt-en-ja) | 61M | 29.96 | **38.43** | **25.63** | **35.41** |
| [**ElanMT-base**](https://huggingface.co/Mitsua/elan-mt-base-en-ja) **w/o back-translation** | 61M | 26.55 | 35.28 | 23.04 | 32.94 |
| [**ElanMT-tiny**](https://huggingface.co/Mitsua/elan-mt-tiny-en-ja) | 15M | 25.93 | 34.69 | 22.78 | 33.00 |
| [staka/fugumt-en-ja](https://huggingface.co/staka/fugumt-en-ja) (*1) | 61M | **30.89** | 38.38 | 24.74 | 34.23 |
| [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) | 610M | 26.31 | 34.37 | 23.35 | 32.66 |
| [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 615M | 17.09 | 27.32 | 14.92 | 26.26 |
| [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3B | 20.04 | 30.33 | 17.07 | 28.46 |
| [google/madlad400-3b-mt](https://huggingface.co/google/madlad400-3b-mt) | 3B | 24.62 | 33.89 | 23.64 | 33.48 |
| [google/madlad400-7b-mt](https://huggingface.co/google/madlad400-7b-mt) | 7B | 25.57 | 34.59 | 24.60 | 34.43 |

- *1 Tested on `transformers==4.29.2` with `num_beams=4`.
- *2 BLEU scores are calculated by `sacreBLEU` with `tokenize=ja-mecab`.
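
As a rough illustration of how such scores can be reproduced, the snippet below uses `sacrebleu` with the Japanese MeCab tokenizer; the file names are placeholders, and the `ja-mecab` tokenizer may require the Japanese extra (`pip install sacrebleu[ja]`).

```python
import sacrebleu

# Placeholder file names: one sentence per line, hypotheses aligned with references.
hyps = open('hypotheses.ja').read().splitlines()
refs = open('references.ja').read().splitlines()

bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize='ja-mecab')
chrf = sacrebleu.corpus_chrf(hyps, [refs])
print(bleu.score, chrf.score)
```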

## Disclaimer

- The translated results may be very incorrect, harmful, or biased. The model was developed to investigate the performance achievable with only a relatively small, licensed corpus and is not suitable for use cases requiring high translation accuracy. Under Section 5 of the CC BY-SA 4.0 License, ELAN MITSUA Project / Abstract Engine is not responsible for any direct or indirect loss caused by the use of the model.
- 免責事項:翻訳結果は不正確で、有害であったりバイアスがかかっている可能性があります。本モデルは比較的小規模でライセンスされたコーパスのみで達成可能な性能を調査するために開発されたモデルであり、翻訳の正確性が必要なユースケースでの使用には適していません。絵藍ミツアプロジェクト及び株式会社アブストラクトエンジンはCC BY-SA 4.0ライセンス第5条に基づき、本モデルの使用によって生じた直接的または間接的な損失に対して、一切の責任を負いません。