---
title: ko-translation-leaderbaord
app_file: leaderboard.py
sdk: gradio
sdk_version: 3.50.2
---
# Iris Translation
![iris-icon.jpeg](assets%2Firis-icon.jpeg)
Welcome to Iris Translation, a project designed to evaluate Korean-to-English translation models. Our project provides a comprehensive framework for evaluating the Iris model that we have developed.
## Models
These are the models used to compare translation quality. All of them can be run and their results inspected.
- [davidkim205/iris-7b](https://huggingface.co/davidkim205/iris-7b)
- [squarelike/Gugugo-koen-7B-V1.1](https://huggingface.co/squarelike/Gugugo-koen-7B-V1.1)
- [maywell/Synatra-7B-v0.3-Translation](https://huggingface.co/maywell/Synatra-7B-v0.3-Translation)
- [Unbabel/TowerInstruct-7B-v0.1](https://huggingface.co/Unbabel/TowerInstruct-7B-v0.1)
- [jbochi/madlad400-10b-mt](https://huggingface.co/jbochi/madlad400-10b-mt)
- [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt)
- [facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)
## Installation
```
conda create -n translation python=3.10
conda activate translation
pip install -r requirements.txt
```
## Usage
The default input file is `./data/komt-1810k-test.jsonl`. The following is an example of the data's JSON schema.
```json
{
    "conversations":[
        {
            "from":"human",
            "value":"다음 문장을 한글로 번역하세요.\nLet's make a graph here showing different levels of interest in activities."
        },
        {
            "from":"gpt",
            "value":"활동에 대한 다양한 수준의 관심을 보여주는 그래프를 만들어 보겠습니다."
        }
    ],
    "src":"aihub-MTPE"
}
```
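As a rough illustration of how a record in this schema can be read, the sketch below splits the `human` turn into the instruction and the source sentence and takes the `gpt` turn as the reference. The function names `split_record` and `load_records` are hypothetical and are not part of this repository; the split on the first newline is an assumption based on the example above.

```python
import json


def split_record(line: str) -> dict:
    """Parse one JSONL line into instruction, source text, and reference."""
    record = json.loads(line)
    turns = {turn["from"]: turn["value"] for turn in record["conversations"]}
    # Assumption: the human turn is "<instruction>\n<source sentence>",
    # as in the schema example above.
    instruction, _, source = turns["human"].partition("\n")
    return {
        "instruction": instruction,
        "source": source,
        "reference": turns["gpt"],
        "src": record.get("src"),
    }


def load_records(path: str) -> list[dict]:
    """Read every non-empty line of a JSONL file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [split_record(line) for line in f if line.strip()]
```

For example, `load_records("./data/komt-1810k-test.jsonl")` would return one dictionary per test sample.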
### translate(Bleu)
๋ชจ๋ธ์„ ์‚ฌ์šฉํ•œ ๋ฒˆ์—ญ ๊ฒฐ๊ณผ์™€ ์‹ค์ œ ๋ฒˆ์—ญ ๊ฒฐ๊ณผ๋ฅผ ๋น„๊ตํ•˜์—ฌ bleu score๋ฅผ ๊ตฌํ•ฉ๋‹ˆ๋‹ค.
```
python translation.py --model davidkim205/iris-7b
```
The results are written to `results_bleu/iris-7b-result.jsonl`.
JSON schema example
- reference: the ground-truth reference translation
- generation: the translation generated by the model
```json
{
    "index":0,
    "reference":"활동에 대한 다양한 수준의 관심을 보여주는 그래프를 만들어 보겠습니다.",
    "generation":"여기서 활동에 대한 다양한 수준의 관심을 보여주는 그래프를 만들어 보겠습니다.",
    "bleu":0.917,
    "lang":"en",
    "model":"davidkim205/iris-7b",
    "src":"aihub-MTPE",
    "conversations":[
        {
            "from":"human",
            "value":"다음 문장을 한글로 번역하세요.\nLet's make a graph here showing different levels of interest in activities."
        },
        {
            "from":"gpt",
            "value":"활동에 대한 다양한 수준의 관심을 보여주는 그래프를 만들어 보겠습니다."
        }
    ]
}
```
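To make the `bleu` field more concrete, here is a minimal sketch of a sentence-level BLEU score. It is a simplified stand-in, not the scoring code used by `translation.py`: whitespace tokenization, add-one smoothing, and the function names are all assumptions, so its values will generally differ from the scores reported in this README.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference: str, generation: str, max_n: int = 4) -> float:
    """Simplified, smoothed BLEU on whitespace tokens (illustrative only)."""
    ref, hyp = reference.split(), generation.split()
    if not ref or not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps short sentences from collapsing to zero.
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: penalize generations shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)
```

An identical pair scores 1.0 and a pair with no shared words scores close to 0; everything else falls in between.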
### translate_self(SBleu)
๋ชจ๋ธ ๋ฒˆ์—ญ ๊ฒฐ๊ณผ๋ฅผ ๋‹ค์‹œ ๋ฒˆ์—ญํ•˜์—ฌ ์›๋ฌธ๊ณผ์˜ bleu score๋ฅผ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค.
```
python translation_self.py --model davidkim205/iris-7b
```
The results are written to `results_self/iris-7b-result.jsonl`.
JSON schema example
- reference: the original text
- generation: the model's back-translation
- generation1: the model's first translation
```json
{
    "index":0,
    "reference":"Let's make a graph here showing different levels of interest in activities.",
    "generation":"let's create a graph that shows different levels of interest in activities here",
    "generation1":"여기서 활동에 대한 다양한 수준의 관심을 보여주는 그래프를 만들어 보겠습니다.",
    "bleu":0.49,
    "lang":"en",
    "model":"davidkim205/iris-7b",
    "src":"aihub-MTPE",
    "conversations":[
        {
            "from":"human",
            "value":"다음 문장을 한글로 번역하세요.\nLet's make a graph here showing different levels of interest in activities."
        },
        {
            "from":"gpt",
            "value":"활동에 대한 다양한 수준의 관심을 보여주는 그래프를 만들어 보겠습니다."
        }
    ]
}
```
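The round trip behind the `reference`, `generation1`, and `generation` fields can be sketched as follows. The `round_trip` function and its `translate` and `score` parameters are hypothetical placeholders for a model call and a BLEU function; they are not the interface of `translation_self.py`.

```python
from typing import Callable


def round_trip(
    source: str,
    translate: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> dict:
    """Translate English -> Korean -> English and score the result against the source."""
    generation1 = translate(source, "ko")      # first pass: model translation
    generation = translate(generation1, "en")  # second pass: back-translation
    return {
        "reference": source,         # the original text is the reference
        "generation": generation,    # back-translated text that gets scored
        "generation1": generation1,  # intermediate Korean translation
        "bleu": score(source, generation),
    }
```

Because the score compares English with English, this check needs no human reference translation, at the cost of also measuring the quality of the second translation pass.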
### translate2(Bleu and SBleu)
Runs both translate and translate_self, so that BLEU and SBLEU can be compared together.
```
python translation2.py --model davidkim205/iris-7b
```
- Runs translate and saves the output to `results_bleu/iris-7b-result.jsonl`
- Runs translate_self and saves the output to `results_self/iris-7b-result.jsonl`

Each file has the same contents as the corresponding file generated above.
## Evaluation
Translation results are verified in two ways.
1. Evaluation by comparing the reference translation with the model translation
```
python evaluate.py results_bleu/
```
output
```
bleu scores
result_bleu-nllb200.jsonl: 0.26, out_of_range_count=3, duplicate=1
result_bleu-madlad400.jsonl: 0.29, out_of_range_count=6, duplicate=3
result_bleu-TowerInstruct.jsonl: 0.32, out_of_range_count=9, duplicate=1
result_bleu-gugugo.jsonl: 0.32, out_of_range_count=3, duplicate=1
result_bleu-Synatra-7B-v0.3-Translation.jsonl: 0.35, out_of_range_count=2, duplicate=1
result_bleu-deepl.jsonl: 0.39, out_of_range_count=1, duplicate=0
result_bleu-azure.jsonl: 0.40, out_of_range_count=2, duplicate=0
result_bleu-google.jsonl: 0.40, out_of_range_count=3, duplicate=0
result_bleu-papago.jsonl: 0.43, out_of_range_count=3, duplicate=0
result_bleu-iris_7b.jsonl: 0.40, out_of_range_count=3, duplicate=0
```
2. Evaluation by comparing the original text with its round-trip translation (English -> Korean -> English)
```
python evaluate.py results_self/
```
output
```
bleu scores
result_self-nllb200.jsonl: 0.30, out_of_range_count=1, duplicate=1
result_self-gugugo.jsonl: 0.36, out_of_range_count=1, duplicate=1
result_self-madlad400.jsonl: 0.38, out_of_range_count=3, duplicate=2
result_self-TowerInstruct.jsonl: 0.39, out_of_range_count=3, duplicate=0
result_self-Synatra-7B-v0.3-Translation.jsonl: 0.41, out_of_range_count=2, duplicate=1
result_self-deepl.jsonl: 0.45, out_of_range_count=0, duplicate=0
result_self-papago.jsonl: 0.49, out_of_range_count=0, duplicate=0
result_self-azure.jsonl: 0.49, out_of_range_count=0, duplicate=1
result_self-google.jsonl: 0.49, out_of_range_count=0, duplicate=0
result_self-papago.jsonl: 0.51, out_of_range_count=0, duplicate=0
result_self-iris_7b.jsonl: 0.43, out_of_range_count=1, duplicate=0
```
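**Evaluation metrics**
- BLEU: the average BLEU score between the reference translation and the model translation
- SBLEU: the average BLEU score between the original text and its back-translation
- Duplicate: the number of cases in which the translation contains duplicated text
- Length Exceeds: the number of cases in which the length of the model translation does not match the reference translation (criterion: 0.2 < length < 2)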
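The metrics above could be aggregated from a list of result records roughly as follows. This is a sketch under stated assumptions, not the logic of `evaluate.py`: it assumes that `length` in the criterion means the character-length ratio of generation to reference, and the duplicate check (the same four-word phrase occurring at least three times) is a hypothetical heuristic. All function names are illustrative.

```python
from collections import Counter


def length_out_of_range(reference: str, generation: str,
                        low: float = 0.2, high: float = 2.0) -> bool:
    """Assumption: 'length' is the generation/reference character-length ratio."""
    if not reference:
        return True
    ratio = len(generation) / len(reference)
    return not (low < ratio < high)


def has_duplicate(text: str, n: int = 4, min_repeats: int = 3) -> bool:
    """Hypothetical heuristic: some n-word phrase occurs min_repeats times or more."""
    tokens = text.split()
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(c >= min_repeats for c in counts.values())


def summarize(records: list[dict]) -> dict:
    """Average BLEU plus counts of length violations and duplicated output."""
    n = len(records)
    return {
        "bleu": round(sum(r["bleu"] for r in records) / n, 2) if n else 0.0,
        "out_of_range_count": sum(
            length_out_of_range(r["reference"], r["generation"]) for r in records
        ),
        "duplicate": sum(has_duplicate(r["generation"]) for r in records),
    }
```

The returned keys mirror the fields printed in the evaluation output above (`out_of_range_count`, `duplicate`).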
### BLEU
These are the evaluation results for each model. The results for iris-7b are summarized below.
- Higher BLEU and SBLEU than the other HuggingFace models compared (BLEU 0.40 versus at most 0.35; SBLEU 0.43 versus at most 0.41)
- BLEU on par with the cloud translation services on average (0.40 versus 0.39 to 0.43)
- Duplicate generation and length violations at the same level as the cloud translation services
![plot-bleu.png](assets%2Fplot-bleu.png)
Duplicate (duplicated sentence generation) and Length Exceeds (length out of range) are taken from the translation (BLEU) run.
| TYPE | Model | BLEU | SBLEU | Duplicate | Length Exceeds |
| ----------- | :---------------------------------- | ---- | ----- | --------- | -------------- |
| HuggingFace | facebook/nllb-200-distilled-1.3B | 0.26 | 0.30 | 1 | 3 |
| HuggingFace | jbochi/madlad400-10b-mt | 0.29 | 0.38 | 3 | 6 |
| HuggingFace | Unbabel/TowerInstruct-7B-v0.1 | 0.32 | 0.39 | 1 | 9 |
| HuggingFace | squarelike/Gugugo-koen-7B-V1.1 | 0.32 | 0.36 | 1 | 3 |
| HuggingFace | maywell/Synatra-7B-v0.3-Translation | 0.35 | 0.41 | 1 | 2 |
| Cloud | deepl | 0.39 | 0.45 | 0 | 1 |
| Cloud | azure | 0.40 | 0.49 | 0 | 2 |
| Cloud | google | 0.40 | 0.49 | 0 | 3 |
| Cloud | papago | 0.43 | 0.51 | 0 | 3 |
| HuggingFace | davidkim205/iris-7b (**ours**) | 0.40 | 0.43 | 0 | 3 |
* SBLEU: Self-evaluation BLEU
### BLEU by source
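These are the results of evaluating translation quality on the test dataset by domain. The results for iris-7b are summarized below.
- Higher BLEU than the other HuggingFace translation models in most domains (14 of 18; the exceptions are korean-parallel-corpora, parallel-translation, speechtype-based-machine-translation, and koopus100)
- Similar to or better than the cloud translation services in many domains
- Particularly strong translation quality in the science domains and the neologism (colloquial) domain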
![plot-bleu-by-src.png](assets%2Fplot-bleu-by-src.png)
| Type | Model | Average | MTPE | techsci2 | expertise | humanities | sharegpt-deepl-ko-translation | MT-new-corpus | socialsci | korean-parallel-corpora | parallel-translation | food | techsci | para_pat | speechtype-based-machine-translation | koopus100 | basicsci | broadcast-content | patent | colloquial |
| ----------- | :---------------------------------- | ------- | ---: | -------: | --------: | ---------: | ----------------------------: | ------------: | --------: | ----------------------: | -------------------: | ---: | ------: | -------: | -----------------------------------: | --------: | -------: | ----------------: | -----: | ---------: |
| HuggingFace | facebook/nllb-200-distilled-1.3B | 0.26 | 0.44 | 0.28 | 0.16 | 0.23 | 0.44 | 0.34 | 0.27 | 0.10 | 0.23 | 0.37 | 0.28 | 0.19 | 0.29 | 0.23 | 0.15 | 0.33 | 0.09 | 0.29 |
| HuggingFace | jbochi/madlad400-10b-mt | 0.29 | 0.45 | 0.29 | 0.20 | 0.29 | 0.40 | 0.36 | 0.39 | 0.12 | 0.22 | 0.46 | 0.30 | 0.23 | 0.48 | 0.23 | 0.19 | 0.36 | 0.01 | 0.33 |
| HuggingFace | Unbabel/TowerInstruct-7B-v0.1 | 0.32 | 0.46 | 0.33 | 0.28 | 0.27 | 0.30 | 0.39 | 0.37 | 0.14 | 0.35 | 0.47 | 0.39 | 0.29 | 0.41 | 0.21 | 0.22 | 0.36 | 0.15 | 0.33 |
| HuggingFace | squarelike/Gugugo-koen-7B-V1.1 | 0.32 | 0.46 | 0.27 | 0.28 | 0.22 | 0.66 | 0.33 | 0.36 | 0.10 | 0.29 | 0.45 | 0.34 | 0.24 | 0.42 | 0.22 | 0.23 | 0.42 | 0.20 | 0.26 |
| HuggingFace | maywell/Synatra-7B-v0.3-Translation | 0.35 | 0.43 | 0.36 | 0.27 | 0.23 | 0.70 | 0.37 | 0.31 | 0.13 | 0.34 | 0.52 | 0.35 | 0.29 | 0.44 | 0.21 | 0.24 | 0.46 | 0.28 | 0.37 |
| Cloud | deepl | 0.39 | 0.59 | 0.33 | 0.31 | 0.32 | 0.70 | 0.48 | 0.38 | 0.14 | 0.38 | 0.55 | 0.41 | 0.33 | 0.48 | 0.24 | 0.28 | 0.42 | 0.37 | 0.36 |
| Cloud | azure | 0.40 | 0.57 | 0.36 | 0.35 | 0.29 | 0.63 | 0.46 | 0.39 | 0.16 | 0.38 | 0.56 | 0.39 | 0.33 | 0.54 | 0.22 | 0.29 | 0.52 | 0.35 | 0.41 |
| Cloud | google | 0.40 | 0.62 | 0.39 | 0.32 | 0.32 | 0.60 | 0.45 | 0.45 | 0.14 | 0.38 | 0.59 | 0.43 | 0.34 | 0.45 | 0.22 | 0.28 | 0.47 | 0.39 | 0.36 |
| Cloud | papago | 0.43 | 0.56 | 0.43 | 0.41 | 0.30 | 0.55 | 0.58 | 0.56 | 0.16 | 0.37 | 0.67 | 0.52 | 0.35 | 0.53 | 0.21 | 0.35 | 0.45 | 0.37 | 0.46 |
| HuggingFace | davidkim205/iris-7b (**ours**) | 0.40 | 0.49 | 0.37 | 0.34 | 0.31 | 0.72 | 0.48 | 0.43 | 0.11 | 0.33 | 0.56 | 0.46 | 0.34 | 0.43 | 0.20 | 0.30 | 0.47 | 0.41 | 0.40 |
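A per-source breakdown like the table above can be derived from the result records by grouping on the `src` field. The sketch below is illustrative only: `bleu_by_source` is a hypothetical name, and stripping the `aihub-` prefix is an assumption made to match the column names in the table.

```python
from collections import defaultdict


def bleu_by_source(records: list[dict]) -> dict[str, float]:
    """Average the per-sample BLEU scores for each source dataset."""
    scores: dict[str, list[float]] = defaultdict(list)
    for record in records:
        # Assumption: table columns are the src names without the "aihub-" prefix.
        column = record["src"].removeprefix("aihub-")
        scores[column].append(record["bleu"])
    return {column: sum(values) / len(values) for column, values in scores.items()}
```

With 10 samples per source, each per-source average rests on very little data, so small differences between models in a single column should be read with caution.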
### BLEU by sentence length
These are the average scores obtained by sampling 50 items from each of four text-length ranges and translating them. The datasets used for this evaluation are as follows.
- `data/komt-dataset-100.jsonl`
- `data/komt-dataset-500.jsonl`
- `data/komt-dataset-1000.jsonl`
- `data/komt-dataset-1500.jsonl`
The translations and BLEU scores are stored under `results_length/`.
Notably, iris-7b scores higher than most of the cloud translation services in every length range.
- ~100: (0, 100]
- ~500: (100, 500]
- ~1000: (500, 1000]
- ~1500: (1000, 1500]
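The half-open ranges above can be expressed as a small lookup. This is a sketch, assuming that length is counted in characters (the unit is not stated here); `length_bucket` and the constant names are hypothetical.

```python
import bisect
from typing import Optional

# Right-closed upper edges of the four ranges listed above.
BUCKET_EDGES = [100, 500, 1000, 1500]
BUCKET_LABELS = ["~100", "~500", "~1000", "~1500"]


def length_bucket(text: str) -> Optional[str]:
    """Return the label of the range (low, high] that contains len(text)."""
    length = len(text)  # assumption: length is measured in characters
    if length == 0:
        return None
    # bisect_left puts a length equal to an edge into that edge's bucket,
    # which matches the right-closed intervals.
    index = bisect.bisect_left(BUCKET_EDGES, length)
    return BUCKET_LABELS[index] if index < len(BUCKET_LABELS) else None
```

Texts longer than 1500 characters fall outside all four ranges and return `None`.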
![plot-bleu-by-sentence-length.png](assets%2Fplot-bleu-by-sentence-length.png)
| Type | Model | Average | ~100(50) | ~500(50) | ~1000(50) | ~1500(50) |
| ----------- | :---------------------------------- | ------- | -------: | -------: | --------: | --------: |
| HuggingFace | facebook/nllb-200-distilled-1.3B | 0.24 | 0.31 | 0.31 | 0.22 | 0.13 |
| HuggingFace | jbochi/madlad400-10b-mt | 0.22 | 0.35 | 0.37 | 0.08 | 0.10 |
| HuggingFace | Unbabel/TowerInstruct-7B-v0.1 | 0.32 | 0.41 | 0.31 | 0.24 | 0.32 |
| HuggingFace | squarelike/Gugugo-koen-7B-V1.1 | 0.45 | 0.37 | 0.48 | 0.52 | 0.43 |
| HuggingFace | maywell/Synatra-7B-v0.3-Translation | 0.50 | 0.41 | 0.57 | 0.57 | 0.51 |
| Cloud | deepl | 0.53 | 0.44 | 0.56 | 0.64 | 0.50 |
| Cloud | azure | 0.47 | 0.46 | 0.47 | 0.52 | 0.44 |
| Cloud | google | 0.51 | 0.50 | 0.49 | 0.54 | 0.51 |
| Cloud | papago | 0.46 | 0.50 | 0.46 | 0.43 | 0.45 |
| HuggingFace | davidkim205/iris-7b (**ours**) | 0.56 | 0.51 | 0.58 | 0.62 | 0.54 |
## Test dataset info
The test dataset consists of 10 samples from each of 18 domains, 180 samples in total.
The `koopus100` dataset is of low quality: its sentences are short, and some pairs have translations that do not match the original text.
```
text: All right
translation: 별로 그럴 기분 아니야 - I'm not in the mood.
text: Do you have a fever?
translation: 뭐라고 했어?
```
The `korean-parallel-corpora` dataset is of low quality: its translations mix Korean and English or are simply wrong.
```
text: S. Korea mulls missile defense system ํ•œ๊ตญ, ์ž์ฒด์  ๋ฏธ์‚ฌ์ผ ๋ฐฉ์–ด์ฒด๊ณ„ ์ˆ˜๋ฆฝ ๊ฒ€ํ† ย  ย  ย 2007.03
translation: South Korea maintains a mandatory draft system under which all able-bodied men over 20 must serve in the military for 24 to 27 months.
text: A United States intelligence agency has been collecting data on the phone calls of tens of millions of Americans, a report in USA Today has alleged.
translation: NSA collects Americansโ€™phone clall data๋ฏธ ๊ตญ๊ฐ€์•ˆ๋ณด๊ตญ, ๋ฏธ๊ตญ๋ฏผ ํ†ตํ™” ๋‚ด์šฉ ์ˆ˜์ง‘2006.07
text: I see the guy as more like John Wayne, which is to say I don't like his politics but he's endearing in a strange, goofy, awkward way, and he did capture the imagination of the country,\" he said.
translation: ๋ฒ ํŠธ๋‚จ์ „์— ์ฐธ์ „ํ–ˆ๋˜ ์Šคํ†ค ๊ฐ๋…์€ ๋น„ํŒ์ ์œผ๋กœ ํ˜ธํ‰์„ ๋ฐ›๊ณ  ์ •์น˜์ ์ธ ์„ฑํ–ฅ์ด ๋งŽ์€ ์˜ํ™”๋ฅผ ์ œ์ž‘ํ•œ ๊ฒƒ์œผ๋กœ ์œ ๋ช…ํ•˜๋‹ค.
text: The Sahara is advancing into Ghana and Nigeria at the rate of 3,510 square kilometers per year.
translation: ์นด์žํ์Šคํƒ„ ๋˜ํ•œ ์‚ฌ๋ง‰ํ™”๋กœ ์ธํ•ด 1980๋…„ ์ดํ›„ ๋†๊ฒฝ์ง€์˜ 50%๊ฐ€ ์‚ฌ๋ผ์กŒ์œผ๋ฉฐ ์‚ฌํ•˜๋ผ ์‚ฌ๋ง‰์€ ๋งค๋…„ 3510ใŽข์”ฉ ์ปค์ ธ๊ฐ€๋ฉฐ ๊ฐ€๋‚˜์™€ ๋‚˜์ด์ง€๋ฆฌ์•„๋ฅผ ์œ„ํ˜‘ํ•˜๊ณ  ์žˆ๋‹ค.
```
The table below lists each src with its share of the test set (10 of 180 samples each) and a description.
| src | ratio | description |
| ------------------------------------------ | ----- | ------------------------------------------------------------ |
| aihub-MTPE | 5.56% | Machine translation quality post-verification dataset |
| aihub-techsci2 | 5.56% | Korean-English translation dataset for technical sciences such as ICT and electrical/electronic engineering |
| aihub-expertise | 5.56% | Korean-English translation dataset for specialized fields such as medicine, finance, and sports |
| aihub-humanities | 5.56% | Korean-English translation dataset for the humanities |
| sharegpt-deepl-ko-translation | 5.56% | shareGPT dataset converted from question-answer format to Korean-English translation format |
| aihub-MT-new-corpus | 5.56% | Korean-English translation dataset for building machine translation apps |
| aihub-socialsci | 5.56% | Korean-English translation dataset for social sciences such as law, education, and economics |
| korean-parallel-corpora | 5.56% | Korean-English parallel translation dataset |
| aihub-parallel-translation | 5.56% | Korean-English translation dataset by speech type and domain |
| aihub-food | 5.56% | English-Korean translation dataset for the food domain |
| aihub-techsci | 5.56% | Korean-English translation dataset for technical sciences such as ICT and electrical/electronic engineering |
| para_pat | 5.56% | English-Korean subset of the ParaPat dataset |
| aihub-speechtype-based-machine-translation | 5.56% | English-Korean translation dataset by speech type |
| koopus100 | 5.56% | English-Korean subset of the OPUS-100 dataset |
| aihub-basicsci | 5.56% | Korean-English translation dataset for basic sciences such as mathematics and physics |
| aihub-broadcast-content | 5.56% | Korean-English translation dataset for broadcast content |
| aihub-patent | 5.56% | English-Korean translation dataset of patent specifications |
| aihub-colloquial | 5.56% | Colloquial Korean-English translation dataset including neologisms and abbreviations |