metadata
title: ko-translation-leaderbaord
app_file: leaderboard.py
sdk: gradio
sdk_version: 3.50.2

Iris Translation

iris-icon.jpeg

Welcome to Iris Translation, a project designed to evaluate Korean-to-English translation models. Our project provides a comprehensive framework for evaluating the Iris model that we have developed.

Models

These are the models used to compare translation quality. All of them can be run, and their results can be inspected.

Installation

conda create -n translation python=3.10
conda activate translation

pip install -r requirements.txt

Usage

The default input file is ./data/komt-1810k-test.jsonl. The following is an example of the JSON schema of the data.

{
    "conversations":[
        {
            "from":"human",
            "value":"다음 문장을 한글로 번역하세요.\nLet's make a graph here showing different levels of interest in activities."
        },
        {
            "from":"gpt",
            "value":"활동에 대한 다양한 수준의 관심을 보여주는 그래프를 만들어 보겠습니다."
        }
    ],
    "src":"aihub-MTPE"
}
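A minimal sketch of reading this schema, assuming each line of the input file is one JSON object and that the human turn always has the form "instruction, newline, sentence" as in the example above. The function name `iter_pairs` is hypothetical and is not part of the project's scripts.

```python
import json

def iter_pairs(lines):
    """Yield (sentence, reference, src) from JSONL lines in the schema above."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        human, gpt = record["conversations"][:2]
        # The human turn is "<instruction>\n<sentence>"; keep only the sentence.
        _, sep, sentence = human["value"].partition("\n")
        if not sep:
            sentence = human["value"]  # no instruction prefix found
        yield sentence, gpt["value"], record.get("src")
```

Because a file object is an iterable of lines, `iter_pairs(open("./data/komt-1810k-test.jsonl", encoding="utf-8"))` would walk the default input file.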

translate(Bleu)

Compares the model's translation with the reference translation and computes the BLEU score.

python translation.py --model davidkim205/iris-7b

The result file is written to results_bleu/iris-7b-result.jsonl.

JSON schema example

  • reference: the ground-truth reference translation
  • generation: the translation generated by the model
{
    "index":0,
    "reference":"활동에 대한 다양한 수준의 관심을 보여주는 그래프를 만들어 보겠습니다.",
    "generation":"여기서 활동에 대한 다양한 수준의 관심을 보여주는 그래프를 만들어 보겠습니다.",
    "bleu":0.917,
    "lang":"en",
    "model":"davidkim205/iris-7b",
    "src":"aihub-MTPE",
    "conversations":[
        {
            "from":"human",
            "value":"다음 문장을 한글로 번역하세요.\nLet's make a graph here showing different levels of interest in activities."
        },
        {
            "from":"gpt",
            "value":"활동에 대한 다양한 수준의 관심을 보여주는 그래프를 만들어 보겠습니다."
        }
    ]
}
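The bleu field in the record above is a per-sentence score between reference and generation. The source does not say which BLEU implementation or tokenizer translation.py uses, so the following is only a pure-Python sketch of sentence-level BLEU under stated assumptions: whitespace tokenization, n-grams up to 4, and add-one smoothing on every order. Its scores will generally differ from the values reported in this document.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, generation, max_n=4):
    """Smoothed sentence-level BLEU on whitespace tokens, in [0, 1]."""
    ref, hyp = reference.split(), generation.split()
    if not ref or not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        # Add-one smoothing (an assumption) avoids log(0) on short sentences.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty for generations shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Whitespace tokenization is a coarse choice for Korean; a morpheme-level or character-level tokenizer would give different scores.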

translate_self(SBleu)

Translates the model's translation back again and compares the result with the original text using the BLEU score.

python translation_self.py --model davidkim205/iris-7b

The result file is written to results_self/iris-7b-result.jsonl.

JSON schema example

  • reference: the original text
  • generation: the model's re-translation (back-translation)
  • generation1: the model's first translation
{
    "index":0,
    "reference":"Let's make a graph here showing different levels of interest in activities.",
    "generation":"let's create a graph that shows different levels of interest in activities here",
    "generation1":"여기서 활동에 대한 다양한 수준의 관심을 보여주는 그래프를 만들어 보겠습니다.",
    "bleu":0.49,
    "lang":"en",
    "model":"davidkim205/iris-7b",
    "src":"aihub-MTPE",
    "conversations":[
        {
            "from":"human",
            "value":"다음 문장을 한글로 번역하세요.\nLet's make a graph here showing different levels of interest in activities."
        },
        {
            "from":"gpt",
            "value":"활동에 대한 다양한 수준의 관심을 보여주는 그래프를 만들어 보겠습니다."
        }
    ]
}
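The field names in the record above can be read as a two-pass round trip. The sketch below shows that flow only; `round_trip`, `to_korean`, `to_english`, and `score` are hypothetical names, and the callables stand in for whatever model and BLEU scorer translation_self.py actually uses.

```python
def round_trip(text, to_korean, to_english, score):
    """Translate text to Korean and back, then score the result against the original."""
    generation1 = to_korean(text)         # first pass: model translation
    generation = to_english(generation1)  # second pass: re-translation
    return {
        "reference": text,                # the original text is the reference
        "generation": generation,
        "generation1": generation1,
        "bleu": score(text, generation),
    }
```

Note that generation holds the second pass and generation1 the first, matching the schema above.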

translate2(Bleu and SBleu)

Runs both translate and translate_self, so that BLEU and SBLEU can be compared together.

python translation2.py --model davidkim205/iris-7b
  • Runs translate and saves the output to results_bleu/iris-7b-result.jsonl
  • Runs translate_self and saves the output to results_self/iris-7b-result.jsonl

Each file has the same contents as the corresponding file generated in the two sections above.

Evaluation

Translation results are verified in two ways.

  1. Evaluation by comparing the reference translation with the model translation
python evaluate.py results_bleu/

output

bleu scores
result_bleu-nllb200.jsonl: 0.26, out_of_range_count=3, duplicate=1
result_bleu-madlad400.jsonl: 0.29, out_of_range_count=6, duplicate=3
result_bleu-TowerInstruct.jsonl: 0.32, out_of_range_count=9, duplicate=1
result_bleu-gugugo.jsonl: 0.32, out_of_range_count=3, duplicate=1
result_bleu-Synatra-7B-v0.3-Translation.jsonl: 0.35, out_of_range_count=2, duplicate=1
result_bleu-deepl.jsonl: 0.39, out_of_range_count=1, duplicate=0
result_bleu-azure.jsonl: 0.40, out_of_range_count=2, duplicate=0
result_bleu-google.jsonl: 0.40, out_of_range_count=3, duplicate=0
result_bleu-papago.jsonl: 0.43, out_of_range_count=3, duplicate=0
result_bleu-iris_7b.jsonl: 0.40, out_of_range_count=3, duplicate=0
  2. Evaluation by comparing the original text with the result of translating it twice (English -> Korean -> English)
python evaluate.py results_self/

output

bleu scores
result_self-nllb200.jsonl: 0.30, out_of_range_count=1, duplicate=1
result_self-gugugo.jsonl: 0.36, out_of_range_count=1, duplicate=1
result_self-madlad400.jsonl: 0.38, out_of_range_count=3, duplicate=2
result_self-TowerInstruct.jsonl: 0.39, out_of_range_count=3, duplicate=0
result_self-Synatra-7B-v0.3-Translation.jsonl: 0.41, out_of_range_count=2, duplicate=1
result_self-deepl.jsonl: 0.45, out_of_range_count=0, duplicate=0
result_self-papago.jsonl: 0.49, out_of_range_count=0, duplicate=0
result_self-azure.jsonl: 0.49, out_of_range_count=0, duplicate=1
result_self-google.jsonl: 0.49, out_of_range_count=0, duplicate=0
result_self-papago.jsonl: 0.51, out_of_range_count=0, duplicate=0
result_self-iris_7b.jsonl: 0.43, out_of_range_count=1, duplicate=0

Evaluation criteria

  • BLEU: mean BLEU score between the reference translation and the model translation
  • SBLEU: mean BLEU score between the original text and its re-translation
  • Duplicate: cases where the translation contains duplicated text
  • Length Exceeds: mismatch between the lengths of the model translation and the reference translation (a translation is in range when 0.2 < length < 2)
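The mean score and the length check can be sketched as follows. This is not evaluate.py itself: it assumes that "length" in the criterion means the ratio of generation length to reference length, measured in characters, and the function names are hypothetical. Duplicate detection is left out because the source does not describe how it is decided.

```python
def length_in_range(reference, generation, low=0.2, high=2.0):
    """True when len(generation) / len(reference) lies strictly between low and high."""
    if not reference:
        return False
    return low < len(generation) / len(reference) < high

def summarize(records):
    """Return (mean BLEU, out-of-range count) for an iterable of result records."""
    records = list(records)
    if not records:
        return 0.0, 0
    mean_bleu = sum(r["bleu"] for r in records) / len(records)
    # Count the records whose length ratio falls outside the allowed range.
    out_of_range = sum(
        not length_in_range(r["reference"], r["generation"]) for r in records
    )
    return mean_bleu, out_of_range
```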

BLEU

These are the evaluation results for each model. The iris-7b model is assessed as follows.

  • Higher translation performance than the existing models in every evaluation
  • On average, translation performance on par with cloud translation
  • Duplicate-sentence generation and length-exceeded problems at the same level as cloud translation

plot-bleu.png

Duplicate (duplicate sentence generation) and Length Exceeds (length exceeded) are metrics taken from the translation (bleu) run.

| Type | Model | BLEU | SBLEU | Duplicate | Length Exceeds |
|---|---|---|---|---|---|
| HuggingFace | facebook/nllb-200-distilled-1.3B | 0.26 | 0.30 | 1 | 3 |
| HuggingFace | jbochi/madlad400-10b-mt | 0.29 | 0.38 | 3 | 6 |
| HuggingFace | Unbabel/TowerInstruct-7B-v0.1 | 0.32 | 0.39 | 1 | 9 |
| HuggingFace | squarelike/Gugugo-koen-7B-V1.1 | 0.32 | 0.36 | 1 | 3 |
| HuggingFace | maywell/Synatra-7B-v0.3-Translation | 0.35 | 0.41 | 1 | 2 |
| Cloud | deepl | 0.39 | 0.45 | 0 | 1 |
| Cloud | azure | 0.40 | 0.49 | 0 | 3 |
| Cloud | google | 0.40 | 0.49 | 0 | 2 |
| Cloud | papago | 0.43 | 0.51 | 0 | 3 |
| HuggingFace | davidkim205/iris-7b (ours) | 0.40 | 0.43 | 0 | 3 |
  • SBLEU: Self-evaluation BLEU

BLEU by source

These are the results of evaluating translation quality on the test dataset by domain. The iris-7b model is assessed as follows.

  • Outperforms the existing HuggingFace translation models in most domains (14 of the 18 listed below)
  • In many domains, performance similar to or better than cloud translation
  • Very strong translation quality in the science and neologism (colloquial) domains
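A per-domain breakdown like this one can be derived from the result records by grouping on the src field. The sketch below assumes each record carries src and bleu as in the result schema shown earlier; `bleu_by_source` is a hypothetical helper, not a function from the project.

```python
from collections import defaultdict

def bleu_by_source(records):
    """Map each src value to the mean of its records' bleu scores."""
    scores = defaultdict(list)
    for record in records:
        scores[record["src"]].append(record["bleu"])
    return {src: sum(vals) / len(vals) for src, vals in scores.items()}
```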

plot-bleu-by-src.png

| Type | Model | Average | MTPE | techsci2 | expertise | humanities | sharegpt-deepl-ko-translation | MT-new-corpus | socialsci | korean-parallel-corpora | parallel-translation | food | techsci | para_pat | speechtype-based-machine-translation | koopus100 | basicsci | broadcast-content | patent | colloquial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HuggingFace | facebook/nllb-200-distilled-1.3B | 0.26 | 0.44 | 0.28 | 0.16 | 0.23 | 0.44 | 0.34 | 0.27 | 0.10 | 0.23 | 0.37 | 0.28 | 0.19 | 0.29 | 0.23 | 0.15 | 0.33 | 0.09 | 0.29 |
| HuggingFace | jbochi/madlad400-10b-mt | 0.29 | 0.45 | 0.29 | 0.20 | 0.29 | 0.40 | 0.36 | 0.39 | 0.12 | 0.22 | 0.46 | 0.30 | 0.23 | 0.48 | 0.23 | 0.19 | 0.36 | 0.01 | 0.33 |
| HuggingFace | Unbabel/TowerInstruct-7B-v0.1 | 0.32 | 0.46 | 0.33 | 0.28 | 0.27 | 0.30 | 0.39 | 0.37 | 0.14 | 0.35 | 0.47 | 0.39 | 0.29 | 0.41 | 0.21 | 0.22 | 0.36 | 0.15 | 0.33 |
| HuggingFace | squarelike/Gugugo-koen-7B-V1.1 | 0.32 | 0.46 | 0.27 | 0.28 | 0.22 | 0.66 | 0.33 | 0.36 | 0.10 | 0.29 | 0.45 | 0.34 | 0.24 | 0.42 | 0.22 | 0.23 | 0.42 | 0.20 | 0.26 |
| HuggingFace | maywell/Synatra-7B-v0.3-Translation | 0.35 | 0.43 | 0.36 | 0.27 | 0.23 | 0.70 | 0.37 | 0.31 | 0.13 | 0.34 | 0.52 | 0.35 | 0.29 | 0.44 | 0.21 | 0.24 | 0.46 | 0.28 | 0.37 |
| Cloud | deepl | 0.39 | 0.59 | 0.33 | 0.31 | 0.32 | 0.70 | 0.48 | 0.38 | 0.14 | 0.38 | 0.55 | 0.41 | 0.33 | 0.48 | 0.24 | 0.28 | 0.42 | 0.37 | 0.36 |
| Cloud | azure | 0.40 | 0.57 | 0.36 | 0.35 | 0.29 | 0.63 | 0.46 | 0.39 | 0.16 | 0.38 | 0.56 | 0.39 | 0.33 | 0.54 | 0.22 | 0.29 | 0.52 | 0.35 | 0.41 |
| Cloud | google | 0.40 | 0.62 | 0.39 | 0.32 | 0.32 | 0.60 | 0.45 | 0.45 | 0.14 | 0.38 | 0.59 | 0.43 | 0.34 | 0.45 | 0.22 | 0.28 | 0.47 | 0.39 | 0.36 |
| Cloud | papago | 0.43 | 0.56 | 0.43 | 0.41 | 0.30 | 0.55 | 0.58 | 0.56 | 0.16 | 0.37 | 0.67 | 0.52 | 0.35 | 0.53 | 0.21 | 0.35 | 0.45 | 0.37 | 0.46 |
| HuggingFace | davidkim205/iris-7b (ours) | 0.40 | 0.49 | 0.37 | 0.34 | 0.31 | 0.72 | 0.48 | 0.43 | 0.11 | 0.33 | 0.56 | 0.46 | 0.34 | 0.43 | 0.20 | 0.30 | 0.47 | 0.41 | 0.40 |

BLEU by sentence length

The data were split into four ranges by text length, 50 examples were sampled from each range, and the scores below are the average BLEU of their translations. The datasets used for this evaluation are as follows.

  • data/komt-dataset-100.jsonl
  • data/komt-dataset-500.jsonl
  • data/komt-dataset-1000.jsonl
  • data/komt-dataset-1500.jsonl

The translations and their BLEU scores are stored under results_length/.

Notably, the iris-7b model scores higher than most of the cloud translators in every range.

  • ~100: (0, 100]
  • ~500: (100, 500]
  • ~1000: (500, 1000]
  • ~1500: (1000, 1500]
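The half-open ranges above map naturally onto a sorted list of upper bounds. The sketch below assumes length is measured in characters, which the source does not state; `length_bucket` and the two constants are hypothetical names.

```python
from bisect import bisect_left

UPPER_BOUNDS = [100, 500, 1000, 1500]
LABELS = ["~100", "~500", "~1000", "~1500"]

def length_bucket(text):
    """Return the range label for text, or None if it is empty or longer than 1500."""
    n = len(text)
    if n == 0 or n > UPPER_BOUNDS[-1]:
        return None
    # bisect_left finds the first upper bound >= n, so each range includes its right end.
    return LABELS[bisect_left(UPPER_BOUNDS, n)]
```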

plot-bleu-by-sentence-length.png

| Type | Model | Average | ~100(50) | ~500(50) | ~1000(50) | ~1500(50) |
|---|---|---|---|---|---|---|
| HuggingFace | facebook/nllb-200-distilled-1.3B | 0.24 | 0.31 | 0.31 | 0.22 | 0.13 |
| HuggingFace | jbochi/madlad400-10b-mt | 0.22 | 0.35 | 0.37 | 0.08 | 0.10 |
| HuggingFace | Unbabel/TowerInstruct-7B-v0.1 | 0.32 | 0.41 | 0.31 | 0.24 | 0.32 |
| HuggingFace | squarelike/Gugugo-koen-7B-V1.1 | 0.45 | 0.37 | 0.48 | 0.52 | 0.43 |
| HuggingFace | maywell/Synatra-7B-v0.3-Translation | 0.50 | 0.41 | 0.57 | 0.57 | 0.51 |
| Cloud | deepl | 0.53 | 0.44 | 0.56 | 0.64 | 0.50 |
| Cloud | azure | 0.47 | 0.46 | 0.47 | 0.52 | 0.44 |
| Cloud | google | 0.51 | 0.50 | 0.49 | 0.54 | 0.51 |
| Cloud | papago | 0.46 | 0.50 | 0.46 | 0.43 | 0.45 |
| HuggingFace | davidkim205/iris-7b (ours) | 0.56 | 0.51 | 0.58 | 0.62 | 0.54 |

test dataset info

The test dataset consists of 10 examples from each of 18 domains, 180 examples in total.

The koopus100 dataset is of low quality: its sentences are short, and in some entries the source text and the translation do not match.

text: All right
translation: 별로 그럴 기분 아니야 - I'm not in the mood.

text: Do you have a fever?
translation: 뭐라고 했어?

The korean-parallel-corpora dataset is of low quality: its translations mix Korean and English or are simply wrong.

text: S. Korea mulls missile defense system ํ•œ๊ตญ, ์ž์ฒด์  ๋ฏธ์‚ฌ์ผ ๋ฐฉ์–ด์ฒด๊ณ„ ์ˆ˜๋ฆฝ ๊ฒ€ํ†      2007.03
translation: South Korea maintains a mandatory draft system under which all able-bodied men over 20 must serve in the military for 24 to 27 months.

text: A United States intelligence agency has been collecting data on the phone calls of tens of millions of Americans, a report in USA Today has alleged.
translation: NSA collects Americansโ€™phone clall data๋ฏธ ๊ตญ๊ฐ€์•ˆ๋ณด๊ตญ, ๋ฏธ๊ตญ๋ฏผ ํ†ตํ™” ๋‚ด์šฉ ์ˆ˜์ง‘2006.07

text: I see the guy as more like John Wayne, which is to say I don't like his politics but he's endearing in a strange, goofy, awkward way, and he did capture the imagination of the country,\" he said.
translation: ๋ฒ ํŠธ๋‚จ์ „์— ์ฐธ์ „ํ–ˆ๋˜ ์Šคํ†ค ๊ฐ๋…์€ ๋น„ํŒ์ ์œผ๋กœ ํ˜ธํ‰์„ ๋ฐ›๊ณ  ์ •์น˜์ ์ธ ์„ฑํ–ฅ์ด ๋งŽ์€ ์˜ํ™”๋ฅผ ์ œ์ž‘ํ•œ ๊ฒƒ์œผ๋กœ ์œ ๋ช…ํ•˜๋‹ค.

text: The Sahara is advancing into Ghana and Nigeria at the rate of 3,510 square kilometers per year.
translation: ์นด์žํ์Šคํƒ„ ๋˜ํ•œ ์‚ฌ๋ง‰ํ™”๋กœ ์ธํ•ด 1980๋…„ ์ดํ›„ ๋†๊ฒฝ์ง€์˜ 50%๊ฐ€ ์‚ฌ๋ผ์กŒ์œผ๋ฉฐ ์‚ฌํ•˜๋ผ ์‚ฌ๋ง‰์€ ๋งค๋…„ 3510ใŽข์”ฉ ์ปค์ ธ๊ฐ€๋ฉฐ ๊ฐ€๋‚˜์™€ ๋‚˜์ด์ง€๋ฆฌ์•„๋ฅผ ์œ„ํ˜‘ํ•˜๊ณ  ์žˆ๋‹ค.

The table below lists the ratio and description of each src. Each src contributes 10 of the 180 examples.

| src | ratio | description |
|---|---|---|
| aihub-MTPE | 5.56% | Machine translation quality post-editing dataset |
| aihub-techsci2 | 5.56% | Korean-English translation dataset for technical science fields such as ICT and electrical/electronics |
| aihub-expertise | 5.56% | Korean-English translation dataset for specialized fields such as medicine, finance, and sports |
| aihub-humanities | 5.56% | Korean-English translation dataset for the humanities |
| sharegpt-deepl-ko-translation | 5.56% | shareGPT dataset converted from question-answer format to Korean-English translation format |
| aihub-MT-new-corpus | 5.56% | Korean-English translation dataset for building machine translation apps |
| aihub-socialsci | 5.56% | Korean-English translation dataset for social science fields such as law, education, and economics |
| korean-parallel-corpora | 5.56% | Korean-English parallel translation dataset |
| aihub-parallel-translation | 5.56% | Korean-English translation dataset organized by speech type and domain |
| aihub-food | 5.56% | English-Korean translation dataset for the food domain |
| aihub-techsci | 5.56% | Korean-English translation dataset for technical science fields such as ICT and electrical/electronics |
| para_pat | 5.56% | English-Korean subset of the ParaPat dataset |
| aihub-speechtype-based-machine-translation | 5.56% | English-Korean translation dataset organized by speech type |
| koopus100 | 5.56% | English-Korean subset of the OPUS-100 dataset |
| aihub-basicsci | 5.56% | Korean-English translation dataset for basic science fields such as mathematics and physics |
| aihub-broadcast-content | 5.56% | Korean-English translation dataset for broadcast content |
| aihub-patent | 5.56% | English-Korean translation dataset of patent specifications |
| aihub-colloquial | 5.56% | Colloquial Korean-English translation dataset that includes neologisms and abbreviations |