|
--- |
|
license: apache-2.0 |
|
base_model: KETI-AIR/ke-t5-base |
|
tags: |
|
- generated_from_trainer |
|
model-index: |
|
- name: ke_t5_base_bongsoo_en_ko |
|
results: [] |
|
--- |
|
|
|
|
|
|
# ke_t5_base_bongsoo_en_ko |
|
|
|
This model is a fine-tuned version of [KETI-AIR/ke-t5-base](https://huggingface.co/KETI-AIR/ke-t5-base) |
|
on the [bongsoo/news_talk_en_ko](https://huggingface.co/datasets/bongsoo/news_talk_en_ko) dataset.
|
See the training notebook [translation_ke_t5_base_bongsoo_en_ko.ipynb](https://github.com/chunwoolee0/ko-nlp/blob/main/translation_ke_t5_base_bongsoo_en_ko.ipynb) for the fine-tuning code.
|
|
|
## Model description |
|
|
|
KE-T5 is a pretrained T5 (text-to-text transfer transformer) model trained on Korean and English corpora, developed by KETI (Korea Electronics Technology Institute).
|
The vocabulary used by KE-T5 consists of 64,000 sub-word tokens and was created with Google's SentencePiece. The SentencePiece model was trained to cover 99.95% of a 30 GB corpus with an approximate 7:3 mix of Korean and English.
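
As a quick check of the tokenizer described above, the snippet below (a sketch, not taken from the training notebook) loads the KE-T5 tokenizer and inspects the vocabulary size and the sub-word segmentation of a sample sentence.

```python
from transformers import AutoTokenizer

# Load the SentencePiece-based tokenizer shipped with KE-T5.
tokenizer = AutoTokenizer.from_pretrained("KETI-AIR/ke-t5-base")

# The vocabulary should contain roughly 64,000 sub-word tokens.
print(tokenizer.vocab_size)

# Inspect how an English sentence is segmented into sub-words.
print(tokenizer.tokenize("Let us go for a walk after lunch."))
```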
|
|
|
## Intended uses & limitations |
|
|
|
This model is intended for translation from English to Korean. It was fine-tuned for only one epoch on a subset of the data, so translation quality is limited.
|
|
|
## Usage |
|
|
|
You can use this model directly with a `pipeline` for translation:
|
|
|
```python |
|
>>> from transformers import pipeline |
|
>>> translator = pipeline('translation', model='chunwoolee0/ke_t5_base_bongsoo_en_ko') |
|
|
|
>>> translator("Let us go for a walk after lunch.") |
|
[{'translation_text': '점심을 마치고 산책을 하러 가자.'}]
|
|
|
>>> translator("The BRICS countries welcomed six new members from three different continents on Thursday.") |
|
[{'translation_text': '브릭스 국가들은 지난 24일 3개 대륙 6명의 신규 회원을 환영했다.'}]
|
|
|
>>> translator("The BRICS countries welcomed six new members from three different continents on Thursday, marking a historic milestone that underscored the solidarity of BRICS and developing countries and determination to work together for a better future, officials and experts said.",max_length=400) |
|
[{'translation_text': '브렉스 국가는 지난 7일 3개 대륙 6명의 신규 회원을 환영하며 BRICS와 개발도상국의 연대와 더 나은 미래를 위해 함께 노력하겠다는 의지를 재확인한 역사적인 이정표를 장식했다고 관계자들과 전문가들은 전했다.'}]
|
|
|
>>> translator("Bidenโs decree zaps lucrative investments in Chinaโs chip and AI sectors") |
|
[{'translation_text': '바이든 장관의 행정명령은 중국 칩과 AI 분야의 고수익 투자를 옥죄는 것이다.'}]
|
|
|
>>> translator("It is most likely that Chinaโs largest chip foundry, a key piece of the puzzle in Beijingโs efforts to achieve greater self-sufficiency in semiconductors, would not have been able to set up its first plant in Shanghaiโs suburbs in the early 2000s without funding from American investors such as Walden International and Goldman Sachs.", max_length=400) |
|
[{'translation_text': '반도체의 더 큰 자립성을 이루기 위해 베이징이 애쓰는 퍼즐의 핵심 조각인 중국 최대 칩 파운드리가 월덴인터내셔널, 골드만삭스 등 미국 투자자로부터 자금 지원을 받지 못한 채 2000년대 초 상하이 시내에 첫 공장을 지을 수 없었을 가능성이 크다.'}]
```
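
The model can also be used without the pipeline by loading the tokenizer and model explicitly. The following is a minimal sketch; the generation settings (e.g. `max_length=128`) are illustrative, not values taken from the notebook.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "chunwoolee0/ke_t5_base_bongsoo_en_ko"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize the English source sentence and generate the Korean translation.
inputs = tokenizer("Let us go for a walk after lunch.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```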
|
|
|
## Training and evaluation data |
|
|
|
Only one third of the original training set of 1,200,000 sentence pairs was used because of the resource limits of Google Colab.
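
The subsampling can be reproduced along these lines; this is a sketch, and the exact subset size, seed, and split logic used in the notebook may differ.

```python
from datasets import load_dataset

# The full corpus contains about 1.2 million English-Korean sentence pairs.
raw = load_dataset("bongsoo/news_talk_en_ko")

# Keep roughly one third of the training split to fit within Colab's limits
# (the seed is an arbitrary choice for this sketch).
subset = raw["train"].shuffle(seed=42).select(range(400_000))
print(subset)
```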
|
|
|
## Training procedure |
|
|
|
Because of Google Colab's resource limits, the model was trained for only one epoch. Even so, the result is quite satisfactory and the translation quality is reasonable.
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 0.0005 |
|
- train_batch_size: 32 |
|
- eval_batch_size: 32 |
|
- seed: 42 |
|
- gradient_accumulation_steps: 2 |
|
- total_train_batch_size: 64 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 1 |
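
For reference, a `Seq2SeqTrainingArguments` configuration matching the hyperparameters above would look roughly like this; the `output_dir` and `predict_with_generate` settings are assumptions for this sketch.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="ke_t5_base_bongsoo_en_ko",  # assumed output directory
    learning_rate=5e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,   # total train batch size of 64
    num_train_epochs=1,
    lr_scheduler_type="linear",
    seed=42,
    predict_with_generate=True,      # needed to compute BLEU at evaluation time
)
```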
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Bleu | |
|
|:-------------:|:-----:|:----:|:---------------:|:------:| |
|
| No log | 1.0 | 5625 | 2.4075 | 8.2272 | |
|
|
|
- RAM usage: 4.8 / 12.7 GB

- GPU memory usage: 13.0 / 15.0 GB

- training time: about 3 hours
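
The BLEU score in the table above was computed during evaluation. Independently of the Trainer, a comparable check can be run with the `evaluate` library; the snippet below is a sketch with made-up prediction and reference strings, not the actual validation data or metric configuration.

```python
import evaluate

# sacreBLEU expects detokenized predictions and a list of references per prediction.
bleu = evaluate.load("sacrebleu")

predictions = ["점심을 마치고 산책을 하러 가자."]
references = [["점심을 먹은 뒤에 산책을 하러 가자."]]

print(bleu.compute(predictions=predictions, references=references))
```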
|
|
|
### Framework versions |
|
|
|
- Transformers 4.32.0 |
|
- Pytorch 2.0.1+cu118 |
|
- Datasets 2.14.4 |
|
- Tokenizers 0.13.3 |
|
|