# ke_t5_base_bongsoo_en_ko
This model is a fine-tuned version of KETI-AIR/ke-t5-base on the bongsoo/news_news_talk_en_ko dataset. See translation_ke_t5_base_bongsoo_en_ko.ipynb
## Model description
KE-T5 is a T5 (Text-to-Text Transfer Transformer) model pretrained on Korean and English corpora, developed by KETI (한국전자연구원, Korea Electronics Technology Institute). The vocabulary used by KE-T5 consists of 64,000 sub-word tokens and was created using Google's SentencePiece. The SentencePiece model was trained to cover 99.95% of a 30GB corpus with an approximate 7:3 mix of Korean and English.
## Intended uses & limitations
Translation from English to Korean
## Usage
You can use this model directly with a translation pipeline:
>>> from transformers import pipeline
>>> translator = pipeline('translation', model='chunwoolee0/ke_t5_base_bongsoo_en_ko')
>>> translator("Let us go for a walk after lunch.")
[{'translation_text': '점심을 마치고 산책을 하러 가자.'}]
>>> translator("The BRICS countries welcomed six new members from three different continents on Thursday.")
[{'translation_text': '브릭스 국가들은 지난 24일 3개 대륙 6명의 신규 회원을 환영했다.'}]
>>> translator("The BRICS countries welcomed six new members from three different continents on Thursday, marking a historic milestone that underscored the solidarity of BRICS and developing countries and determination to work together for a better future, officials and experts said.", max_length=400)
[{'translation_text': '๋ธ๋ ์ค ๊ตญ๊ฐ๋ ์ง๋ 7์ผ 3๊ฐ ๋๋ฅ 6๋ช
์ ์ ๊ท ํ์์ ํ์ํ๋ฉฐ BRICS์ ๊ฐ๋ฐ๋์๊ตญ์ ์ฐ๋์ ๋ ๋์ ๋ฏธ๋๋ฅผ ์ํด ํจ๊ป ๋
ธ๋ ฅํ๊ฒ ๋ค๋ ์์ง๋ฅผ ์ฌํ์ธํ ์ญ์ฌ์ ์ธ ์ด์ ํ๋ฅผ ์ฅ์ํ๋ค๊ณ ๊ด๊ณ์๋ค๊ณผ ์ ๋ฌธ๊ฐ๋ค์ ์ ํ๋ค.'}]
>>> translator("Biden’s decree zaps lucrative investments in China’s chip and AI sectors")
[{'translation_text': '바이든 장관의 행정명령은 중국 칩과 AI 분야의 고수익 투자를 옥죄는 것이다.'}]
>>> translator("It is most likely that China’s largest chip foundry, a key piece of the puzzle in Beijing’s efforts to achieve greater self-sufficiency in semiconductors, would not have been able to set up its first plant in Shanghai’s suburbs in the early 2000s without funding from American investors such as Walden International and Goldman Sachs.", max_length=400)
[{'translation_text': '반도체의 더 큰 자립성을 이루기 위해 베이징이 애쓰는 퍼즐의 핵심 조각인 중국 최대 칩 파운드리가 월덴인터내셔널, 골드만삭스 등 미국 투자자로부터 자금 지원을 받지 못한 채 2000년대 초 상하이 시내에 첫 공장을 지을 수 없었을 가능성이 크다.'}]
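
Each call above returns a list holding one dict with a `translation_text` key. The following is a minimal sketch of a helper that translates several sentences and returns plain strings. The names `translate_all` and `fake_translator` are hypothetical and introduced only for illustration; the stub stands in for the real pipeline so the snippet runs without downloading the model.

```python
from typing import Callable, Dict, List


def translate_all(translator: Callable[..., List[Dict[str, str]]],
                  sentences: List[str],
                  max_length: int = 400) -> List[str]:
    """Translate each sentence and return only the translated strings."""
    results = []
    for sentence in sentences:
        # Each call is expected to return [{'translation_text': ...}],
        # the same shape as the pipeline outputs shown above.
        output = translator(sentence, max_length=max_length)
        results.append(output[0]["translation_text"])
    return results


def fake_translator(text: str, max_length: int = 400) -> List[Dict[str, str]]:
    """Hypothetical stand-in that mimics the pipeline's output shape."""
    return [{"translation_text": f"<ko>{text}</ko>"}]


print(translate_all(fake_translator, ["Let us go for a walk after lunch."]))
```

With the real model, pass the `translator` object created by `pipeline(...)` in place of `fake_translator`.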
## Training and evaluation data
Because of Google Colab resource limits, only one third of the original 1,200,000 training examples was selected for training.
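
The card does not say how the subset was drawn. One possible approach, shown here as a sketch only, is to sample one third of the row indices with a fixed seed. The function name `sample_one_third` is hypothetical, and the seed value 42 is borrowed from the training hyperparameters rather than confirmed as the value used for selection.

```python
import random


def sample_one_third(num_rows: int, seed: int = 42) -> list:
    """Return a sorted, reproducible sample of one third of the row indices."""
    rng = random.Random(seed)  # fixed seed so the same subset is drawn each run
    return sorted(rng.sample(range(num_rows), num_rows // 3))


indices = sample_one_third(1_200_000)
print(len(indices))  # 400000
```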
## Training procedure
Because of Google Colab limitations, the model was trained for only one epoch. Even so, the results are satisfactory and the translation quality is reasonable.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0005
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
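
The `total_train_batch_size` above is the per-step batch size multiplied by the gradient accumulation steps (on a single device), and a linear scheduler decays the learning rate toward zero over the run. The sketch below assumes no warmup, since the card lists none, and takes the 5625 total steps from the training results table. `linear_lr` is a hypothetical helper written for illustration, not the Trainer's own code.

```python
LEARNING_RATE = 0.0005
TRAIN_BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 2
TOTAL_STEPS = 5625  # step count reported in the training results table

# Effective batch size per optimizer update on a single device.
total_train_batch_size = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS


def linear_lr(step: int, base_lr: float = LEARNING_RATE,
              total_steps: int = TOTAL_STEPS) -> float:
    """Linear decay from base_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / total_steps)


print(total_train_batch_size)   # 64
print(linear_lr(0))             # 0.0005
print(linear_lr(TOTAL_STEPS))   # 0.0
```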
### Training results
| Training Loss | Epoch | Step | Validation Loss | Bleu |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| No log | 1.0 | 5625 | 2.4075 | 8.2272 |
- CPU memory usage: 4.8/12.7 GB
- GPU memory usage: 13.0/15.0 GB
- Running time: 3 h
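
As a rough consistency check on these figures, the number of training examples seen in the single epoch can be estimated from the step count and the total batch size. This is simple arithmetic on the reported values, not a figure stated by the author. It assumes every step used a full batch and that the 3 hours covers training only.

```python
STEPS = 5625                 # from the training results table
TOTAL_TRAIN_BATCH_SIZE = 64  # from the hyperparameter list
RUNNING_TIME_HOURS = 3       # reported running time

# Examples processed in one epoch, assuming full batches throughout.
examples_seen = STEPS * TOTAL_TRAIN_BATCH_SIZE

# Approximate optimizer steps per second over the whole run.
steps_per_second = STEPS / (RUNNING_TIME_HOURS * 3600)

print(examples_seen)               # 360000
print(round(steps_per_second, 2))  # 0.52
```

The estimate of 360,000 examples is somewhat below one third of 1,200,000; the card does not state how the selected data was split between training and evaluation.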
### Framework versions
- Transformers 4.32.0
- Pytorch 2.0.1+cu118
- Datasets 2.14.4
- Tokenizers 0.13.3