---
language:
  - ko
---

# NewsKoT5

The training data for this T5 model consists of Korean news articles (29GB). However, because it was trained with a small batch size and a limited number of training steps, its performance may not be fully optimized.

## Quick tour

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("BM-K/NewsKoT5-small")
model = T5ForConditionalGeneration.from_pretrained("BM-K/NewsKoT5-small")

# Span-corruption example: <extra_id_n> sentinel tokens mark the masked spans
# in the input, and the labels contain the spans to be reconstructed.
input_ids = tokenizer("ํ•œ๊ตญํ˜•๋ฐœ์‚ฌ์ฒด ๋ˆ„๋ฆฌํ˜ธ๊ฐ€ ์‹ค์šฉ๊ธ‰ <extra_id_0> ๋ฐœ์‚ฌ์ฒด๋กœ์„œ ‘๋ฐ๋ท”’๋ฅผ ์„ฑ๊ณต์ ์œผ๋กœ <extra_id_1>", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> ์œ„์„ฑ <extra_id_1> ๋งˆ์ณค๋‹ค <extra_id_2>", return_tensors="pt").input_ids

outputs = model(input_ids=input_ids,
                labels=labels)
```
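
Beyond computing the training loss, the masked spans can also be filled in at inference time with `generate()`. A minimal sketch continuing from the snippet above; the decoding parameters are illustrative choices, not tuned values from this card:

```python
# Fill in the sentinel spans. num_beams / max_length are
# illustrative settings, not defaults documented for this model.
generated = model.generate(input_ids, num_beams=4, max_length=32)
print(tokenizer.decode(generated[0], skip_special_tokens=False))
```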

## News Summarization Performance (F1-score)

ROUGE F1 was evaluated by restoring the model's tokenized output to the original text, then tokenizing both the reference and the hypothesis with MeCab before comparison.
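
The snippet below sketches this evaluation protocol. It assumes the `konlpy` MeCab wrapper and the `rouge` PyPI package; neither library is specified in the original card:

```python
from konlpy.tag import Mecab  # assumed MeCab wrapper; not specified in the card
from rouge import Rouge       # assumed ROUGE implementation; not specified in the card

mecab = Mecab()

def mecab_tokenize(text: str) -> str:
    # Split into morphemes and rejoin with spaces so ROUGE compares
    # morpheme-level tokens rather than whole eojeol (space-delimited words).
    return " ".join(mecab.morphs(text))

# Placeholder strings: in practice the hypothesis is the model's decoded
# summary restored to plain text, and the reference is the gold summary.
hypothesis = "๋ˆ„๋ฆฌํ˜ธ ์œ„์„ฑ ๋ฐœ์‚ฌ ์„ฑ๊ณต"
reference = "๋ˆ„๋ฆฌํ˜ธ๊ฐ€ ์œ„์„ฑ ๋ฐœ์‚ฌ์— ์„ฑ๊ณตํ–ˆ๋‹ค"

scores = Rouge().get_scores(mecab_tokenize(hypothesis),
                            mecab_tokenize(reference), avg=True)
print({name: s["f"] for name, s in scores.items()})  # rouge-1/2/l F1 scores
```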

- Dacon ํ•œ๊ตญ์–ด ๋ฌธ์„œ ์ƒ์„ฑ์š”์•ฝ AI ๊ฒฝ์ง„๋Œ€ํšŒ (Korean Document Summarization AI Competition) dataset
  - Training: 29,432
  - Validation: 7,358
  - Test: 9,182

| Model | #Param | rouge-1 | rouge-2 | rouge-l |
|---|---|---|---|---|
| pko-t5-small | 95M | 51.48 | 33.18 | 44.96 |
| NewsT5-small | 61M | 52.15 | 33.59 | 45.41 |
- AI-Hub ๋ฌธ์„œ์š”์•ฝ ํ…์ŠคํŠธ (Document Summarization Text) dataset
  - Training: 245,626
  - Validation: 20,296
  - Test: 9,931

| Model | #Param | rouge-1 | rouge-2 | rouge-l |
|---|---|---|---|---|
| pko-t5-small | 95M | 53.44 | 34.03 | 45.36 |
| NewsT5-small | 61M | 53.74 | 34.27 | 45.52 |