
NewsKoT5

This T5 model was pre-trained on 29GB of Korean news articles. However, it was trained with a small batch size and a limited number of training steps, so its performance may not be fully optimized.

Quick tour

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("BM-K/NewsKoT5-small")
model = T5ForConditionalGeneration.from_pretrained("BM-K/NewsKoT5-small")

# T5 span-corruption format: each <extra_id_n> sentinel in the input marks a
# masked span, and the labels list those spans in order behind the same sentinels.
input_ids = tokenizer("ํ•œ๊ตญํ˜•๋ฐœ์‚ฌ์ฒด ๋ˆ„๋ฆฌํ˜ธ๊ฐ€ ์‹ค์šฉ๊ธ‰ <extra_id_0> ๋ฐœ์‚ฌ์ฒด๋กœ์„œ โ€˜๋ฐ๋ท”โ€™๋ฅผ ์„ฑ๊ณต์ ์œผ๋กœ <extra_id_1>", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> ์œ„์„ฑ <extra_id_1> ๋งˆ์ณค๋‹ค <extra_id_2>", return_tensors="pt").input_ids

# A forward pass with labels returns the denoising cross-entropy loss in outputs.loss.
outputs = model(input_ids=input_ids,
                labels=labels)
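
The same masked input can also be decoded directly with generate(); a minimal sketch (the decoded spans are illustrative, not guaranteed output):

# Hypothetical usage: predict the masked spans instead of computing a loss.
generated = model.generate(input_ids, max_length=32)
print(tokenizer.decode(generated[0], skip_special_tokens=False))
# Expected form: "<extra_id_0> ... <extra_id_1> ... <extra_id_2>" (illustrative)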

News Summarization Performance (ROUGE F1)

The model's tokenized output was first restored to plain text; ROUGE F1 was then computed between the reference and the hypothesis after re-tokenizing both with MeCab, as sketched below.
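
A minimal sketch of this scoring procedure, assuming the konlpy MeCab wrapper and the rouge PyPI package (the exact tooling behind the reported numbers is an assumption):

# Sketch only: konlpy's Mecab and the `rouge` package are assumptions,
# not necessarily what produced the numbers below.
from konlpy.tag import Mecab
from rouge import Rouge

mecab = Mecab()
scorer = Rouge()

def rouge_f1(hypothesis: str, reference: str) -> dict:
    # Re-tokenize both sides with MeCab and score the space-joined morphemes.
    hyp = " ".join(mecab.morphs(hypothesis))
    ref = " ".join(mecab.morphs(reference))
    scores = scorer.get_scores(hyp, ref)[0]
    return {name: s["f"] for name, s in scores.items()}  # rouge-1 / rouge-2 / rouge-l F1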

  • Dacon ํ•œ๊ตญ์–ด ๋ฌธ์„œ ์ƒ์„ฑ์š”์•ฝ AI ๊ฒฝ์ง„๋Œ€ํšŒ (Korean Document Abstractive Summarization AI Competition) Dataset
    • Training: 29,432
    • Validation: 7,358
    • Test: 9,182
Model          #Param   ROUGE-1   ROUGE-2   ROUGE-L
pko-t5-small   95M      51.48     33.18     44.96
NewsT5-small   61M      52.15     33.59     45.41
  • AI-Hub ๋ฌธ์„œ์š”์•ฝ ํ…์ŠคํŠธ (Document Summarization Text) Dataset
    • Training: 245,626
    • Validation: 20,296
    • Test: 9,931
Model          #Param   ROUGE-1   ROUGE-2   ROUGE-L
pko-t5-small   95M      53.44     34.03     45.36
NewsT5-small   61M      53.74     34.27     45.52
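
For reference, a minimal sketch of how such a summarization fine-tune could be set up with the Hugging Face Seq2SeqTrainer; the column names, hyperparameters, and toy dataset are illustrative assumptions, not the configuration behind the numbers above:

# Illustrative fine-tuning sketch; column names ("text", "summary"),
# hyperparameters, and the toy dataset are assumptions.
from datasets import Dataset
from transformers import (AutoTokenizer, T5ForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("BM-K/NewsKoT5-small")
model = T5ForConditionalGeneration.from_pretrained("BM-K/NewsKoT5-small")

def preprocess(batch):
    # Tokenize articles as encoder input and summaries as decoder labels.
    inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

# Replace this toy dataset with the actual summarization corpus.
raw = Dataset.from_dict({"text": ["๊ธฐ์‚ฌ ๋ณธ๋ฌธ ..."], "summary": ["์š”์•ฝ ..."]})
train_dataset = raw.map(preprocess, batched=True, remove_columns=["text", "summary"])

args = Seq2SeqTrainingArguments(output_dir="newskot5-summarization",
                                per_device_train_batch_size=16,
                                num_train_epochs=3,
                                predict_with_generate=True)
trainer = Seq2SeqTrainer(model=model,
                         args=args,
                         train_dataset=train_dataset,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()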