---
language:
- ko
license: apache-2.0
library_name: transformers
tags:
- text2text-generation
datasets:
- aihub
metrics:
- bleu
- rouge
model-index:
- name: ko-barTNumText
  results:
  - task:
      type: text2text-generation
      name: text2text-generation
    metrics:
    - type: bleu
      value: 0.9161441917016176
      name: eval_bleu
      verified: true
    - type: rouge1
      value: 0.9502159661745533
      name: eval_rouge1
      verified: true
    - type: rouge2
      value: 0.9313935147887745
      name: eval_rouge2
      verified: true
    - type: rougeL
      value: 0.950015374196916
      name: eval_rougeL
      verified: true
    - type: rougeLsum
      value: 0.9500390902948073
      name: eval_rougeLsum
      verified: true
---
# ko-barTNumText (TNT Model 🧨): Try Number To Korean Reading (a model that converts numbers into their Korean readings)
## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Evaluation](#evaluation)
- [How to Get Started With the Model](#how-to-get-started-with-the-model)

## Model Details
**Model Description:** I built this model because, however much I searched, there was no existing model or algorithm for this task.
BartForConditionalGeneration fine-tuning model for number-to-Korean conversion.

This is a BartForConditionalGeneration model fine-tuned on the task of converting numbers into their Korean readings. The dataset comes from Korea AIHub; I cannot release the fine-tuning data for private reasons. Note that Korea AIHub data is available to Koreans only, which is why the original note on obtaining the data was written in Korean.

More precisely, the model was trained to translate phonetic transcription into orthographic transcription (following the ETRI transcription guidelines). For example, ten million can be written either as 1000만 or as 10000000, so the training dataset is crucial, and results may differ depending on the training data.
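As an illustration of the task, a few hypothetical input/output pairs (these are examples for illustration, not taken from the actual training data):

```python
# Hypothetical conversion pairs: the phonetic-transcription side (with digits)
# is rewritten as its orthographic Korean reading.
examples = {
    "그러게 누가 6시까지 술을 마시랬어?": "그러게 누가 여섯 시까지 술을 마시랬어?",
    "1000만": "천만",
}

for source, target in examples.items():
    print(source, "->", target)
```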
- **Developed by:** [Yoo SungHyun](https://github.com/YooSungHyun)
- **Language(s):** Korean
- **License:** apache-2.0
- **Parent Model:** See kobart-base-v2 for more information about the pre-trained base model.
## Uses
This model is inferenced token-BACKWARD, so you have to flip the output before `tokenizer.decode()`.
At inference time the model predicts the target sequence in reverse order (e.g. "밥은 6시에 먹었어" -> "어 먹었 시에 여섯 은 밥"), so flip the output ids back into forward order before calling `tokenizer.decode`.
For more detail, follow the KoGPT_num_converter repository, and see `bart_inference.py` and `bart_train.py`.
```python
import torch
from transformers import Text2TextGenerationPipeline
from transformers.pipelines.text2text_generation import ReturnType


class BartText2TextGenerationPipeline(Text2TextGenerationPipeline):
    def postprocess(self, model_outputs, return_type=ReturnType.TEXT, clean_up_tokenization_spaces=False):
        records = []
        # The model generates tokens backwards, so flip them before decoding.
        reversed_model_outputs = torch.flip(model_outputs["output_ids"][0], dims=[-1])
        for output_ids in reversed_model_outputs:
            if return_type == ReturnType.TENSORS:
                record = {f"{self.return_name}_token_ids": output_ids}
            elif return_type == ReturnType.TEXT:
                record = {
                    f"{self.return_name}_text": self.tokenizer.decode(
                        output_ids,
                        skip_special_tokens=True,
                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                    )
                }
            records.append(record)
        return records
```
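The flip step itself can be sketched in plain Python (hypothetical token ids; for a 1-D id sequence, `torch.flip(..., dims=[-1])` is equivalent to a plain list reversal):

```python
# Hypothetical output ids emitted by the model in backward order.
backward_ids = [101, 42, 17, 9]

# Reverse the sequence so it reads forward before decoding.
forward_ids = backward_ids[::-1]
print(forward_ids)  # [9, 17, 42, 101]
```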
## Evaluation
Evaluation simply uses evaluate-metric/bleu and evaluate-metric/rouge from the Hugging Face `evaluate` library.
## How to Get Started With the Model
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name_or_path = "ko-barTNumText"  # model hub id or local checkpoint path

texts = ["그러게 누가 6시까지 술을 마시랬어?"]

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)

# BartText2TextGenerationPipeline is implemented above (see "Uses")
seq2seqlm_pipeline = BartText2TextGenerationPipeline(model=model, tokenizer=tokenizer)

# Example generation settings; tune them for your use case.
kwargs = {
    "min_length": 0,
    "max_length": 128,
    "num_beams": 5,
    "do_sample": False,
    "num_beam_groups": 1,
}
pred = seq2seqlm_pipeline(texts, **kwargs)
print(pred)
# 그러게 누가 여섯 시까지 술을 마시랬어?
```