metadata
language:
  - ko
license: apache-2.0
library_name: transformers
tags:
  - text2text-generation
datasets:
  - aihub
metrics:
  - bleu
  - rouge
model-index:
  - name: ko-barTNumText
    results:
      - task:
          type: text2text-generation
          name: text2text-generation
        metrics:
          - type: bleu
            value: 0.9161441917016176
            name: eval_bleu
            verified: true
          - type: rouge1
            value: 0.9502159661745533
            name: eval_rouge1
            verified: true
          - type: rouge2
            value: 0.9313935147887745
            name: eval_rouge2
            verified: true
          - type: rougeL
            value: 0.950015374196916
            name: eval_rougeL
            verified: true
          - type: rougeLsum
            value: 0.9500390902948073
            name: eval_rougeLsum
            verified: true

ko-barTNumText (TNT Model🧨): Try Number To Korean Reading (a model that converts numbers into Korean text)


Model Details

  • Model Description: I could not find an existing model or algorithm for this task, so I built one.
    This is a BartForConditionalGeneration model fine-tuned to convert numbers into their Korean reading.

  • The dataset was obtained from Korea aihub.
    For private reasons, I cannot release all of the data used for fine-tuning.

  • Korea aihub data is available ONLY to Koreans.
    Since anyone able to obtain the aihub data will be Korean, this detail was originally given in Korean only: more precisely, the model is trained to translate the orthographic transcription (digits as written, e.g. 6시) into the phonetic transcription (spelled out as read, e.g. 여섯 시), following the ETRI transcription standard.

  • The same number can be written in several ways: ten million (천만) may appear as 1000만 or as 10000000. Results can therefore differ depending on the training data.

  • Developed by: Yoo SungHyun(https://github.com/YooSungHyun)

  • Language(s): Korean

  • License: apache-2.0

  • Parent Model: See kobart-base-v2 for more information about the pre-trained base model.

Uses

This model generates its output tokens in reverse order, so you have to flip the generated token IDs before calling tokenizer.decode(). For example, for the input 밥을 6시에 먹었어 the model produces 어 먹었 시에 여섯 을 밥.

For more detail, see KoGPT_num_converter, in particular bart_inference.py and bart_train.py.

import torch
from transformers.pipelines import Text2TextGenerationPipeline
from transformers.pipelines.text2text_generation import ReturnType


class BartText2TextGenerationPipeline(Text2TextGenerationPipeline):
    def postprocess(self, model_outputs, return_type=ReturnType.TEXT, clean_up_tokenization_spaces=False):
        records = []
        # The model emits tokens in reverse order, so flip each sequence along the token axis.
        reversed_model_outputs = torch.flip(model_outputs["output_ids"][0], dims=[-1])
        for output_ids in reversed_model_outputs:
            if return_type == ReturnType.TENSORS:
                record = {f"{self.return_name}_token_ids": output_ids}
            elif return_type == ReturnType.TEXT:
                # Decode the flipped IDs; special tokens are skipped.
                record = {
                    f"{self.return_name}_text": self.tokenizer.decode(
                        output_ids,
                        skip_special_tokens=True,
                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                    )
                }
            records.append(record)
        return records
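
As a toy illustration of the flip described above, the following sketch reverses a list of token IDs before mapping them back to text. The vocabulary, the IDs, and the function name flip_and_decode are all made up for illustration. A real tokenizer works on subword pieces and does not join them with the spaces shown here.

```python
def flip_and_decode(output_ids, id_to_token, special_ids=frozenset()):
    """Reverse generated IDs, drop special tokens, and join the rest with spaces."""
    flipped = list(reversed(output_ids))
    tokens = [id_to_token[i] for i in flipped if i not in special_ids]
    return " ".join(tokens)


# Hypothetical toy vocabulary; real IDs come from the model's tokenizer.
toy_vocab = {
    0: "<s>", 1: "</s>",
    10: "밥", 11: "을", 12: "여섯", 13: "시에", 14: "먹었", 15: "어",
}

# Tokens in the order the model emits them (reversed), mirroring the example above.
generated = [0, 15, 14, 13, 12, 11, 10, 1]

print(flip_and_decode(generated, toy_vocab, special_ids={0, 1}))
# 밥 을 여섯 시에 먹었 어
```

Decoding without the flip would return the tokens in the emitted order (어 먹었 시에 여섯 을 밥), which is why the pipeline above flips before tokenizer.decode().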

Evaluation

BLEU and ROUGE are computed with evaluate-metric/bleu and evaluate-metric/rouge from the Hugging Face evaluate library.
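
For intuition about what such overlap scores measure, here is a simplified unigram-overlap F1 (the idea behind ROUGE-1) computed on whitespace tokens. This is only a sketch and not the evaluate library's implementation, which differs in tokenization and aggregation. The function name rouge1_f1 and the wrong prediction below are made up for illustration.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 on whitespace tokens (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "그러게 누가 여섯 시까지 술을 마시래?"
# Hypothetical wrong prediction: 6 read as 육 instead of 여섯.
prediction = "그러게 누가 육 시까지 술을 마시래?"

print(round(rouge1_f1(prediction, reference), 4))
# 0.8333
```

One wrong token out of six lowers both precision and recall to 5/6, so a single misread number is visible in the score even when the rest of the sentence is copied correctly.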

How to Get Started With the Model

from transformers.pipelines import Text2TextGenerationPipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
texts = ["그러게 누가 6시까지 술을 마시래?"]
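# inference_args (and args below) are not defined in this snippet;
# they stand for your own model path and generation settings.
tokenizer = AutoTokenizer.from_pretrained(
    inference_args.model_name_or_path,
)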
model = AutoModelForSeq2SeqLM.from_pretrained(
    inference_args.model_name_or_path,
)
# BartText2TextGenerationPipeline is implemented above (see 'Uses')
seq2seqlm_pipeline = BartText2TextGenerationPipeline(model=model, tokenizer=tokenizer)
kwargs = {
    "min_length": args.min_length,
    "max_length": args.max_length,
    "num_beams": args.beam_width,
    "do_sample": args.do_sample,
    "num_beam_groups": args.num_beam_groups,
}
pred = seq2seqlm_pipeline(texts, **kwargs)
print(pred)
# 그러게 누가 여섯 시까지 술을 마시래?