---
language:
- ko
license: apache-2.0
library_name: transformers
tags:
- text2text-generation
datasets:
- aihub
metrics:
- bleu
- rouge

model-index:
- name: ko-barTNumText
  results:
  - task:
      type: text2text-generation
      name: text2text-generation
    metrics:
      - type: bleu
        value: 0.9161441917016176
        name: eval_bleu
        verified: true
      - type: rouge1
        value: 0.9502159661745533
        name: eval_rouge1
        verified: true
      - type: rouge2
        value: 0.9313935147887745
        name: eval_rouge2
        verified: true
      - type: rougeL
        value: 0.950015374196916
        name: eval_rougeL
        verified: true
      - type: rougeLsum
        value: 0.9500390902948073
        name: eval_rougeLsum
        verified: true
---

# ko-barTNumText(TNT Model๐Ÿงจ): Try Number To Korean Reading(์ˆซ์ž๋ฅผ ํ•œ๊ธ€๋กœ ๋ฐ”๊พธ๋Š” ๋ชจ๋ธ)

## Table of Contents
- [ko-barTNumText(TNT Model๐Ÿงจ): Try Number To Korean Reading(์ˆซ์ž๋ฅผ ํ•œ๊ธ€๋กœ ๋ฐ”๊พธ๋Š” ๋ชจ๋ธ)](#ko-bartnumtexttnt-model-try-number-to-korean-reading์ˆซ์ž๋ฅผ-ํ•œ๊ธ€๋กœ-๋ฐ”๊พธ๋Š”-๋ชจ๋ธ)
  - [Table of Contents](#table-of-contents)
  - [Model Details](#model-details)
  - [Uses](#uses)
  - [Evaluation](#evaluation)
  - [How to Get Started With the Model](#how-to-get-started-with-the-model)


## Model Details
- **Model Description:**
I could not find an existing model or algorithm for this task, so I built one. <br />
It is a `BartForConditionalGeneration` model fine-tuned to convert numbers into their Korean (Hangul) readings. <br />

- The dataset comes from [Korea aihub](https://aihub.or.kr/aihubdata/data/list.do?currMenu=115&topMenu=100&srchDataRealmCode=REALM002&srchDataTy=DATA004). <br />
For personal reasons, I cannot release all of the data used for fine-tuning. <br />

- Korea aihub data is available ONLY to Koreans. <br />
Since anyone able to download the data from aihub will be Korean, the notes on it were originally written in Korean only. <br />
More precisely, the model was trained to translate phonetic transcription (음성전사) into orthographic transcription (철자전사), following the ETRI transcription standard. <br />

- The same number can be written in several ways: ten million (천만) may appear as 1000만 or as 10000000, so results can differ depending on the training datasets. <br />
**Results can differ markedly depending on the spacing between a numeral determiner (수관형사) and a counter bound noun (의존명사). (쉰살, 쉰 살 -> 쉰살, 50살)** https://eretz2.tistory.com/34 <br />
Because I do not know how the model will be used, I did not fix one convention and bias training toward it; the choice is left to the distribution of the training data. (Is 쉰 살 more common, or 쉰살!?)
- **Developed by:** [Yoo SungHyun](https://github.com/YooSungHyun)
- **Language(s):** Korean
- **License:** apache-2.0
- **Parent Model:** See [kobart-base-v2](https://huggingface.co/gogamza/kobart-base-v2) for more information about the pre-trained base model.
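
The notational ambiguity mentioned above (1000만 versus 10000000) can be made concrete with a small sketch. This is not part of the model or its training code; `mixed_notation_to_int` and the `UNITS` table are hypothetical names introduced only for illustration, and the helper handles only digits followed by an optional 만/억/조 unit.

```python
import re

# Korean large-number units (assumption: only these three are handled here).
UNITS = {"": 1, "만": 10**4, "억": 10**8, "조": 10**12}


def mixed_notation_to_int(text: str) -> int:
    """Sum every 'digits + optional unit' chunk found in the text.

    Fully spelled-out forms such as '천만' contain no digits and are not handled.
    """
    total = 0
    for digits, unit in re.findall(r"(\d+)(만|억|조)?", text.replace(",", "")):
        total += int(digits) * UNITS[unit]
    return total


# Two surface forms of the same value, as described in the model card.
print(mixed_notation_to_int("1000만"))     # 10000000
print(mixed_notation_to_int("10000000"))  # 10000000
```

Both spellings denote the same value, yet they are different strings to a sequence-to-sequence model, which is why the output depends on which form dominates the training data.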
  
  
## Uses
This model generates its output tokens in reverse order at inference time (밥을 6시에 먹었어 -> 어 먹었 시에 여섯 을 밥), so you have to `flip` the output token IDs before calling `tokenizer.decode()`. <br />

For more detail, see `bart_inference.py` and `bart_train.py` in [KoGPT_num_converter](https://github.com/ddobokki/KoGPT_num_converter).
```python
import torch
from transformers.pipelines import Text2TextGenerationPipeline
from transformers.pipelines.text2text_generation import ReturnType


class BartText2TextGenerationPipeline(Text2TextGenerationPipeline):
    def postprocess(self, model_outputs, return_type=ReturnType.TEXT, clean_up_tokenization_spaces=False):
        records = []
        # The model emits tokens back to front, so reverse each generated
        # sequence along the token axis before decoding.
        reversed_model_outputs = torch.flip(model_outputs["output_ids"][0], dims=[-1])
        for output_ids in reversed_model_outputs:
            if return_type == ReturnType.TENSORS:
                record = {f"{self.return_name}_token_ids": output_ids}
            elif return_type == ReturnType.TEXT:
                record = {
                    f"{self.return_name}_text": self.tokenizer.decode(
                        output_ids,
                        skip_special_tokens=True,
                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                    )
                }
            records.append(record)
        return records
```
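
To see what the flip does without loading the model, here is a dependency-free sketch. It uses the token strings from the example sentence in this section rather than real token IDs, and `flip_last_dim` is a hypothetical helper that mimics `torch.flip(..., dims=[-1])` on a plain list of sequences.

```python
def flip_last_dim(batch):
    """Reverse every sequence in the batch, leaving the batch order unchanged."""
    return [list(reversed(sequence)) for sequence in batch]


# One generated sequence, in the back-to-front order the model produces.
# (Toy tokens taken from the example above; the real pipeline flips token IDs.)
generated_backward = [["어", "먹었", "시에", "여섯", "을", "밥"]]

restored = flip_last_dim(generated_backward)
print(restored[0])  # ['밥', '을', '여섯', '시에', '먹었', '어']
```

Only the token axis is reversed; if several sequences are returned, each is flipped independently and their order in the batch is preserved.
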
## Evaluation
Evaluation uses `evaluate-metric/bleu` and `evaluate-metric/rouge` from the Hugging Face `evaluate` library. <br />
[Training W&B run](https://wandb.ai/bart_tadev/BartForConditionalGeneration/runs/2dt1d2b0?workspace=user-bart_tadev)
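
As a rough intuition for what a ROUGE-1 score counts, here is a simplified sketch. It is not the `evaluate` implementation: it assumes whitespace tokenization (the tokenization behind the reported scores is not stated here), and `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F-measure on whitespace tokens (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "그러게 누가 여섯 시까지 술을 마시래?"
unconverted = "그러게 누가 6시까지 술을 마시래?"

print(rouge1_f1(reference, reference))              # 1.0
print(round(rouge1_f1(unconverted, reference), 4))  # 0.7273
```

Leaving the number unconverted costs two reference tokens (여섯 and 시까지) and adds one unmatched token (6시까지), which is what pulls the score below 1.
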
## How to Get Started With the Model
```python
from transformers.pipelines import Text2TextGenerationPipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# `args` stands for the parsed arguments of your own inference script;
# replace its fields with your model path and generation settings.
texts = ["그러게 누가 6시까지 술을 마시래?"]
tokenizer = AutoTokenizer.from_pretrained(
    args.model_name_or_path,
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    args.model_name_or_path,
)
# BartText2TextGenerationPipeline is implemented above (see 'Uses')
seq2seqlm_pipeline = BartText2TextGenerationPipeline(model=model, tokenizer=tokenizer)
kwargs = {
    "min_length": args.min_length,
    "max_length": args.max_length,
    "num_beams": args.beam_width,
    "do_sample": args.do_sample,
    "num_beam_groups": args.num_beam_groups,
}
pred = seq2seqlm_pipeline(texts, **kwargs)
print(pred)
# 그러게 누가 여섯 시까지 술을 마시래?
```