|
--- |
|
language: |
|
- ko |
|
license: apache-2.0 |
|
library_name: transformers |
|
tags: |
|
- text2text-generation |
|
datasets: |
|
- aihub |
|
metrics: |
|
- bleu |
|
- rouge |
|
|
|
|
|
model-index: |
|
- name: ko-barTNumText |
|
results: |
|
- task: |
|
type: text2text-generation |
|
name: text2text-generation |
|
metrics: |
|
- type: bleu |
|
value: 0.9161441917016176 |
|
name: eval_bleu |
|
verified: true |
|
- type: rouge1 |
|
value: 0.9502159661745533 |
|
name: eval_rouge1 |
|
verified: true |
|
- type: rouge2 |
|
value: 0.9313935147887745 |
|
name: eval_rouge2 |
|
verified: true |
|
- type: rougeL |
|
value: 0.950015374196916 |
|
name: eval_rougeL |
|
verified: true |
|
- type: rougeLsum |
|
value: 0.9500390902948073 |
|
name: eval_rougeLsum |
|
verified: true |
|
--- |
|
|
|
# ko-barTNumText(TNT Model🧨): Try Number To Korean Reading
|
|
|
## Table of Contents |
|
- [ko-barTNumText(TNT Model🧨): Try Number To Korean Reading](#ko-bartnumtexttnt-model-try-number-to-korean-reading)
|
- [Table of Contents](#table-of-contents) |
|
- [Model Details](#model-details) |
|
- [Uses](#uses) |
|
- [Evaluation](#evaluation) |
|
- [How to Get Started With the Model](#how-to-get-started-with-the-model) |
|
|
|
|
|
## Model Details |
|
- **Model Description:**

  I couldn't find an existing model or algorithm for this task, so I built one myself. <br />

  A BartForConditionalGeneration model fine-tuned to convert numbers into their Korean readings. <br />
|
|
|
- Dataset: [Korea aihub](https://aihub.or.kr/aihubdata/data/list.do?currMenu=115&topMenu=100&srchDataRealmCode=REALM002&srchDataTy=DATA004) <br />

  I cannot release the fine-tuning datasets for privacy reasons. <br />
|
|
|
- Note that Korea aihub only permits Koreans to download its data. <br />

  More precisely, the model was trained to translate phonetic transcriptions into orthographic transcriptions (following the ETRI transcription guidelines). <br />
|
|
|
- "Ten million", for example, may be written as 1000만 or as 10000000, so the training datasets are crucial for this model; results can vary depending on how numbers appear in the training data. <br />
|
- **Developed by:** [Yoo SungHyun](https://github.com/YooSungHyun)
|
- **Language(s):** Korean |
|
- **License:** apache-2.0 |
|
- **Parent Model:** See the [kobart-base-v2](https://huggingface.co/gogamza/kobart-base-v2) for more information about the pre-trained base model. |
|
|
|
|
|
## Uses |
|
This model generates its output in reverse token order (e.g. the target 밥을 여섯 시에 먹었어 comes out as 어먹 에시 여섯 을밥). <br />

Because of this, you have to reverse (`flip`) the output before calling `tokenizer.decode()`.
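As a toy illustration of this backward convention (the vocabulary and ids below are made up for illustration, not the real kobart tokenizer):

```python
# Toy sketch of the backward decoding convention.
# The vocabulary and ids are hypothetical; the real model uses subword tokens.
id_to_token = {0: "밥을", 1: "여섯", 2: "시에", 3: "먹었어"}

output_ids = [3, 2, 1, 0]  # the model emits the sentence last token first
decoded = [id_to_token[i] for i in reversed(output_ids)]  # flip, then decode
print(" ".join(decoded))  # 밥을 여섯 시에 먹었어
```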
|
|
|
For more detail, see [KoGPT_num_converter](https://github.com/ddobokki/KoGPT_num_converter), in particular `bart_inference.py` and `bart_train.py`.
|
```python |
|
import torch
from transformers.pipelines.text2text_generation import ReturnType, Text2TextGenerationPipeline


class BartText2TextGenerationPipeline(Text2TextGenerationPipeline):
    def postprocess(self, model_outputs, return_type=ReturnType.TEXT, clean_up_tokenization_spaces=False):
        records = []
        # Generated ids come out last-token-first, so flip them back to natural order
        reversed_model_outputs = torch.flip(model_outputs["output_ids"][0], dims=[-1])
        for output_ids in reversed_model_outputs:
            if return_type == ReturnType.TENSORS:
                record = {f"{self.return_name}_token_ids": output_ids}
            elif return_type == ReturnType.TEXT:
                record = {
                    f"{self.return_name}_text": self.tokenizer.decode(
                        output_ids,
                        skip_special_tokens=True,
                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                    )
                }
            records.append(record)
        return records
|
``` |
|
## Evaluation |
|
Evaluated with `evaluate-metric/bleu` and `evaluate-metric/rouge` from the Hugging Face `evaluate` library.
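For intuition, ROUGE-1 is a unigram-overlap F1 score. A simplified, whitespace-tokenized sketch of the idea (the `evaluate` implementation differs in tokenization and other details):

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    # Unigram-overlap F1 over whitespace tokens (simplified sketch).
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Identical prediction and reference give a perfect score
print(rouge1_f1("그러게 누가 여섯 시까지 술을 마시래?", "그러게 누가 여섯 시까지 술을 마시래?"))  # 1.0
```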
|
## How to Get Started With the Model |
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name_or_path = "ko-barTNumText"  # replace with the local path or Hub id of this model

texts = ["그러게 누가 6시까지 술을 마시래?"]

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)

# BartText2TextGenerationPipeline is implemented above (see 'Uses')
seq2seqlm_pipeline = BartText2TextGenerationPipeline(model=model, tokenizer=tokenizer)

# Example generation settings; tune them for your use case
kwargs = {
    "min_length": 0,
    "max_length": 128,
    "num_beams": 5,
    "do_sample": False,
    "num_beam_groups": 1,
}

pred = seq2seqlm_pipeline(texts, **kwargs)
print(pred)
# 그러게 누가 여섯 시까지 술을 마시래?
|
``` |
|
|