---
language:
- ko
license: apache-2.0
library_name: transformers
tags:
- text2text-generation
datasets:
- aihub
metrics:
- bleu
- rouge

model-index:
- name: ko-barTNumText
  results:
  - task:
      type: text2text-generation
      name: text2text-generation
    metrics:
      - type: bleu
        value: 0.9161441917016176
        name: eval_bleu
        verified: true
      - type: rouge1
        value: 0.9502159661745533
        name: eval_rouge1
        verified: true
      - type: rouge2
        value: 0.9313935147887745
        name: eval_rouge2
        verified: true
      - type: rougeL
        value: 0.950015374196916
        name: eval_rougeL
        verified: true
      - type: rougeLsum
        value: 0.9500390902948073
        name: eval_rougeLsum
        verified: true
---

# ko-barTNumText (TNT Model🧨): Try Number To Korean Reading (a model that converts numbers to Korean)

## Table of Contents
- [ko-barTNumText (TNT Model🧨): Try Number To Korean Reading (a model that converts numbers to Korean)](#ko-bartnumtext-tnt-model-try-number-to-korean-reading-a-model-that-converts-numbers-to-korean)
  - [Table of Contents](#table-of-contents)
  - [Model Details](#model-details)
  - [Uses](#uses)
  - [Evaluation](#evaluation)
  - [How to Get Started With the Model](#how-to-get-started-with-the-model)


## Model Details
- **Model Description:**
I could not find an existing model or algorithm for this task, so I built one. <br />
This is a `BartForConditionalGeneration` model fine-tuned to convert numbers in Korean text into their Korean readings. <br />

- The training data comes from [Korea aihub](https://aihub.or.kr/aihubdata/data/list.do?currMenu=115&topMenu=100&srchDataRealmCode=REALM002&srchDataTy=DATA004). <br />
For private reasons, I cannot release the full dataset used for fine-tuning. <br />

- Korea aihub data is available ONLY to Koreans, so the notes on the data were originally written in Korean. <br />
More precisely, the model is trained to translate phonetic transcription (음성전사) into orthographic transcription (철자전사), following the ETRI transcription standard. <br />

- Number notation varies: ten million may be written as 1000만 (10 million) or as 10000000, so results depend on the training datasets. <br />
**Results can differ markedly depending on the spacing between a numeral determiner and a numeral bound noun. (쉰살, 쉰 살 -> 쉰살, 50살)** https://eretz2.tistory.com/34 <br />
Because I do not know how the model will be used, I did not bias training toward one convention and left it to the distribution of the training data. (Is 쉰 살 more common, or 쉰살?)
- **Developed by:**  Yoo SungHyun(https://github.com/YooSungHyun)
- **Language(s):** Korean
- **License:** apache-2.0
- **Parent Model:** See [kobart-base-v2](https://huggingface.co/gogamza/kobart-base-v2) for more information about the pre-trained base model.
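
To make the notation issue above concrete, here is a minimal rule-based sketch. It is not the method this model uses. It only shows that 1000만 and 10000000 name the same value and share one Sino-Korean reading. The function names `parse_mixed` and `sino_korean` are hypothetical helpers introduced for this illustration.

```python
import re

DIGITS = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
SMALL_UNITS = ["", "십", "백", "천"]  # positions inside a 4-digit group
LARGE_UNITS = ["", "만", "억", "조"]  # one unit per 4-digit group
MULTIPLIERS = {"": 1, "만": 10**4, "억": 10**8, "조": 10**12}


def parse_mixed(text: str) -> int:
    """Parse digits with an optional unit suffix, e.g. '1000만' or '10000000'."""
    match = re.fullmatch(r"(\d+)(만|억|조)?", text)
    if match is None:
        raise ValueError(f"unsupported notation: {text!r}")
    return int(match.group(1)) * MULTIPLIERS[match.group(2) or ""]


def read_group(group: int) -> str:
    """Read a value in 1..9999, omitting 일 before 십/백/천."""
    out = []
    for pos in range(3, -1, -1):
        digit = (group // 10**pos) % 10
        if digit == 0:
            continue
        if digit == 1 and pos > 0:
            out.append(SMALL_UNITS[pos])
        else:
            out.append(DIGITS[digit] + SMALL_UNITS[pos])
    return "".join(out)


def sino_korean(n: int) -> str:
    """Sino-Korean reading of an integer in 0..10**16 - 1."""
    if n == 0:
        return "영"
    if not 0 < n < 10**16:
        raise ValueError("value out of supported range")
    parts, idx = [], 0
    while n > 0:
        n, group = divmod(n, 10_000)
        if group:
            # 10000 is conventionally read 만 rather than 일만
            text = "" if (group == 1 and idx == 1) else read_group(group)
            parts.append(text + LARGE_UNITS[idx])
        idx += 1
    return "".join(reversed(parts))


print(sino_korean(parse_mixed("1000만")))    # 천만
print(sino_korean(parse_mixed("10000000")))  # 천만
```

A table like this cannot choose between Sino-Korean and native readings from context: it reads 6 as 육, while the example in this card reads 6시 as 여섯 시. That kind of context dependence is why a learned model is used here.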
  
  
## Uses
This model predicts tokens in reverse order at inference time (밥을 6시에 먹었어 -> 어 먹었 시에 여섯 을 밥). <br />
You therefore have to `flip` the output ids back into forward order before calling `tokenizer.decode()`.

For more detail, see [KoGPT_num_converter](https://github.com/ddobokki/KoGPT_num_converter), in particular `bart_inference.py` and `bart_train.py`.
```python
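import torch
from transformers.pipelines.text2text_generation import ReturnType, Text2TextGenerationPipeline


class BartText2TextGenerationPipeline(Text2TextGenerationPipeline):
    def postprocess(self, model_outputs, return_type=ReturnType.TEXT, clean_up_tokenization_spaces=False):
        records = []
        # The model emits tokens in reverse order, so flip each sequence
        # along the token axis before decoding.
        reversed_model_outputs = torch.flip(model_outputs["output_ids"][0], dims=[-1])
        for output_ids in reversed_model_outputs:
            if return_type == ReturnType.TENSORS:
                # Return the (already flipped) token ids as-is.
                record = {f"{self.return_name}_token_ids": output_ids}
            elif return_type == ReturnType.TEXT:
                # Decode the flipped ids; special tokens such as padding are dropped.
                record = {
                    f"{self.return_name}_text": self.tokenizer.decode(
                        output_ids,
                        skip_special_tokens=True,
                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                    )
                }
            records.append(record)
        return records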
```
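
The reversal step can also be looked at in isolation. The sketch below is plain Python with a hypothetical toy vocabulary (`TOY_VOCAB`) and a hypothetical helper (`decode_reversed`). It stands in for `torch.flip` plus `tokenizer.decode()` and is only meant to show why the flip is needed. It does not reproduce the real tokenizer, which merges subword pieces instead of joining them with spaces.

```python
from typing import Dict, List

# Hypothetical toy vocabulary, for illustration only (not the model's real vocab).
TOY_VOCAB: Dict[int, str] = {
    0: "<pad>", 1: "밥", 2: "을", 3: "여섯", 4: "시에", 5: "먹었", 6: "어",
}


def decode_reversed(output_ids: List[int], vocab: Dict[int, str], pad_id: int = 0) -> str:
    """Flip a right-to-left token sequence, drop padding, and join the tokens."""
    flipped = output_ids[::-1]  # same effect as torch.flip on a 1-D tensor
    tokens = [vocab[i] for i in flipped if i != pad_id]
    return " ".join(tokens)


# Ids as a reversed-order model would emit them: last token first, then padding.
generated = [6, 5, 4, 3, 2, 1, 0, 0]
print(decode_reversed(generated, TOY_VOCAB))  # 밥 을 여섯 시에 먹었 어
```

Decoding `generated` without the flip would give 어 먹었 시에 여섯 을 밥, which is the reversed sentence shown in the example above.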
## Evaluation
Evaluation uses only `evaluate-metric/bleu` and `evaluate-metric/rouge` from the Hugging Face `evaluate` library. <br />
[Training W&B run](https://wandb.ai/bart_tadev/BartForConditionalGeneration/runs/2dt1d2b0?workspace=user-bart_tadev)
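
To give a feel for what an overlap metric rewards, here is a simplified sketch of a ROUGE-1 style F1 score over whitespace-separated tokens. It is a swapped-in, pure-Python approximation for illustration, not the `evaluate` implementation and not the code that produced the scores reported in this card. The function name `unigram_f1` is hypothetical.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 of the unigram overlap between whitespace-tokenized strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "그러게 누가 여섯 시까지 술을 마시래?"
print(unigram_f1("그러게 누가 여섯 시까지 술을 마시래?", reference))  # 1.0
print(round(unigram_f1("그러게 누가 6시까지 술을 마시래?", reference), 4))  # 0.7273
```

Leaving the number unconverted costs two reference tokens (여섯 and 시까지) and adds one unmatched token (6시까지), so spacing conventions in the reference data directly affect scores of this kind.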
## How to Get Started With the Model
```python
from transformers.pipelines import Text2TextGenerationPipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# `args` is assumed to be an already-parsed namespace (e.g. from argparse) that
# provides model_name_or_path, min_length, max_length, beam_width, do_sample
# and num_beam_groups.
texts = ["그러게 누가 6시까지 술을 마시래?"]
tokenizer = AutoTokenizer.from_pretrained(
    args.model_name_or_path,
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    args.model_name_or_path,
)
# BartText2TextGenerationPipeline is implemented above (see 'Uses')
seq2seqlm_pipeline = BartText2TextGenerationPipeline(model=model, tokenizer=tokenizer)
# Generation options forwarded to the pipeline call
kwargs = {
    "min_length": args.min_length,
    "max_length": args.max_length,
    "num_beams": args.beam_width,
    "do_sample": args.do_sample,
    "num_beam_groups": args.num_beam_groups,
}
pred = seq2seqlm_pipeline(texts, **kwargs)
print(pred)
# 그러게 누가 여섯 시까지 술을 마시래?
```