---
language:
- ko
pipeline_tag: text-generation
---
# Model Card for jangmin/merged-midm-7B-food-order-understanding-30K
<!-- Provide a quick summary of what the model is/does. -->
The model is a fine-tuned version of the Korean large language model [KT-AI/midm-bitext-S-7B-inst-v1](https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1).
Its purpose is to analyze any "food order sentence" and extract the ordered products from it.
For example, consider the following order sentence:
```
여기요 춘천닭갈비 4인분하고요. 라면사리 추가하겠습니다. 콜라 300ml 두캔주세요.
```
(Roughly: "Excuse me, four servings of Chuncheon dak-galbi, please. I'll add ramen noodles. And two 300 ml cans of cola.")
Then the model is expected to generate the product information like:
```
- 분석 결과 0: 음식명:춘천닭갈비, 수량:4인분
- 분석 결과 1: 음식명:라면사리
- 분석 결과 2: 음식명:콜라, 옵션:300ml, 수량:두캔
```
(Here 분석 결과 means "analysis result", 음식명 "food name", 옵션 "option", and 수량 "quantity".)
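The analysis output is plain text; a small helper (hypothetical, not shipped with the model) can turn those lines into dictionaries. The field names 분석 결과 (analysis result), 음식명 (food name), 옵션 (option), and 수량 (quantity) follow the examples above:

``` python
import re

def parse_analysis(text: str):
    """Parse '- 분석 결과 N: key:value, key:value' lines into a list of dicts."""
    items = []
    for line in text.strip().splitlines():
        match = re.match(r"-\s*분석 결과 \d+:\s*(.*)", line)
        if not match:
            continue
        fields = {}
        for pair in match.group(1).split(","):
            key, _, value = pair.partition(":")
            fields[key.strip()] = value.strip()
        items.append(fields)
    return items

orders = parse_analysis("- 분석 결과 0: 음식명:콜라, 옵션:300ml, 수량:두캔")
# orders[0] == {'음식명': '콜라', '옵션': '300ml', '수량': '두캔'}
```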
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** [Jangmin Oh](https://huggingface.co/jangmin)
- **Model type:** a Decoder-only Transformer
- **Language(s) (NLP):** ko
- **License:** CC-BY-NC 4.0, inherited from KT-AI.
- **Finetuned from model:** [KT-AI/midm-bitext-S-7B-inst-v1](https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1)
## Bias, Risks, and Limitations
The current model was developed using the GPT-4 API to generate a dataset for order sentences, and it has been fine-tuned on this dataset. Please note that we do not assume any responsibility for risks or damages caused by this model.
## How to Get Started with the Model
This is a simple example of how to use the model.
If you want to load the fine-tuned model in INT4, specify `load_in_4bit=True` instead of `load_in_8bit=True`.
``` python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
model_id = 'jangmin/merged-midm-7B-food-order-understanding-30K'
prompt_template = """###System;{System}
###User;{User}
###Midm;"""
default_system_msg = (
    # "You are an agent that first analyzes the order sentence entered by the user.
    #  From it, extract in order the food names, option names, and quantities that
    #  make up the order."
    "너는 먼저 사용자가 입력한 주문 문장을 분석하는 에이전트이다. "
    "이로부터 주문을 구성하는 음식명, 옵션명, 수량을 차례대로 추출해야 한다."
)
def wrapper_generate(model, tokenizer, input_prompt, do_stream=False):
    data = tokenizer(input_prompt, return_tensors="pt")
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Drop the final token appended by the tokenizer before generating.
    input_ids = data.input_ids[..., :-1]
    with torch.no_grad():
        pred = model.generate(
            input_ids=input_ids.cuda(),
            streamer=streamer if do_stream else None,
            use_cache=True,
            max_new_tokens=512,  # generate() expects an int; raise this if outputs are truncated
            do_sample=False,
        )
    decoded_text = tokenizer.batch_decode(pred, skip_special_tokens=True)
    decoded_text = decoded_text[0].replace("<[!newline]>", "\n")
    return decoded_text[len(input_prompt):]
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
# "One tall iced americano, please. One strawberry smoothie. Also, one cold brew latte."
sentence = "아이스아메리카노 톨사이즈 한잔 하고요. 딸기스무디 한잔 주세요. 또, 콜드브루라떼 하나요."
analysis = wrapper_generate(
model=trained_model,
tokenizer=tokenizer,
input_prompt=prompt_template.format(System=default_system_msg, User=sentence),
do_stream=False
)
print(analysis)
```
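On newer transformers releases the bare `load_in_8bit`/`load_in_4bit` flags are deprecated in favor of `BitsAndBytesConfig`. A sketch of an equivalent 4-bit setup (the quantization parameters below are common defaults, not values verified for this model):

``` python
import torch
from transformers import BitsAndBytesConfig

# NF4 4-bit quantization config; these are common defaults, not settings
# validated by the model author.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Pass it instead of the load_in_8bit flag:
# trained_model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     quantization_config=bnb_config,
#     device_map="auto",
#     trust_remote_code=True,
# )
```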
## Training Details
### Training Data
The dataset was generated by the GPT-4 API with a carefully designed prompt. The prompt template is designed to elicit pairs of a food-order sentence and its analysis. A total of 30K examples were generated; note that this cost about $400 across 3,000 API calls.
Some generated examples are as follows:
``` python
{
    'input': '다음은 매장에서 고객이 음식을 주문하는 주문 문장이다. 이를 분석하여 음식명, 옵션명, 수량을 추출하여 고객의 의도를 이해하고자 한다.\n분석 결과를 완성해주기 바란다.\n\n### 명령: 제육볶음 한그릇하고요, 비빔밥 한그릇 추가해주세요. ### 응답:\n',
    'output': '- 분석 결과 0: 음식명:제육볶음,수량:한그릇\n- 분석 결과 1: 음식명:비빔밥,수량:한그릇'
},
{
    'input': '다음은 매장에서 고객이 음식을 주문하는 주문 문장이다. 이를 분석하여 음식명, 옵션명, 수량을 추출하여 고객의 의도를 이해하고자 한다.\n분석 결과를 완성해주기 바란다.\n\n### 명령: 사천탕수육 곱배기 주문하고요, 스윗크림치킨도 하나 추가해주세요. ### 응답:\n',
    'output': '- 분석 결과 0: 음식명:사천탕수육,옵션:곱배기\n- 분석 결과 1: 음식명:스윗크림치킨,수량:하나'
}
```
## Evaluation
"The evaluation dataset comprises 3,004 examples, each consisting of a pair: a 'food-order sentence' and its corresponding 'analysis result' as a reference."
The bleu scores on the dataset are as follows.
| | llama-2 model | midm model |
|---|---|---|
| score | 93.323054 | 93.878258 |
| counts | [81382, 76854, 72280, 67869] | [81616, 77246, 72840, 68586] |
| totals | [84327, 81323, 78319, 75315] | [84376, 81372, 78368, 75364] |
| precisions | [96.51, 94.5, 92.29, 90.11] | [96.73, 94.93, 92.95, 91.01] |
| bp | 1.0 | 1.0 |
| sys_len | 84327 | 84376 |
| ref_len | 84124 | 84124 |
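As a sanity check, the corpus-level BLEU in the table is the brevity penalty times the geometric mean of the four n-gram precisions, which can be recomputed from the reported counts (midm column shown; the llama-2 column works the same way):

``` python
import math

# n-gram match counts and totals for the midm model, taken from the table above
counts = [81616, 77246, 72840, 68586]
totals = [84376, 81372, 78368, 75364]
bp = 1.0  # sys_len (84376) >= ref_len (84124), so no brevity penalty

precisions = [100 * c / t for c, t in zip(counts, totals)]
bleu = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print([round(p, 2) for p in precisions])  # matches the precisions row
print(round(bleu, 2))  # matches the midm score of ~93.88
```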
The llama-2 column refers to [jangmin/merged-llama2-7b-chat-hf-food-order-understanding-30K](https://huggingface.co/jangmin/merged-llama2-7b-chat-hf-food-order-understanding-30K), which was fine-tuned on llama-2-7b-chat-hf.
## Note for Pretrained Model
Citation for the pretrained model:
```
@misc{kt-mi:dm,
      title = {Mi:dm: KT Bilingual (Korean,English) Generative Pre-trained Transformer},
      author = {KT},
      year = {2023},
      url = {https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1},
      howpublished = {\url{https://genielabs.ai}},
}
```
## Model Card Authors
Jangmin Oh
## Model Card Contact
Jangmin Oh