---
language:
- ko
pipeline_tag: text-generation
---

# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->

This model is a fine-tuned version of the Korean large language model [KT-AI/midm-bitext-S-7B-inst-v1](https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1).

Its purpose is to analyze any "food order sentence" and extract the product information it contains.

For example, assume the following order sentence:

```
์ฌ๊ธฐ์ ์ถ์ฒ๋ญ๊ฐ๋น 4์ธ๋ถํ๊ณ ์. ๋ผ๋ฉด์ฌ๋ฆฌ ์ถ๊ฐํ๊ฒ ์ต๋๋ค. ์ฝ๋ผ 300ml ๋์บ์ฃผ์ธ์.
```

The model is then expected to generate product information like:

```
- ๋ถ์ ๊ฒฐ๊ณผ 0: ์์๋ช:์ถ์ฒ๋ญ๊ฐ๋น, ์๋:4์ธ๋ถ
- ๋ถ์ ๊ฒฐ๊ณผ 1: ์์๋ช:๋ผ๋ฉด์ฌ๋ฆฌ
- ๋ถ์ ๊ฒฐ๊ณผ 2: ์์๋ช:์ฝ๋ผ, ์ต์:300ml, ์๋:๋์บ
```
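
Because every analysis line follows the same fixed shape (a `- <label> <index>:` prefix, then comma-separated `key:value` fields), downstream code can turn the model output into dictionaries. The helper below is a hypothetical sketch, not part of the model's release, and it uses English placeholder labels instead of the actual Korean field names:

```python
def parse_analysis(text):
    """Turn analysis lines like '- result 0: name:cola, qty:2'
    into a list of {key: value} dicts, one per ordered item."""
    items = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue  # skip anything that is not an analysis line
        # Drop the '- <label> <index>:' prefix, keep the 'key:value, ...' payload.
        _, _, payload = line.partition(": ")
        fields = {}
        for pair in payload.split(","):
            key, _, value = pair.strip().partition(":")
            if key:
                fields[key] = value
        items.append(fields)
    return items

print(parse_analysis("- result 0: name:chicken, qty:4\n- result 1: name:noodles"))
# → [{'name': 'chicken', 'qty': '4'}, {'name': 'noodles'}]
```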

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [Jangmin Oh](https://huggingface.co/jangmin)
- **Model type:** a decoder-only Transformer
- **Language(s) (NLP):** ko
- **License:** CC-BY-NC 4.0; you must keep the license terms inherited from KT-AI.
- **Finetuned from model:** [KT-AI/midm-bitext-S-7B-inst-v1](https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1)

## Bias, Risks, and Limitations

The model was developed using the GPT-4 API to generate a dataset of order sentences, and it has been fine-tuned on this dataset. Please note that we do not assume any responsibility for risks or damages caused by this model.

## How to Get Started with the Model

Here is a simple example of how to use the model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = 'jangmin/merged-midm-7B-food-order-understanding-30K'

prompt_template = """###System;{System}
###User;{User}
###Midm;"""

default_system_msg = (
    "๋๋ ๋จผ์ ์ฌ์ฉ์๊ฐ ์๋ ฅํ ์ฃผ๋ฌธ ๋ฌธ์ฅ์ ๋ถ์ํ๋ ์์ด์ ํธ์ด๋ค. ์ด๋ก๋ถํฐ ์ฃผ๋ฌธ์ ๊ตฌ์ฑํ๋ ์์๋ช, ์ต์๋ช, ์๋์ ์ฐจ๋ก๋๋ก ์ถ์ถํด์ผ ํ๋ค."
)

def wrapper_generate(model, tokenizer, input_prompt, do_stream=False):
    data = tokenizer(input_prompt, return_tensors="pt")
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    input_ids = data.input_ids[..., :-1]  # drop the last token appended by the tokenizer
    with torch.no_grad():
        pred = model.generate(
            input_ids=input_ids.cuda(),
            streamer=streamer if do_stream else None,
            use_cache=True,
            max_new_tokens=512,  # finite cap; float('inf') is not a valid token count
            do_sample=False,
        )
    decoded_text = tokenizer.batch_decode(pred, skip_special_tokens=True)
    decoded_text = decoded_text[0].replace("<[!newline]>", "\n")
    return decoded_text[len(input_prompt):]

trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

sentence = "์์ด์ค์๋ฉ๋ฆฌ์นด๋ธ ํจ์ฌ์ด์ฆ ํ์ ํ๊ณ ์. ๋ธ๊ธฐ์ค๋ฌด๋ ํ์ ์ฃผ์ธ์. ๋, ์ฝ๋๋ธ๋ฃจ๋ผ๋ผ ํ๋์."
analysis = wrapper_generate(
    model=trained_model,
    tokenizer=tokenizer,
    input_prompt=prompt_template.format(System=default_system_msg, User=sentence),
    do_stream=True,
)
print(analysis)
```
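
For clarity, the prompt fed to the model is simply the template with its two slots filled. A quick standalone check of the template mechanics, using placeholder strings rather than the actual Korean messages:

```python
prompt_template = """###System;{System}
###User;{User}
###Midm;"""

# Fill the two named slots; the model then generates after the '###Midm;' marker.
filled = prompt_template.format(System="<system message>", User="<order sentence>")
print(filled)
# ###System;<system message>
# ###User;<order sentence>
# ###Midm;
```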

## Training Details

### Training Data

The dataset was generated by the GPT-4 API with a carefully designed prompt. The prompt template was designed to produce example pairs of a food-order sentence and its analysis. In total, 30K examples were generated; producing them took about 3,000 API calls and cost roughly $400.

Some generated examples are as follows:

```python
{
  'input': '๋ค์์ ๋งค์ฅ์์ ๊ณ ๊ฐ์ด ์์์ ์ฃผ๋ฌธํ๋ ์ฃผ๋ฌธ ๋ฌธ์ฅ์ด๋ค. ์ด๋ฅผ ๋ถ์ํ์ฌ ์์๋ช, ์ต์๋ช, ์๋์ ์ถ์ถํ์ฌ ๊ณ ๊ฐ์ ์๋๋ฅผ ์ดํดํ๊ณ ์ ํ๋ค.\n๋ถ์ ๊ฒฐ๊ณผ๋ฅผ ์์ฑํด์ฃผ๊ธฐ ๋ฐ๋๋ค.\n\n### ๋ช๋ น: ์ ์ก๋ณถ์ ํ๊ทธ๋ฆํ๊ณ ์, ๋น๋น๋ฐฅ ํ๊ทธ๋ฆ ์ถ๊ฐํด์ฃผ์ธ์. ### ์๋ต:\n',
  'output': '- ๋ถ์ ๊ฒฐ๊ณผ 0: ์์๋ช:์ ์ก๋ณถ์,์๋:ํ๊ทธ๋ฆ\n- ๋ถ์ ๊ฒฐ๊ณผ 1: ์์๋ช:๋น๋น๋ฐฅ,์๋:ํ๊ทธ๋ฆ'
},
{
  'input': '๋ค์์ ๋งค์ฅ์์ ๊ณ ๊ฐ์ด ์์์ ์ฃผ๋ฌธํ๋ ์ฃผ๋ฌธ ๋ฌธ์ฅ์ด๋ค. ์ด๋ฅผ ๋ถ์ํ์ฌ ์์๋ช, ์ต์๋ช, ์๋์ ์ถ์ถํ์ฌ ๊ณ ๊ฐ์ ์๋๋ฅผ ์ดํดํ๊ณ ์ ํ๋ค.\n๋ถ์ ๊ฒฐ๊ณผ๋ฅผ ์์ฑํด์ฃผ๊ธฐ ๋ฐ๋๋ค.\n\n### ๋ช๋ น: ์ฌ์ฒํ์์ก ๊ณฑ๋ฐฐ๊ธฐ ์ฃผ๋ฌธํ๊ณ ์, ์ค์ํฌ๋ฆผ์นํจ๋ ํ๋ ์ถ๊ฐํด์ฃผ์ธ์. ### ์๋ต:\n',
  'output': '- ๋ถ์ ๊ฒฐ๊ณผ 0: ์์๋ช:์ฌ์ฒํ์์ก,์ต์:๊ณฑ๋ฐฐ๊ธฐ\n- ๋ถ์ ๊ฒฐ๊ณผ 1: ์์๋ช:์ค์ํฌ๋ฆผ์นํจ,์๋:ํ๋'
}
```

## Evaluation

The evaluation dataset comprises 3,004 examples, each consisting of a pair: a "food-order sentence" and its corresponding "analysis result" as a reference.

The BLEU scores on the dataset are as follows:

| | llama-2 model | midm model |
|---|---|---|
| score | 93.323054 | 93.878258 |
| counts | [81382, 76854, 72280, 67869] | [81616, 77246, 72840, 68586] |
| totals | [84327, 81323, 78319, 75315] | [84376, 81372, 78368, 75364] |
| precisions | [96.51, 94.5, 92.29, 90.11] | [96.73, 94.93, 92.95, 91.01] |
| bp | 1.0 | 1.0 |
| sys_len | 84327 | 84376 |
| ref_len | 84124 | 84124 |

Here, "llama-2 model" refers to the result of `jangmin/merged-llama2-7b-chat-hf-food-order-understanding-30K`, which was fine-tuned from llama-2-7b-chat-hf.
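
The table rows (clipped n-gram `counts`, candidate `totals`, per-order `precisions`, brevity penalty `bp`, `sys_len`, `ref_len`) are the standard ingredients of corpus-level BLEU. As a rough illustration only (the actual evaluation script is not included in this card, and a library such as sacrebleu handles tokenization more carefully), corpus BLEU over whitespace tokens can be sketched as:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Multiset of all n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU from the same ingredients shown in the table above:
    clipped matches (counts), candidate totals, precisions, brevity penalty."""
    counts = [0] * max_n
    totals = [0] * max_n
    sys_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        sys_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngram_counts(h, n), ngram_counts(r, n)
            counts[n - 1] += sum(min(c, r_ng[g]) for g, c in h_ng.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    precisions = [c / t if t else 0.0 for c, t in zip(counts, totals)]
    bp = 1.0 if sys_len > ref_len else math.exp(1 - ref_len / max(sys_len, 1))
    if not all(precisions):
        return 0.0, precisions, bp
    score = bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
    return 100 * score, precisions, bp

score, precisions, bp = corpus_bleu(
    ["- result 0: name:cola qty:2"],
    ["- result 0: name:cola qty:2"],
)
print(round(score, 2))  # 100.0
```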

## Note for Pretrained Model

The citation of the pretrained model:

```
@misc{kt-mi:dm,
  title = {Mi:dm: KT Bilingual (Korean,English) Generative Pre-trained Transformer},
  author = {KT},
  year = {2023},
  url = {https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1},
  howpublished = {\url{https://genielabs.ai}},
}
```

## Model Card Authors

Jangmin Oh

## Model Card Contact

Jangmin Oh