---
language:
- ko
pipeline_tag: text-generation
---
# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->

The model is a fine-tuned version of the Korean large language model [KT-AI/midm-bitext-S-7B-inst-v1](https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1).

The purpose of the model is to analyze any "food order sentence" and extract product information from it.

For example, consider the following order sentence (roughly: "Four servings of Chuncheon dak-galbi, please. I'll add ramen noodles. And two 300 ml cans of cola."):
```
์—ฌ๊ธฐ์š” ์ถ˜์ฒœ๋‹ญ๊ฐˆ๋น„ 4์ธ๋ถ„ํ•˜๊ณ ์š”. ๋ผ๋ฉด์‚ฌ๋ฆฌ ์ถ”๊ฐ€ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ์ฝœ๋ผ 300ml ๋‘์บ”์ฃผ์„ธ์š”.
```
Then the model is expected to generate product information like:
```
- ๋ถ„์„ ๊ฒฐ๊ณผ 0: ์Œ์‹๋ช…:์ถ˜์ฒœ๋‹ญ๊ฐˆ๋น„, ์ˆ˜๋Ÿ‰:4์ธ๋ถ„
- ๋ถ„์„ ๊ฒฐ๊ณผ 1: ์Œ์‹๋ช…:๋ผ๋ฉด์‚ฌ๋ฆฌ
- ๋ถ„์„ ๊ฒฐ๊ณผ 2: ์Œ์‹๋ช…:์ฝœ๋ผ, ์˜ต์…˜:300ml, ์ˆ˜๋Ÿ‰:๋‘์บ”
```
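
The analysis lines follow a simple `field:value` format, so they are easy to post-process into structured data. Below is a minimal parsing sketch; the helper `parse_analysis` and its regular expression are illustrative, not part of the model's code:

``` python
import re

# Illustrative parser for the "- ๋ถ„์„ ๊ฒฐ๊ณผ N: ..." lines shown above.
# Field names such as ์Œ์‹๋ช… (food name), ์˜ต์…˜ (option), and ์ˆ˜๋Ÿ‰ (quantity)
# come from the sample outputs and may not cover every case.
LINE_RE = re.compile(r"-\s*๋ถ„์„ ๊ฒฐ๊ณผ\s*\d+:\s*(.+)")

def parse_analysis(text):
    items = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        fields = {}
        for pair in match.group(1).split(","):
            key, _, value = pair.partition(":")
            fields[key.strip()] = value.strip()
        items.append(fields)
    return items

# parse_analysis("- ๋ถ„์„ ๊ฒฐ๊ณผ 0: ์Œ์‹๋ช…:์ถ˜์ฒœ๋‹ญ๊ฐˆ๋น„, ์ˆ˜๋Ÿ‰:4์ธ๋ถ„")
# -> [{'์Œ์‹๋ช…': '์ถ˜์ฒœ๋‹ญ๊ฐˆ๋น„', '์ˆ˜๋Ÿ‰': '4์ธ๋ถ„'}]
```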

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [Jangmin Oh](https://huggingface.co/jangmin)
- **Model type:** a Decoder-only Transformer
- **Language(s) (NLP):** ko
- **License:** CC-BY-NC 4.0, inherited from the KT-AI base model.
- **Finetuned from model:** [KT-AI/midm-bitext-S-7B-inst-v1](https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1)



## Bias, Risks, and Limitations

The model was fine-tuned on a dataset of order sentences generated with the GPT-4 API. Please note that the author does not assume any responsibility for risks or damages caused by this model.

## How to Get Started with the Model

Below is a simple usage example. To load the fine-tuned model in INT4 instead of INT8, pass `load_in_4bit=True` instead of `load_in_8bit=True` (an INT4 variant is sketched after the example).

``` python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = 'jangmin/merged-midm-7B-food-order-understanding-30K'

prompt_template = """###System;{System}
###User;{User}
###Midm;"""

# System message (roughly): "You are an agent that first analyzes the order
# sentence entered by the user. From it, you must extract, in order, the food
# name, option name, and quantity."
default_system_msg = (
    "๋„ˆ๋Š” ๋จผ์ € ์‚ฌ์šฉ์ž๊ฐ€ ์ž…๋ ฅํ•œ ์ฃผ๋ฌธ ๋ฌธ์žฅ์„ ๋ถ„์„ํ•˜๋Š” ์—์ด์ „ํŠธ์ด๋‹ค. ์ด๋กœ๋ถ€ํ„ฐ ์ฃผ๋ฌธ์„ ๊ตฌ์„ฑํ•˜๋Š” ์Œ์‹๋ช…, ์˜ต์…˜๋ช…, ์ˆ˜๋Ÿ‰์„ ์ฐจ๋ก€๋Œ€๋กœ ์ถ”์ถœํ•ด์•ผ ํ•œ๋‹ค."
)

def wrapper_generate(model, tokenizer, input_prompt, do_stream=False):
    data = tokenizer(input_prompt, return_tensors="pt")
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    input_ids = data.input_ids[..., :-1]  # drop the trailing EOS token appended by the tokenizer
    with torch.no_grad():
        pred = model.generate(
            input_ids=input_ids.cuda(),
            streamer=streamer if do_stream else None,
            use_cache=True,
            max_new_tokens=512,  # a finite budget (float('inf') is not a valid max_new_tokens); adjust as needed
            do_sample=False,
        )
    decoded_text = tokenizer.batch_decode(pred, skip_special_tokens=True)
    decoded_text = decoded_text[0].replace("<[!newline]>", "\n")
    return decoded_text[len(input_prompt):]  # strip the prompt, keep only the generated analysis

trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,  # INT8 loading; for INT4, see the note above and the sketch below
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

sentence = "์•„์ด์Šค์•„๋ฉ”๋ฆฌ์นด๋…ธ ํ†จ์‚ฌ์ด์ฆˆ ํ•œ์ž” ํ•˜๊ณ ์š”. ๋”ธ๊ธฐ์Šค๋ฌด๋”” ํ•œ์ž” ์ฃผ์„ธ์š”. ๋˜, ์ฝœ๋“œ๋ธŒ๋ฃจ๋ผ๋–ผ ํ•˜๋‚˜์š”."  # "One tall iced americano, one strawberry smoothie, and one cold brew latte, please."
analysis = wrapper_generate(
    model=trained_model,
    tokenizer=tokenizer,
    input_prompt=prompt_template.format(System=default_system_msg, User=sentence),
    do_stream=False
)
print(analysis)
```
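
For reference, an INT4 load of the same model might look like the following. This sketch uses `BitsAndBytesConfig`, which recent `transformers` versions recommend over the bare `load_in_4bit` flag; either approach should work:

``` python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# INT4 variant of the loading call above.
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    trust_remote_code=True,
)
```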

## Training Details

### Training Data

The dataset was generated with the GPT-4 API using a carefully designed prompt. The prompt template was designed to produce example pairs of a food-order sentence and its analysis. In total, 30K examples were generated. Note that generating them took 3,000 API calls (about 10 examples per call) and cost about $400.
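
The exact generation prompt is not published here; the sketch below only illustrates the overall loop under the numbers stated above. `GENERATION_PROMPT` is a hypothetical placeholder, and the `openai` Python SDK is assumed:

``` python
import json
from openai import OpenAI

client = OpenAI()
# Hypothetical placeholder for the author's carefully designed prompt,
# asking GPT-4 for ~10 input/output pairs serialized as a JSON list.
GENERATION_PROMPT = "..."

examples = []
for _ in range(3000):  # 3,000 calls x ~10 examples = ~30K examples
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": GENERATION_PROMPT}],
    )
    examples.extend(json.loads(response.choices[0].message.content))
```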

Some generated examples are as follows:

``` python
{
  'input': '๋‹ค์Œ์€ ๋งค์žฅ์—์„œ ๊ณ ๊ฐ์ด ์Œ์‹์„ ์ฃผ๋ฌธํ•˜๋Š” ์ฃผ๋ฌธ ๋ฌธ์žฅ์ด๋‹ค. ์ด๋ฅผ ๋ถ„์„ํ•˜์—ฌ ์Œ์‹๋ช…, ์˜ต์…˜๋ช…, ์ˆ˜๋Ÿ‰์„ ์ถ”์ถœํ•˜์—ฌ ๊ณ ๊ฐ์˜ ์˜๋„๋ฅผ ์ดํ•ดํ•˜๊ณ ์ž ํ•œ๋‹ค.\n๋ถ„์„ ๊ฒฐ๊ณผ๋ฅผ ์™„์„ฑํ•ด์ฃผ๊ธฐ ๋ฐ”๋ž€๋‹ค.\n\n### ๋ช…๋ น: ์ œ์œก๋ณถ์Œ ํ•œ๊ทธ๋ฆ‡ํ•˜๊ณ ์š”, ๋น„๋น”๋ฐฅ ํ•œ๊ทธ๋ฆ‡ ์ถ”๊ฐ€ํ•ด์ฃผ์„ธ์š”. ### ์‘๋‹ต:\n',
  'output': '- ๋ถ„์„ ๊ฒฐ๊ณผ 0: ์Œ์‹๋ช…:์ œ์œก๋ณถ์Œ,์ˆ˜๋Ÿ‰:ํ•œ๊ทธ๋ฆ‡\n- ๋ถ„์„ ๊ฒฐ๊ณผ 1: ์Œ์‹๋ช…:๋น„๋น”๋ฐฅ,์ˆ˜๋Ÿ‰:ํ•œ๊ทธ๋ฆ‡'
},
{
  'input': '๋‹ค์Œ์€ ๋งค์žฅ์—์„œ ๊ณ ๊ฐ์ด ์Œ์‹์„ ์ฃผ๋ฌธํ•˜๋Š” ์ฃผ๋ฌธ ๋ฌธ์žฅ์ด๋‹ค. ์ด๋ฅผ ๋ถ„์„ํ•˜์—ฌ ์Œ์‹๋ช…, ์˜ต์…˜๋ช…, ์ˆ˜๋Ÿ‰์„ ์ถ”์ถœํ•˜์—ฌ ๊ณ ๊ฐ์˜ ์˜๋„๋ฅผ ์ดํ•ดํ•˜๊ณ ์ž ํ•œ๋‹ค.\n๋ถ„์„ ๊ฒฐ๊ณผ๋ฅผ ์™„์„ฑํ•ด์ฃผ๊ธฐ ๋ฐ”๋ž€๋‹ค.\n\n### ๋ช…๋ น: ์‚ฌ์ฒœํƒ•์ˆ˜์œก ๊ณฑ๋ฐฐ๊ธฐ ์ฃผ๋ฌธํ•˜๊ณ ์š”, ์ƒค์›Œํฌ๋ฆผ์น˜ํ‚จ๋„ ํ•˜๋‚˜ ์ถ”๊ฐ€ํ•ด์ฃผ์„ธ์š”. ### ์‘๋‹ต:\n',
  'output': '- ๋ถ„์„ ๊ฒฐ๊ณผ 0: ์Œ์‹๋ช…:์‚ฌ์ฒœํƒ•์ˆ˜์œก,์˜ต์…˜:๊ณฑ๋ฐฐ๊ธฐ\n- ๋ถ„์„ ๊ฒฐ๊ณผ 1: ์Œ์‹๋ช…:์ƒค์›Œํฌ๋ฆผ์น˜ํ‚จ,์ˆ˜๋Ÿ‰:ํ•˜๋‚˜'
}

```

## Evaluation

"The evaluation dataset comprises 3,004 examples, each consisting of a pair: a 'food-order sentence' and its corresponding 'analysis result' as a reference."

The BLEU scores on the dataset are as follows.

| field | llama-2 model | midm model |
|---|---|---|
| score | 93.323054 | 93.878258 |
| counts | [81382, 76854, 72280, 67869] | [81616, 77246, 72840, 68586] |
| totals | [84327, 81323, 78319, 75315] | [84376, 81372, 78368, 75364] |
| precisions | [96.51, 94.5, 92.29, 90.11] | [96.73, 94.93, 92.95, 91.01] |
| bp | 1.0 | 1.0 |
| sys_len | 84327 | 84376 |
| ref_len | 84124 | 84124 |

The llama-2 column refers to the result of [jangmin/merged-llama2-7b-chat-hf-food-order-understanding-30K](https://huggingface.co/jangmin/merged-llama2-7b-chat-hf-food-order-understanding-30K), which was fine-tuned from llama-2-7b-chat-hf.
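
The reported fields (counts, totals, precisions, bp, sys_len, ref_len) match the output of `sacrebleu`'s corpus-level BLEU. A minimal sketch of computing such a score, assuming `hypotheses` holds the model's generated analyses and `references` the gold analyses:

``` python
import sacrebleu

# hypotheses: list[str] of generated analyses (assumed name)
# references: list[str] of gold analyses, one per hypothesis (assumed name)
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)                          # corpus BLEU, e.g. 93.88 for the midm model
print(bleu.precisions)                     # 1- to 4-gram precisions
print(bleu.bp, bleu.sys_len, bleu.ref_len) # brevity penalty and lengths
```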

## Note for Pretrained Model

Citation for the pretrained model:

```
@misc{kt-midm,
  title         = {Mi:dm: KT Bilingual (Korean, English) Generative Pre-trained Transformer},
  author        = {KT},
  year          = {2023},
  url           = {https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1},
  howpublished  = {\url{https://genielabs.ai}},
}
```

## Model Card Authors

Jangmin Oh

## Model Card Contact

Jangmin Oh