jangmin committed on
Commit 42155d0
• 1 Parent(s): 6055794

model card is written.

Files changed (1): README.md +154 -1
README.md CHANGED
---
language:
- ko
pipeline_tag: text-generation
---
# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->

The model is a fine-tuned version of the Korean large language model [KT-AI/midm-bitext-S-7B-inst-v1](https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1).

Its purpose is to analyze a "food order sentence" and extract information about the ordered products from it.

For example, assume the following order sentence ("Excuse me, four servings of Chuncheon dakgalbi, please. I'll add extra ramen noodles. And two 300ml cans of cola."):
```
여기요 춘천닭갈비 4인분하고요. 라면사리 추가하겠습니다. 콜라 300ml 두캔주세요.
```
The model is then expected to generate product information (음식명 = food name, 옵션 = option, 수량 = quantity) like:
```
- 분석 결과 0: 음식명:춘천닭갈비, 수량:4인분
- 분석 결과 1: 음식명:라면사리
- 분석 결과 2: 음식명:콜라, 옵션:300ml, 수량:두캔
```
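The analysis lines use a plain, comma-separated `key:value` format, so they are easy to post-process. Below is a minimal sketch of parsing them into dictionaries; the `parse_analysis` helper is hypothetical and not part of the model itself.

```python
import re

# parse_analysis is a hypothetical helper: it turns the model's
# "- 분석 결과 N: key:value, key:value" lines into dictionaries.
def parse_analysis(text):
    records = []
    for line in text.strip().splitlines():
        # Strip the "- 분석 결과 N:" (analysis result N) prefix.
        body = re.sub(r"^- 분석 결과 \d+:\s*", "", line)
        record = {}
        for pair in body.split(","):
            key, _, value = pair.partition(":")
            record[key.strip()] = value.strip()
        records.append(record)
    return records

output = """- 분석 결과 0: 음식명:춘천닭갈비, 수량:4인분
- 분석 결과 1: 음식명:라면사리
- 분석 결과 2: 음식명:콜라, 옵션:300ml, 수량:두캔"""

for rec in parse_analysis(output):
    print(rec)
```

Each record maps 음식명/옵션/수량 keys to their extracted values, one record per ordered product.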

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [Jangmin Oh](https://huggingface.co/jangmin)
- **Model type:** a decoder-only Transformer
- **Language(s) (NLP):** ko
- **License:** CC-BY-NC 4.0, inherited from KT-AI.
- **Finetuned from model:** [KT-AI/midm-bitext-S-7B-inst-v1](https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1)

## Bias, Risks, and Limitations

The model was developed by using the GPT-4 API to generate a dataset of order sentences and then fine-tuning on that dataset. Please note that we do not assume any responsibility for risks or damages caused by this model.

## How to Get Started with the Model

Here is a simple example of how to use the model.
``` python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = 'jangmin/merged-midm-7B-food-order-understanding-30K'

prompt_template = """###System;{System}
###User;{User}
###Midm;"""

default_system_msg = (
    "너는 먼저 사용자가 입력한 주문 문장을 분석하는 에이전트이다. 이로부터 주문을 구성하는 음식명, 옵션명, 수량을 차례대로 추출해야 한다."
)

def wrapper_generate(model, tokenizer, input_prompt, do_stream=False):
    data = tokenizer(input_prompt, return_tensors="pt")
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    input_ids = data.input_ids[..., :-1]  # drop the trailing special token
    with torch.no_grad():
        pred = model.generate(
            input_ids=input_ids.cuda(),
            streamer=streamer if do_stream else None,
            use_cache=True,
            max_new_tokens=512,  # generate() expects an integer bound
            do_sample=False,
        )
    decoded_text = tokenizer.batch_decode(pred, skip_special_tokens=True)
    decoded_text = decoded_text[0].replace("<[!newline]>", "\n")
    return decoded_text[len(input_prompt):]

trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

sentence = "아이스아메리카노 톨사이즈 한잔 하고요. 딸기스무디 한잔 주세요. 또, 콜드브루라떼 하나요."
analysis = wrapper_generate(
    model=trained_model,
    tokenizer=tokenizer,
    input_prompt=prompt_template.format(System=default_system_msg, User=sentence),
    do_stream=True,
)
print(analysis)
```

## Training Details

### Training Data

The dataset was generated by the GPT-4 API with a carefully designed prompt, using a template that produces pairs of a food order sentence and its analysis. In total, 30K examples were generated; this took about 3,000 API calls and cost roughly $400.

Some generated examples are as follows:

``` json
{
    'input': '다음은 매장에서 고객이 음식을 주문하는 주문 문장이다. 이를 분석하여 음식명, 옵션명, 수량을 추출하여 고객의 의도를 이해하고자 한다.\n분석 결과를 완성해주기 바란다.\n\n### 명령: 제육볶음 한그릇하고요, 비빔밥 한그릇 추가해주세요. ### 응답:\n',
    'output': '- 분석 결과 0: 음식명:제육볶음,수량:한그릇\n- 분석 결과 1: 음식명:비빔밥,수량:한그릇'
},
{
    'input': '다음은 매장에서 고객이 음식을 주문하는 주문 문장이다. 이를 분석하여 음식명, 옵션명, 수량을 추출하여 고객의 의도를 이해하고자 한다.\n분석 결과를 완성해주기 바란다.\n\n### 명령: 사천탕수육 곱배기 주문하고요, 샤워크림치킨도 하나 추가해주세요. ### 응답:\n',
    'output': '- 분석 결과 0: 음식명:사천탕수육,옵션:곱배기\n- 분석 결과 1: 음식명:샤워크림치킨,수량:하나'
}
```
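As a quick sanity check on the figures above (30K examples from 3,000 GPT-4 API calls for about $400), each call must have returned around 10 examples at roughly $0.13 per call:

```python
# Back-of-the-envelope check of the dataset-generation figures.
examples, calls, total_cost = 30_000, 3_000, 400  # counts and USD from the text

examples_per_call = examples / calls      # 10.0 examples per call
cost_per_call = total_cost / calls        # about $0.13 per call
cost_per_example = total_cost / examples  # about $0.013 per example

print(examples_per_call, round(cost_per_call, 3), round(cost_per_example, 4))
```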

## Evaluation

The evaluation dataset comprises 3,004 examples, each consisting of a pair: a food-order sentence and its corresponding analysis result as a reference.

The BLEU scores on this dataset are as follows:

| | llama-2 model | midm model |
|---|---|---|
| score | 93.323054 | 93.878258 |
| counts | [81382, 76854, 72280, 67869] | [81616, 77246, 72840, 68586] |
| totals | [84327, 81323, 78319, 75315] | [84376, 81372, 78368, 75364] |
| precisions | [96.51, 94.5, 92.29, 90.11] | [96.73, 94.93, 92.95, 91.01] |
| bp | 1.0 | 1.0 |
| sys_len | 84327 | 84376 |
| ref_len | 84124 | 84124 |

Here, "llama-2 model" refers to [jangmin/merged-llama2-7b-chat-hf-food-order-understanding-30K], which was fine-tuned from llama-2-7b-chat-hf.
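The reported statistics (counts, totals, bp, sys_len, ref_len) follow the usual sacreBLEU layout, so the headline score can be reproduced from them directly: BLEU is the brevity penalty times the geometric mean of the four n-gram precisions. A minimal sketch using only the standard library, with the midm-model numbers from the table:

```python
import math

# sacreBLEU-style statistics for the midm model, copied from the table above.
counts = [81616, 77246, 72840, 68586]   # matched n-gram counts, n = 1..4
totals = [84376, 81372, 78368, 75364]   # candidate n-gram totals, n = 1..4
sys_len, ref_len = 84376, 84124

# BLEU = brevity penalty * geometric mean of the n-gram precisions.
precisions = [c / t for c, t in zip(counts, totals)]
bp = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)
bleu = 100 * bp * math.exp(sum(math.log(p) for p in precisions) / 4)

print(round(bleu, 4))  # matches the reported 93.878258 to rounding
```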
135
+
136
+ ## Note for Pretrained Model
137
+
138
+ The citation of the pretrained model:
139
+
140
+ ```
141
+ @misc{kt-mi:dm,
142
+ title = {Mi:dm: KT Bilingual (Korean,English) Generative Pre-trained Transformer},
143
+ author = {KT},
144
+ year = {2023},
145
+ url = {https://huggingface.co/KT-AT/midm-bitext-S-7B-inst-v1}
146
+ howpublished = {\url{https://genielabs.ai}},
147
+ }
148
+ ```
149
+
150
+ ## Model Card Authors
151
+
152
+ Jangmin Oh
153
+
154
+ ## Model Card Contact
155
+
156
+ Jangmin Oh
157
+
158
+