---
language:
- ko
pipeline_tag: text-generation
---

# Model Card for jangmin/merged-midm-7B-food-order-understanding-30K

<!-- Provide a quick summary of what the model is/does. -->

This model is a fine-tuned version of the Korean large language model [KT-AI/midm-bitext-S-7B-inst-v1](https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1).

Its purpose is to analyze a "food order sentence" and extract the product information it contains.

For example, given the order sentence:

```
์ฌ๊ธฐ์ ์ถ์ฒ๋ญ๊ฐ๋น 4์ธ๋ถ ํ๊ณ ์. ๋ผ๋ฉด์ฌ๋ฆฌ ์ถ๊ฐํ๊ฒ ์ต๋๋ค. ์ฝ๋ผ 300ml ๋์บ ์ฃผ์ธ์.
```

the model is expected to generate product information like this:

```
- ๋ถ์ ๊ฒฐ๊ณผ 0: ์์๋ช
:์ถ์ฒ๋ญ๊ฐ๋น, ์๋:4์ธ๋ถ
- ๋ถ์ ๊ฒฐ๊ณผ 1: ์์๋ช
:๋ผ๋ฉด์ฌ๋ฆฌ
- ๋ถ์ ๊ฒฐ๊ณผ 2: ์์๋ช
:์ฝ๋ผ, ์ต์
:300ml, ์๋:๋์บ
```

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [Jangmin Oh](https://huggingface.co/jangmin)
- **Model type:** a decoder-only Transformer
- **Language(s) (NLP):** ko
- **License:** CC-BY-NC 4.0, inherited from KT-AI and must be retained
- **Finetuned from model:** [KT-AI/midm-bitext-S-7B-inst-v1](https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1)

## Bias, Risks, and Limitations

The dataset of order sentences was generated with the GPT-4 API, and the model was fine-tuned on it. Please note that we assume no responsibility for risks or damages caused by this model.

## How to Get Started with the Model

Below is a simple usage example.

If you want to load the fine-tuned model in INT4, specify `load_in_4bit=True` instead of `load_in_8bit=True`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = 'jangmin/merged-midm-7B-food-order-understanding-30K'

prompt_template = """###System;{System}
###User;{User}
###Midm;"""

default_system_msg = (
    "๋๋ ๋จผ์  ์ฌ์ฉ์๊ฐ ์
๋ ฅํ ์ฃผ๋ฌธ ๋ฌธ์ฅ์ ๋ถ์ํ๋ ์์ด์ ํธ์ด๋ค. ์ด๋ก๋ถํฐ ์ฃผ๋ฌธ์ ๊ตฌ์ฑํ๋ ์์๋ช
, ์ต์
๋ช
, ์๋์ ์ฐจ๋ก๋๋ก ์ถ์ถํด์ผ ํ๋ค."
)

def wrapper_generate(model, tokenizer, input_prompt, do_stream=False):
    data = tokenizer(input_prompt, return_tensors="pt")
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Drop the trailing token appended by the tokenizer.
    input_ids = data.input_ids[..., :-1]
    with torch.no_grad():
        pred = model.generate(
            input_ids=input_ids.cuda(),
            streamer=streamer if do_stream else None,
            use_cache=True,
            max_new_tokens=512,  # must be an int; adjust the budget as needed
            do_sample=False,
        )
    decoded_text = tokenizer.batch_decode(pred, skip_special_tokens=True)
    decoded_text = decoded_text[0].replace("<[!newline]>", "\n")
    return decoded_text[len(input_prompt):]

trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

sentence = "์์ด์ค์๋ฉ๋ฆฌ์นด๋
ธ ํจ์ฌ์ด์ฆ ํ์  ํ๊ณ ์. ๋ธ๊ธฐ์ค๋ฌด๋ ํ์  ์ฃผ์ธ์. ๋, ์ฝ๋๋ธ๋ฃจ๋ผ๋ผ ํ๋์."
analysis = wrapper_generate(
    model=trained_model,
    tokenizer=tokenizer,
    input_prompt=prompt_template.format(System=default_system_msg, User=sentence),
    do_stream=False,
)
print(analysis)
```

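For the INT4 option mentioned above, the loading call changes as follows (a sketch; the `bitsandbytes` package must be installed, and recent `transformers` versions may prefer passing a `BitsAndBytesConfig` via `quantization_config` instead of the bare flag):

```python
# INT4 variant of the loading call above.
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,  # replaces load_in_8bit=True
    device_map="auto",
    trust_remote_code=True,
)
```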
## Training Details

### Training Data

The dataset was generated by the GPT-4 API with a carefully designed prompt. The prompt template was designed to generate example pairs of a food-order sentence and its analysis. In total, 30K examples were generated, at a cost of about $400 across 3,000 API calls.

Some generated examples are as follows:

```python
{
    'input': '๋ค์์ ๋งค์ฅ์์ ๊ณ ๊ฐ์ด ์์์ ์ฃผ๋ฌธํ๋ ์ฃผ๋ฌธ ๋ฌธ์ฅ์ด๋ค. ์ด๋ฅผ ๋ถ์ํ์ฌ ์์๋ช
, ์ต์
๋ช
, ์๋์ ์ถ์ถํ์ฌ ๊ณ ๊ฐ์ ์๋๋ฅผ ์ดํดํ๊ณ ์ ํ๋ค.\n๋ถ์ ๊ฒฐ๊ณผ๋ฅผ ์์ฑํด์ฃผ๊ธฐ ๋ฐ๋๋ค.\n\n### ๋ช
๋ น: ์ ์ก๋ณถ์ ํ๊ทธ๋ฆํ๊ณ ์, ๋น๋น๋ฐฅ ํ๊ทธ๋ฆ ์ถ๊ฐํด์ฃผ์ธ์. ### ์๋ต:\n',
    'output': '- ๋ถ์ ๊ฒฐ๊ณผ 0: ์์๋ช
:์ ์ก๋ณถ์,์๋:ํ๊ทธ๋ฆ\n- ๋ถ์ ๊ฒฐ๊ณผ 1: ์์๋ช
:๋น๋น๋ฐฅ,์๋:ํ๊ทธ๋ฆ'
},
{
    'input': '๋ค์์ ๋งค์ฅ์์ ๊ณ ๊ฐ์ด ์์์ ์ฃผ๋ฌธํ๋ ์ฃผ๋ฌธ ๋ฌธ์ฅ์ด๋ค. ์ด๋ฅผ ๋ถ์ํ์ฌ ์์๋ช
, ์ต์
๋ช
, ์๋์ ์ถ์ถํ์ฌ ๊ณ ๊ฐ์ ์๋๋ฅผ ์ดํดํ๊ณ ์ ํ๋ค.\n๋ถ์ ๊ฒฐ๊ณผ๋ฅผ ์์ฑํด์ฃผ๊ธฐ ๋ฐ๋๋ค.\n\n### ๋ช
๋ น: ์ฌ์ฒํ์์ก ๊ณฑ๋ฐฐ๊ธฐ ์ฃผ๋ฌธํ๊ณ ์, ์ค์ํฌ๋ฆผ์นํจ๋ ํ๋ ์ถ๊ฐํด์ฃผ์ธ์. ### ์๋ต:\n',
    'output': '- ๋ถ์ ๊ฒฐ๊ณผ 0: ์์๋ช
:์ฌ์ฒํ์์ก,์ต์
:๊ณฑ๋ฐฐ๊ธฐ\n- ๋ถ์ ๊ฒฐ๊ณผ 1: ์์๋ช
:์ค์ํฌ๋ฆผ์นํจ,์๋:ํ๋'
}
```

## Evaluation

The evaluation dataset comprises 3,004 examples, each a pair of a food-order sentence and its reference analysis result.

The BLEU scores on the dataset are as follows:

| | llama-2 model | midm model |
|---|---|---|
| score | 93.323054 | 93.878258 |
| counts | [81382, 76854, 72280, 67869] | [81616, 77246, 72840, 68586] |
| totals | [84327, 81323, 78319, 75315] | [84376, 81372, 78368, 75364] |
| precisions | [96.51, 94.5, 92.29, 90.11] | [96.73, 94.93, 92.95, 91.01] |
| bp | 1.0 | 1.0 |
| sys_len | 84327 | 84376 |
| ref_len | 84124 | 84124 |

Here, "llama-2 model" refers to [jangmin/merged-llama2-7b-chat-hf-food-order-understanding-30K](https://huggingface.co/jangmin/merged-llama2-7b-chat-hf-food-order-understanding-30K), which was fine-tuned from llama-2-7b-chat-hf.

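The row names in the table (counts, totals, precisions, bp, sys_len, ref_len) match the fields of a sacrebleu corpus-level BLEU score. A sketch of computing such scores, assuming the `sacrebleu` package and hypothetical prediction/reference lists:

```python
import sacrebleu

# Hypothetical lists: one model output and one gold analysis per example.
predictions = ["- ๋ถ์ ๊ฒฐ๊ณผ 0: ์์๋ช
:๋น๋น๋ฐฅ,์๋:ํ๊ทธ๋ฆ"]
references = [["- ๋ถ์ ๊ฒฐ๊ณผ 0: ์์๋ช
:๋น๋น๋ฐฅ,์๋:ํ๊ทธ๋ฆ"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(predictions, references)
print(bleu.score, bleu.precisions, bleu.bp, bleu.sys_len, bleu.ref_len)
```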
## Note for Pretrained Model

The citation of the pretrained model:

```
@misc{kt-mi:dm,
  title = {Mi:dm: KT Bilingual (Korean,English) Generative Pre-trained Transformer},
  author = {KT},
  year = {2023},
  url = {https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1},
  howpublished = {\url{https://genielabs.ai}},
}
```

## Model Card Authors

Jangmin Oh

## Model Card Contact

Jangmin Oh