
Model Card for jangmin/merged-midm-7B-food-order-understanding-30K

The model is a fine-tuned version of KT-AI/midm-bitext-S-7B-inst-v1, a Korean large language model.

The purpose of the model is to analyze any "food order sentence" and extract product information from it.

For example, let's assume an ordering sentence:

์—ฌ๊ธฐ์š” ์ถ˜์ฒœ๋‹ญ๊ฐˆ๋น„ 4์ธ๋ถ„ํ•˜๊ณ ์š”. ๋ผ๋ฉด์‚ฌ๋ฆฌ ์ถ”๊ฐ€ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ์ฝœ๋ผ 300ml ๋‘์บ”์ฃผ์„ธ์š”. ("Excuse me, four servings of Chuncheon dakgalbi, please. I'll add ramen noodles. And two 300ml cans of cola.")

Then the model is expected to generate product information like the following (a sketch for parsing this format into dicts appears after the list):

- ๋ถ„์„ ๊ฒฐ๊ณผ 0: ์Œ์‹๋ช…:์ถ˜์ฒœ๋‹ญ๊ฐˆ๋น„, ์ˆ˜๋Ÿ‰:4์ธ๋ถ„
- ๋ถ„์„ ๊ฒฐ๊ณผ 1: ์Œ์‹๋ช…:๋ผ๋ฉด์‚ฌ๋ฆฌ
- ๋ถ„์„ ๊ฒฐ๊ณผ 2: ์Œ์‹๋ช…:์ฝœ๋ผ, ์˜ต์…˜:300ml, ์ˆ˜๋Ÿ‰:๋‘์บ”

Model Details

Model Description

The model is KT's Mi:dm 7B (midm-bitext-S-7B-inst-v1) fine-tuned to extract the food name, option, and quantity from Korean food-order sentences, trained on 30K GPT-4-generated examples (see Training Details below).

Bias, Risks, and Limitations

The current model was developed by using the GPT-4 API to generate a dataset of order sentences, and it has been fine-tuned on this dataset. Please note that we do not assume any responsibility for risks or damages caused by this model.

How to Get Started with the Model

This is a simple example of how to use the model. To load the fine-tuned model in INT4, specify `load_in_4bit=True` instead of `load_in_8bit=True`.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = 'jangmin/merged-midm-7B-food-order-understanding-30K'

prompt_template = """###System;{System}
###User;{User}
###Midm;"""

# System message (Korean): "You are an agent that first analyzes the order
# sentence entered by the user. From it, you must extract, in order, the
# food names, option names, and quantities that make up the order."
default_system_msg = (
    "๋„ˆ๋Š” ๋จผ์ € ์‚ฌ์šฉ์ž๊ฐ€ ์ž…๋ ฅํ•œ ์ฃผ๋ฌธ ๋ฌธ์žฅ์„ ๋ถ„์„ํ•˜๋Š” ์—์ด์ „ํŠธ์ด๋‹ค. ์ด๋กœ๋ถ€ํ„ฐ ์ฃผ๋ฌธ์„ ๊ตฌ์„ฑํ•˜๋Š” ์Œ์‹๋ช…, ์˜ต์…˜๋ช…, ์ˆ˜๋Ÿ‰์„ ์ฐจ๋ก€๋Œ€๋กœ ์ถ”์ถœํ•ด์•ผ ํ•œ๋‹ค."
)

def wrapper_generate(model, tokenizer, input_prompt, do_stream=False):
    data = tokenizer(input_prompt, return_tensors="pt")
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Drop the trailing special token (presumably the EOS appended by this
    # tokenizer) so generation continues from the prompt.
    input_ids = data.input_ids[..., :-1]
    with torch.no_grad():
        pred = model.generate(
            input_ids=input_ids.cuda(),
            streamer=streamer if do_stream else None,
            use_cache=True,
            max_new_tokens=512,  # must be an int; a generous cap for the short analysis outputs
            do_sample=False,
        )
    decoded_text = tokenizer.batch_decode(pred, skip_special_tokens=True)
    decoded_text = decoded_text[0].replace("<[!newline]>", "\n")
    # Return only the newly generated part, stripping the echoed prompt.
    return decoded_text[len(input_prompt):]

trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,  # use load_in_4bit=True for INT4 instead
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# "One tall iced americano, please. One strawberry smoothie. Also, one cold brew latte."
sentence = "์•„์ด์Šค์•„๋ฉ”๋ฆฌ์นด๋…ธ ํ†จ์‚ฌ์ด์ฆˆ ํ•œ์ž” ํ•˜๊ณ ์š”. ๋”ธ๊ธฐ์Šค๋ฌด๋”” ํ•œ์ž” ์ฃผ์„ธ์š”. ๋˜, ์ฝœ๋“œ๋ธŒ๋ฃจ๋ผ๋–ผ ํ•˜๋‚˜์š”."
analysis = wrapper_generate(
    model=trained_model,
    tokenizer=tokenizer,
    input_prompt=prompt_template.format(System=default_system_msg, User=sentence),
    do_stream=False
)
print(analysis)
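
Note that recent versions of transformers route quantization through a config object rather than the `load_in_8bit`/`load_in_4bit` arguments. A hedged equivalent using BitsAndBytesConfig, assuming a bitsandbytes install, would be:

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)  # or load_in_8bit=True
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)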

Training Details

Training Data

The dataset was generated with the GPT-4 API using a carefully designed prompt. The prompt template was designed to generate example pairs of a food-order sentence and its analysis. In total, 30K examples were generated. Note that it cost about $400 to generate the 30K examples through 3,000 API calls (roughly 10 examples per call, at about $0.13 per call).

Some generated examples are as follows. Each input is a Korean instruction ("The following is an order sentence from a customer at a store. Analyze it to extract the food name, option name, and quantity, and complete the analysis result."), with the order sentence after ### ๋ช…๋ น (command) and the expected analysis after ### ์‘๋‹ต (response):

{
  'input': '๋‹ค์Œ์€ ๋งค์žฅ์—์„œ ๊ณ ๊ฐ์ด ์Œ์‹์„ ์ฃผ๋ฌธํ•˜๋Š” ์ฃผ๋ฌธ ๋ฌธ์žฅ์ด๋‹ค. ์ด๋ฅผ ๋ถ„์„ํ•˜์—ฌ ์Œ์‹๋ช…, ์˜ต์…˜๋ช…, ์ˆ˜๋Ÿ‰์„ ์ถ”์ถœํ•˜์—ฌ ๊ณ ๊ฐ์˜ ์˜๋„๋ฅผ ์ดํ•ดํ•˜๊ณ ์ž ํ•œ๋‹ค.\n๋ถ„์„ ๊ฒฐ๊ณผ๋ฅผ ์™„์„ฑํ•ด์ฃผ๊ธฐ ๋ฐ”๋ž€๋‹ค.\n\n### ๋ช…๋ น: ์ œ์œก๋ณถ์Œ ํ•œ๊ทธ๋ฆ‡ํ•˜๊ณ ์š”, ๋น„๋น”๋ฐฅ ํ•œ๊ทธ๋ฆ‡ ์ถ”๊ฐ€ํ•ด์ฃผ์„ธ์š”. ### ์‘๋‹ต:\n',
  'output': '- ๋ถ„์„ ๊ฒฐ๊ณผ 0: ์Œ์‹๋ช…:์ œ์œก๋ณถ์Œ,์ˆ˜๋Ÿ‰:ํ•œ๊ทธ๋ฆ‡\n- ๋ถ„์„ ๊ฒฐ๊ณผ 1: ์Œ์‹๋ช…:๋น„๋น”๋ฐฅ,์ˆ˜๋Ÿ‰:ํ•œ๊ทธ๋ฆ‡'
},
{
  'input': '๋‹ค์Œ์€ ๋งค์žฅ์—์„œ ๊ณ ๊ฐ์ด ์Œ์‹์„ ์ฃผ๋ฌธํ•˜๋Š” ์ฃผ๋ฌธ ๋ฌธ์žฅ์ด๋‹ค. ์ด๋ฅผ ๋ถ„์„ํ•˜์—ฌ ์Œ์‹๋ช…, ์˜ต์…˜๋ช…, ์ˆ˜๋Ÿ‰์„ ์ถ”์ถœํ•˜์—ฌ ๊ณ ๊ฐ์˜ ์˜๋„๋ฅผ ์ดํ•ดํ•˜๊ณ ์ž ํ•œ๋‹ค.\n๋ถ„์„ ๊ฒฐ๊ณผ๋ฅผ ์™„์„ฑํ•ด์ฃผ๊ธฐ ๋ฐ”๋ž€๋‹ค.\n\n### ๋ช…๋ น: ์‚ฌ์ฒœํƒ•์ˆ˜์œก ๊ณฑ๋ฐฐ๊ธฐ ์ฃผ๋ฌธํ•˜๊ณ ์š”, ์ƒค์›Œํฌ๋ฆผ์น˜ํ‚จ๋„ ํ•˜๋‚˜ ์ถ”๊ฐ€ํ•ด์ฃผ์„ธ์š”. ### ์‘๋‹ต:\n',
  'output': '- ๋ถ„์„ ๊ฒฐ๊ณผ 0: ์Œ์‹๋ช…:์‚ฌ์ฒœํƒ•์ˆ˜์œก,์˜ต์…˜:๊ณฑ๋ฐฐ๊ธฐ\n- ๋ถ„์„ ๊ฒฐ๊ณผ 1: ์Œ์‹๋ช…:์ƒค์›Œํฌ๋ฆผ์น˜ํ‚จ,์ˆ˜๋Ÿ‰:ํ•˜๋‚˜'
}
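
For illustration only, a data-generation loop in this spirit might look like the sketch below; it is not the authors' actual script. GENERATION_PROMPT is a hypothetical stand-in for the carefully designed prompt, and the openai>=1.0 client is assumed.

from openai import OpenAI

client = OpenAI()

GENERATION_PROMPT = "..."  # hypothetical stand-in for the actual prompt template

examples = []
for _ in range(3000):  # 3,000 calls, roughly 10 generated pairs per call
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": GENERATION_PROMPT}],
    )
    examples.append(response.choices[0].message.content)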

Evaluation

"The evaluation dataset comprises 3,004 examples, each consisting of a pair: a 'food-order sentence' and its corresponding 'analysis result' as a reference."

The BLEU scores on the dataset are as follows:

|            | llama-2 model                | midm model                   |
|------------|------------------------------|------------------------------|
| score      | 93.323054                    | 93.878258                    |
| counts     | [81382, 76854, 72280, 67869] | [81616, 77246, 72840, 68586] |
| totals     | [84327, 81323, 78319, 75315] | [84376, 81372, 78368, 75364] |
| precisions | [96.51, 94.5, 92.29, 90.11]  | [96.73, 94.93, 92.95, 91.01] |
| bp         | 1.0                          | 1.0                          |
| sys_len    | 84327                        | 84376                        |
| ref_len    | 84124                        | 84124                        |

"llama-2 model" refers to the result of jangmin/merged-llama2-7b-chat-hf-food-order-understanding-30K, which was fine-tuned on llama-2-7b-chat-hf.
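
The reported fields (counts, totals, precisions, bp, sys_len, ref_len) match the output of sacrebleu's corpus-level BLEU. Assuming `predictions` and `references` are lists of strings, a score of this form can be computed as:

import sacrebleu

# predictions: model outputs; references: gold analysis strings
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(bleu.score, bleu.counts, bleu.totals, bleu.precisions, bleu.bp)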

Note for Pretrained Model

The citation of the pretrained model:

@misc{kt-mi:dm,
  title         = {Mi:dm: KT Bilingual (Korean, English) Generative Pre-trained Transformer},
  author        = {KT},
  year          = {2023},
  url           = {https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1},
  howpublished  = {\url{https://genielabs.ai}},
}

Model Card Authors

Jangmin Oh

Model Card Contact

Jangmin Oh
