
Mi:dm (Mindful Intelligence that Dialogs, Empathizes, Understands and Moves, 믿:음)

Mi:dm is a pre-trained Korean-English language model developed by KT. It takes text as input and generates text.

Model Descriptions

Midm-bitext-S (7B) Hyperparameters

Hyperparameter        Value
--------------------  --------------------------------
n_layers              32
d_model               4,096
d_ff                  10,880
n_heads               32
d_head                128
n_ctx                 2,048
n_vocab               72,154
Positional Encoding   Rotary Position Embedding (RoPE)

μœ„ νŒŒλΌλ―Έν„°λ‘œ κ³„μ‚°ν•˜λ©΄, λͺ¨λΈ λ‘œλ”©μ—λŠ” μ•½ 30GB의 GPU λ©”λͺ¨λ¦¬κ°€ ν•„μš”ν•©λ‹ˆλ‹€. λͺ¨λΈ μΆ”λ‘ μ—λŠ” μž…μΆœλ ₯ 토큰 μˆ˜μ— λΉ„λ‘€ν•˜μ—¬ μΆ”κ°€ λ©”λͺ¨λ¦¬κ°€ 더 μ†Œμš”λ©λ‹ˆλ‹€.

Architecture

Mi:dm is a Transformer-based auto-regressive language model. It was supervised fine-tuned (SFT) to perform a selected set of tasks.

Tokenizer

The tokenizer is based on Google SentencePiece. It was trained with morpheme-aware segmentation that accounts for Korean compound words, and English vocabulary was included in training to improve bilingual tokenization.
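
A minimal sketch for inspecting the bilingual tokenization (the example sentence is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "KT-AI/midm-bitext-S-7B-inst-v1",
    trust_remote_code=True
)

# Mixed Korean/English input; both scripts are covered by the same
# 72,154-token vocabulary.
print(tokenizer.tokenize("KTκ°€ κ°œλ°œν•œ bilingual language model μž…λ‹ˆλ‹€."))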

Prompt Template

###System;{System}
###User;{User}
###Midm;
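
For example, a fully assembled prompt with an illustrative system message looks like this; the model's answer is generated after the final ###Midm; marker:

###System;You are Mi:dm, a helpful AI assistant.
###User;AIλž€?
###Midm;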

Requirements

The libraries required to run Mi:dm can be installed with the following pip command:

pip install transformers einops

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

def main():
    # trust_remote_code is required because the repository ships custom model code.
    tokenizer = AutoTokenizer.from_pretrained(
        "KT-AI/midm-bitext-S-7B-inst-v1",
        trust_remote_code=True
    )

    model = AutoModelForCausalLM.from_pretrained(
        "KT-AI/midm-bitext-S-7B-inst-v1",
        trust_remote_code=True
    )

    model.cuda()
    model.eval()

    # Build an input following the prompt template above.
    dummy_data = "###User;AIλž€?\n###Midm;"
    data = tokenizer(dummy_data, return_tensors="pt")

    # Stream tokens to stdout as they are generated.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    with torch.no_grad():
        pred = model.generate(
            input_ids=data.input_ids[..., :-1].cuda(),  # drop the trailing special token appended by the tokenizer
            streamer=streamer,
            use_cache=True,
            max_new_tokens=1024  # generate() requires a finite integer bound
        )

    # Full decoded output (the streamer has already printed it incrementally).
    decoded_text = tokenizer.decode(pred[0], skip_special_tokens=True)

if __name__ == "__main__":
    main()
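
Loading the weights in full precision needs the roughly 30 GB noted above. A lower-memory variant, as a minimal sketch assuming bfloat16 weights are acceptable for your use case:

model = AutoModelForCausalLM.from_pretrained(
    "KT-AI/midm-bitext-S-7B-inst-v1",
    torch_dtype=torch.bfloat16,  # roughly halves load-time memory vs. float32
    trust_remote_code=True
)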

Training Data

The Mi:dm-bitext-S model was pre-trained on publicly available Korean/English data. Fine-tuning likewise used publicly available or internally constructed data, parts of which were further processed or refined. KT collected the public data directly or obtained it under lawful terms of use. Korean corpus data from AI-HUB (https://www.aihub.or.kr/) and the National Institute of Korean Language's Modu Corpus (https://corpus.korean.go.kr/) were used in the pre-training stage.

We did not use any customer data held by KT.

Evaluation Results

TBA

Limitations

KT has tried to remove unethical expressions such as profanity, slurs, prejudice, and discrimination from the Mi:dm training data. Nevertheless, the possibility that the model generates such undesirable expressions or inaccurate statements of fact has not been completely eliminated. It is the user's responsibility to be aware of these limitations before using this model and to take the measures necessary for proper use; KT accepts no liability for any risks or damages arising from use of this model.

Most of Mi:dm's training data consists of Korean and English. Understanding and generation of other languages are not supported.

Licence

Mi:dm λͺ¨λΈ (Midm-bitext-S) 은 CC-BY-NC 4.0 λΌμ΄μ„ μŠ€ ν•˜μ— κ³΅κ°œλ˜μ–΄ μžˆμŠ΅λ‹ˆλ‹€. μ‚¬μš©μžλŠ” λ³Έ λͺ¨λΈμ˜ 일뢀 ν˜Ήμ€ 전체λ₯Ό μž¬ν•™μŠ΅ν•˜κ±°λ‚˜ μΌλΆ€λ§Œμ„ μ΄μš©ν•˜λŠ” 것이 κ°€λŠ₯ν•©λ‹ˆλ‹€. λ‹€λ§Œ λ°˜λ“œμ‹œ μ €μž‘μžλ₯Ό ν‘œμ‹œν•˜μ—¬μ•Ό ν•˜λ©°, 영리 λͺ©μ μœΌλ‘œ μ΄μš©ν•  수 μ—†μŠ΅λ‹ˆλ‹€. λ˜ν•œ λ³Έ λͺ¨λΈμ„ μž¬λ°°ν¬ν•˜κ±°λ‚˜ λ³Έ λͺ¨λΈμ˜ 2μ°¨μ μ €μž‘λ¬Όμ„ μž‘μ„±ν•˜μ—¬ κ³΅μœ ν•  λ•ŒλŠ” λ³Έ λͺ¨λΈκ³Ό λ™μΌν•œ CC-BY-NC 4.0 λΌμ΄μ„ μŠ€λ₯Ό μ μš©ν•˜μ—¬μ•Ό ν•©λ‹ˆλ‹€.

Mi:dm (Midm-bitext-S) is released under the CC-BY-NC 4.0 license. Users can retrain part or all of this model or use only part of it. However, the author must be indicated and cannot be used for commercial purposes. Additionally, when sharing secondary works using this model, they must be distributed under the same CC-BY-NC 4.0 license.

Citations

Mi:dm을 μ΄μš©ν•œ 2μ°¨ μ €μž‘λ¬Όμ„ 배포할 경우 μ•„λž˜ λ‚΄μš©μ„ μΈμš©ν•˜μ—¬ 좜처λ₯Ό λͺ…μ‹œν•΄μ•Ό ν•©λ‹ˆλ‹€.

When distributing secondary works using Mi:dm, the source must be indicated by citing the content below.

@misc{kt-mi:dm,
  title         = {Mi:dm: KT Bilingual (Korean,English) Generative Pre-trained Transformer},
  author        = {KT},
  year          = {2023},
  url           = {https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1},
  howpublished  = {\url{https://genielabs.ai}},
}

Contacts

λ³Έ λͺ¨λΈμ˜ λ‹€μ–‘ν•œ 연ꡬ λͺ©μ μ˜ ν™œμš©κ³Ό κ°œμ„  μ˜κ²¬μ„ κΈ°λŒ€ ν•©λ‹ˆλ‹€. dschang@kt.com

We look forward to receiving any suggestions for improvement. dschang@kt.com
