Mi:dm (Mindful Intelligence that Dialogs, Empathizes, Understands and Moves, λ―Ώ:μ)
Mi:dm is a pre-trained Korean-English language model developed by KT. It takes text as input and generates text.
Model Descriptions
Midm-bitext-S (7B) Hyperparameters
| Hyperparameter | Value |
|---|---|
| Transformer Layers | 32 |
| Hidden Dimension | 4,096 |
| FFN Dimension | 10,880 |
| Attention Heads | 32 |
| Head Dimension | 128 |
| Max Sequence Length | 2,048 |
| Vocabulary Size | 72,154 |
| Positional Encoding | Rotary Position Embedding (RoPE) |
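For orientation, these values are consistent with the model's 7B size. Below is a rough back-of-the-envelope count; the gated FFN and untied input/output embeddings are assumptions for the sake of the estimate, not confirmed details of Mi:dm.

```python
# Rough parameter count from the table above.
# Assumes a gated FFN (three weight matrices) and untied
# input/output embeddings -- assumptions, not confirmed details.
layers, d_model, d_ffn, vocab = 32, 4096, 10880, 72154

attn = 4 * d_model * d_model        # Q, K, V, O projections
ffn = 3 * d_model * d_ffn           # gated FFN: up, gate, down
embed = vocab * d_model             # token embedding matrix

total = layers * (attn + ffn) + 2 * embed  # 2x: input + output embeddings
print(f"{total / 1e9:.2f}B parameters")    # ~7.0B
```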
Loading the model with full-precision parameters requires about 30 GB of GPU memory. Inference requires additional memory in proportion to the number of input and output tokens.
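If 30 GB exceeds your GPU's capacity, loading the weights in half precision roughly halves the footprint. A minimal sketch using the standard `transformers` `torch_dtype` argument follows; whether Mi:dm's custom model code is fully fp16-safe is an assumption worth verifying.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights in float16 (~2 bytes/parameter, roughly 14 GB for 7B)
# instead of float32 (~4 bytes/parameter, roughly 28 GB).
model = AutoModelForCausalLM.from_pretrained(
    "KT-AI/midm-bitext-S-7B-inst-v1",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)
```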
Architecture
Mi:dm is a Transformer-based auto-regressive language model. It was supervised fine-tuned (SFT) to perform instructed tasks.
Tokenizer
The tokenizer is based on Google SentencePiece. It was trained with morpheme-aware segmentation to handle Korean compound words, and English vocabulary was trained jointly to improve bilingual tokenization performance.
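As an illustration, the tokenizer can be inspected directly. The snippet below is a minimal sketch using the repository ID from the Usage section; the exact token splits it prints are not guaranteed.

```python
from transformers import AutoTokenizer

# trust_remote_code is required because Mi:dm ships a custom tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "KT-AI/midm-bitext-S-7B-inst-v1",
    trust_remote_code=True,
)

# Mixed Korean/English input exercises the bilingual vocabulary.
tokens = tokenizer.tokenize("KTμ μΈμ΄λͺ¨λΈ Mi:dm is bilingual.")
print(tokens)
```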
Prompt Template
```
###System;{System}
###User;{User}
###Midm;
```
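For example, a prompt can be assembled from this template as follows; `build_prompt` is an illustrative helper, not part of the released code. The newline-separated layout matches the prompt string used in the Usage section below.

```python
def build_prompt(system: str, user: str) -> str:
    """Fill Mi:dm's prompt template; generation continues after '###Midm;'."""
    return f"###System;{system}\n###User;{user}\n###Midm;"

# Example with an illustrative system message.
prompt = build_prompt("λΉμ μ μΉμ ν AI μ΄μμ€ν΄νΈμ λλ€.", "AIλ?")
```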
Requirements
The libraries required to run Mi:dm can be installed with the following pip command:
```bash
pip install transformers einops
```
Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer


def main():
    tokenizer = AutoTokenizer.from_pretrained(
        "KT-AI/midm-bitext-S-7B-inst-v1",
        trust_remote_code=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "KT-AI/midm-bitext-S-7B-inst-v1",
        trust_remote_code=True,
    )
    model.cuda()
    model.eval()

    # Prompt following the template above; the trailing special token
    # appended by the tokenizer is sliced off before generation.
    dummy_data = "###User;AIλ?\n###Midm;"
    data = tokenizer(dummy_data, return_tensors="pt")

    # Stream generated tokens to stdout as they are produced.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    pred = model.generate(
        input_ids=data.input_ids[..., :-1].cuda(),
        streamer=streamer,
        use_cache=True,
        max_new_tokens=1024,  # generate() expects an int, not float('inf')
    )
    # The streamer already printed the output; decode the full sequence as well.
    decoded_text = tokenizer.decode(pred[0], skip_special_tokens=True)


if __name__ == "__main__":
    main()
```
Training Data
The Mi:dm-bitext-S model was pre-trained on publicly available Korean and English data. For fine-tuning, we used publicly released or internally constructed data, parts of which were further processed or refined. KT either collected the public data directly or obtained it under legitimate terms of use. Korean corpus data from AI-HUB (https://www.aihub.or.kr/) and the Modu Corpus of the National Institute of Korean Language (https://corpus.korean.go.kr/) were used in the pre-training stage.
No customer data held by KT was used.
Evaluation Results
TBA
Limitations
KT has made efforts to remove unethical expressions such as profanity, slang, prejudice, and discrimination from Mi:dm's training data. Nevertheless, the possibility of the model generating undesirable or factually inaccurate output has not been completely eliminated. It is the user's responsibility to be aware of these limitations before using this model and to take the measures necessary for proper use; KT assumes no responsibility for any risks or damages arising from the use of this model.
Most of Mi:dm's training data consists of Korean and English. Understanding and generation of other languages are not supported.
Licence
The Mi:dm model (Midm-bitext-S) is released under the CC-BY-NC 4.0 license. Users may retrain part or all of the model, or use only part of it. However, attribution must be given, and the model may not be used for commercial purposes. In addition, when redistributing the model or sharing derivative works created from it, the same CC-BY-NC 4.0 license must be applied.
Citations
When distributing derivative works that use Mi:dm, please cite the following to indicate the source:
```bibtex
@misc{kt-mi:dm,
  title        = {Mi:dm: KT Bilingual (Korean,English) Generative Pre-trained Transformer},
  author       = {KT},
  year         = {2023},
  url          = {https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1},
  howpublished = {\url{https://genielabs.ai}},
}
```
Contacts
We hope to see this model used for a wide range of research purposes, and we welcome any suggestions for improvement: dschang@kt.com