---
license: cc-by-nc-4.0
language:
- ko
- en
pipeline_tag: text-generation
---
# Mi:dm (**M**indful **I**ntelligence that **D**ialogs, Empathizes, Understands and **M**oves, 믿:음)
Mi:dm is a pre-trained Korean-English language model developed by KT.
It takes text as input and generates text.
## Model Descriptions
### Midm-bitext-S (7B) Hyperparameters
| Hyperparameter | Value |
|:---------------------|--------------:|
| \\(n_{layers}\\) | 32 |
| \\(d_{model}\\) | 4,096 |
| \\(d_{ff}\\) | 10,880 |
| \\(n_{heads}\\) | 32 |
| \\(d_{head}\\) | 128 |
| \\(n_{ctx}\\) | 2,048 |
| \\(n_{vocab}\\) | 72,154 |
| Positional Encoding | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864) |
Given these parameters, loading the model in full precision requires about 30 GB of GPU memory (roughly 7B parameters × 4 bytes). Inference requires additional memory in proportion to the number of input and output tokens.
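Where 30 GB is not available, the footprint can be roughly halved by loading the weights in half precision. A minimal sketch, assuming the checkpoint loads cleanly in `float16` (not verified here):

```python
import torch
from transformers import AutoModelForCausalLM

# Half precision: ~2 bytes per parameter, so roughly 15 GB for a 7B model.
model = AutoModelForCausalLM.from_pretrained(
    "KT-AI/midm-bitext-S-7B-inst-v1",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)
```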
### Architecture
Mi:dm is an auto-regressive language model based on the Transformer architecture. It was supervised fine-tuned (SFT) for refined task performance.
### Tokenizer
The tokenizer was trained with [google sentencepiece](https://github.com/google/sentencepiece). Training was morpheme-aware to account for Korean compound words, and English vocabulary was trained jointly to improve bilingual tokenization performance.
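The snippet below is a minimal sketch for inspecting how the tokenizer splits a mixed Korean/English sentence; the sentence is illustrative only and the exact token split is not guaranteed.

```python
from transformers import AutoTokenizer

# Load the sentencepiece-based tokenizer that ships with the model.
tokenizer = AutoTokenizer.from_pretrained(
    "KT-AI/midm-bitext-S-7B-inst-v1",
    trust_remote_code=True,
)

# Tokenize a mixed Korean/English sentence to see the bilingual split.
print(tokenizer.tokenize("KT가 개발한 language model입니다."))
print(tokenizer("AI란?").input_ids)  # token ids fed to the model
```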
### Prompt Template
```
###System;{System}
###User;{User}
###Midm;
```
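For convenience, the template can be filled in with a small helper. A minimal sketch; `build_prompt` is a hypothetical name, and the newline placement is assumed from the template layout above:

```python
def build_prompt(system: str, user: str) -> str:
    # Hypothetical helper: fills in the Mi:dm prompt template shown above.
    # The model's reply is generated after the trailing "###Midm;" marker.
    return f"###System;{system}\n###User;{user}\n###Midm;"

prompt = build_prompt("You are a helpful assistant.", "AI란?")
```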
### Requirements
The libraries required to run Mi:dm can be installed with the following pip command:
```bash
pip install transformers einops
```
### Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

def main():
    tokenizer = AutoTokenizer.from_pretrained(
        "KT-AI/midm-bitext-S-7B-inst-v1",
        trust_remote_code=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "KT-AI/midm-bitext-S-7B-inst-v1",
        trust_remote_code=True,
    )
    model.cuda()
    model.eval()

    # Prompt following the template above; the user turn means "What is AI?".
    dummy_data = "###User;AI란?\n###Midm;"
    data = tokenizer(dummy_data, return_tensors="pt")

    # Stream tokens to stdout as they are generated.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    pred = model.generate(
        input_ids=data.input_ids[..., :-1].cuda(),  # drop the trailing token added by the tokenizer
        streamer=streamer,
        use_cache=True,
        max_new_tokens=512,  # generate needs a finite token budget
    )
    # Full decoded output (prompt + completion); the streamer has already
    # printed the completion above.
    decoded_text = tokenizer.decode(pred[0], skip_special_tokens=True)

if __name__ == "__main__":
    main()
```
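The example above decodes greedily. For more varied output, standard sampling arguments can be passed to `generate`; the values below are illustrative, not tuned for this model, and assume the same `model`, `tokenizer`, and `data` as above.

```python
# Nucleus sampling instead of greedy decoding; values are illustrative.
pred = model.generate(
    input_ids=data.input_ids[..., :-1].cuda(),
    do_sample=True,       # enable stochastic sampling
    temperature=0.7,      # soften the next-token distribution
    top_p=0.9,            # nucleus (top-p) sampling
    max_new_tokens=256,   # finite generation budget
)
print(tokenizer.decode(pred[0], skip_special_tokens=True))
```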
### Training Data
The Mi:dm-bitext-S model was pre-trained on publicly available Korean/English data. For fine-tuning, we used publicly available and internally constructed data, parts of which were processed or refined.
KT collected public data directly or obtained it under legitimate terms of use. Korean corpus data from AI-HUB (https://www.aihub.or.kr/) and the National Institute of Korean Language's Modu Corpus (https://corpus.korean.go.kr/) were used in the pre-training stage.
No customer data held by KT was used.
### Evaluation Results
TBA
## Limitations
KT tried to remove unethical expressions such as profanity, slang, prejudice, and discrimination from Mi:dm's training data.
Nevertheless, the possibility of the model generating undesirable or inaccurate expressions has not been completely eliminated.
It is the user's responsibility to be aware of these limitations before using this model and to take the measures necessary for its proper use; KT is not responsible for any risks or damages arising from use of this model.
Most of Mi:dm's training data consists of Korean and English. Understanding and generation of other languages are not supported.
## Licence
Mi:dm (Midm-bitext-S) is released under the CC-BY-NC 4.0 license.
Users may retrain part or all of the model, or use only part of it. However, attribution must be given, and the model may not be used for commercial purposes. In addition, redistributions of the model and derivative works based on it must be shared under the same CC-BY-NC 4.0 license.
## Citations
When distributing derivative works based on Mi:dm, please cite the source as follows.
```
@misc{kt-mi:dm,
  title = {Mi:dm: KT Bilingual (Korean,English) Generative Pre-trained Transformer},
  author = {KT},
  year = {2023},
  url = {https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1},
  howpublished = {\url{https://genielabs.ai}},
}
```
## Contacts
We look forward to Mi:dm being used for a variety of research purposes, and we welcome any suggestions for improvement: dschang@kt.com