---
license: cc-by-nc-4.0
language:
- ko
- en
pipeline_tag: text-generation
---
# Mi:dm (**M**indful **I**ntelligence that **D**ialogs, Empathizes, Understands and **M**oves, 믿:음) 

Mi:dm is a pre-trained Korean-English language model developed by KT.
It takes text as input and generates text.


## Model Descriptions

### Midm-bitext-S (7B) Hyperparameters

| Hyperparameter       | Value         |
|:---------------------|--------------:|
| \\(n_{layers}\\)     | 32            |
| \\(d_{model}\\)      | 4,096         |
| \\(d_{ff}\\)         | 10,880        |
| \\(n_{heads}\\)      | 32            |
| \\(d_{head}\\)       | 128           |
| \\(n_{ctx}\\)        | 2,048         |
| \\(n_{vocab}\\)      | 72,154        |
| Positional Encoding  | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864) |

μœ„ νŒŒλΌλ―Έν„°λ‘œ κ³„μ‚°ν•˜λ©΄, λͺ¨λΈ λ‘œλ”©μ—λŠ” μ•½ 30GB의 GPU λ©”λͺ¨λ¦¬κ°€ ν•„μš”ν•©λ‹ˆλ‹€.
λͺ¨λΈ μΆ”λ‘ μ—λŠ” μž…μΆœλ ₯ 토큰 μˆ˜μ— λΉ„λ‘€ν•˜μ—¬ μΆ”κ°€ λ©”λͺ¨λ¦¬κ°€ 더 μ†Œμš”λ©λ‹ˆλ‹€.

### Architecture

Mi:dm is a Transformer-based auto-regressive language model. It was further trained with supervised fine-tuning (SFT) to perform selected tasks.


### Tokenizer

The tokenizer is based on [google sentencepiece](https://github.com/google/sentencepiece). It was trained with morpheme-aware segmentation to handle Korean compound words, and English vocabulary was trained jointly to improve bilingual tokenization quality.
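
As a minimal check (a sketch; it assumes the Hugging Face checkpoint used in the Usage section below), you can inspect how mixed Korean/English text is segmented:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "KT-AI/midm-bitext-S-7B-inst-v1",
    trust_remote_code=True,
)

# Subword pieces for a mixed Korean/English sentence.
print(tokenizer.tokenize("믿:음은 KTκ°€ κ°œλ°œν•œ language model μž…λ‹ˆλ‹€."))
```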


### Prompt Template

```
###System;{System}
###User;{User}
###Midm;
```
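
A small helper for filling in this template (a hypothetical convenience function, not part of the model's API; the `###System` line may be omitted, as the Usage example below does):

```python
from typing import Optional

def build_prompt(user: str, system: Optional[str] = None) -> str:
    """Format a single-turn prompt in the Mi:dm template."""
    parts = []
    if system is not None:
        parts.append(f"###System;{system}")
    parts.append(f"###User;{user}")
    parts.append("###Midm;")
    return "\n".join(parts)

print(build_prompt("AIλž€?"))  # -> "###User;AIλž€?\n###Midm;"
```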


### Requirements

The libraries required to run Mi:dm can be installed with the pip command below.

```bash
pip install transformers einops
```


### Usage 

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer


def main():
    tokenizer = AutoTokenizer.from_pretrained(
        "KT-AI/midm-bitext-S-7B-inst-v1",
        trust_remote_code=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "KT-AI/midm-bitext-S-7B-inst-v1",
        trust_remote_code=True,
    )

    model.cuda()
    model.eval()

    # Single-turn prompt in the Mi:dm template (see "Prompt Template" above).
    dummy_data = "###User;AIλž€?\n###Midm;"
    data = tokenizer(dummy_data, return_tensors="pt")

    # Stream generated tokens to stdout as they are produced.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    with torch.no_grad():
        pred = model.generate(
            # Drop the trailing special token the tokenizer appends, so
            # generation continues directly from "###Midm;".
            input_ids=data.input_ids[..., :-1].cuda(),
            streamer=streamer,
            use_cache=True,
            max_new_tokens=512,  # a finite limit; generation also stops at EOS
        )
    decoded_text = tokenizer.decode(pred[0], skip_special_tokens=True)
    print(decoded_text)


if __name__ == "__main__":
    main()
```
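
If ~30 GB of GPU memory is not available, loading the weights in half precision roughly halves the footprint. This is a sketch using the standard `torch_dtype` argument of `from_pretrained`, not a configuration the model card itself specifies:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "KT-AI/midm-bitext-S-7B-inst-v1",
    trust_remote_code=True,
    torch_dtype=torch.float16,  # ~2 bytes per parameter instead of 4
)
```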

### Training Data

Mi:dm-bitext-S λͺ¨λΈμ€ ν•œκ΅­μ–΄/μ˜μ–΄ 곡개 데이터λ₯Ό μ΄μš©ν•˜μ—¬ 사전 ν•™μŠ΅ν•˜μ˜€μŠ΅λ‹ˆλ‹€. λ―Έμ„Έ μ‘°μ • ν•™μŠ΅μ„ μœ„ν•΄μ„œλ„ κ³΅κ°œλ˜μ—ˆκ±°λ‚˜ 자체 κ΅¬μΆ•ν•œ 데이터λ₯Ό μ΄μš©ν•˜μ˜€μœΌλ©° 이λ₯Ό 일뢀 κ°€κ³΅ν•˜κ±°λ‚˜ λ‹€μ‹œ μ •μ œν•˜λŠ” 과정을 κ±°μ³€μŠ΅λ‹ˆλ‹€. 
KTλŠ” 곡개 데이터λ₯Ό 직접 μˆ˜μ§‘ν•˜κ±°λ‚˜ μ λ²•ν•œ μ‚¬μš© ν—ˆκ°€ 쑰건 ν•˜μ— ν™•λ³΄ν•˜μ˜€μŠ΅λ‹ˆλ‹€. AI-HUB(https://www.aihub.or.kr/) 의 λ§λ­‰μΉ˜ 데이터와 ꡭ립ꡭ어원 λͺ¨λ‘μ˜ λ§λ­‰μΉ˜ 데이터 (https://corpus.korean.go.kr/) λ₯Ό 사전 ν•™μŠ΅ λ‹¨κ³„μ—μ„œ μ΄μš©ν•˜μ˜€μŠ΅λ‹ˆλ‹€.

KTκ°€ λ³΄μœ ν•œ 고객 λ°μ΄ν„°λŠ” μ΄μš©ν•˜μ§€ μ•Šμ•˜μŠ΅λ‹ˆλ‹€. 


The Mi:dm-bitext-S model was pre-trained using Korean/English publicly available data. For fine-tuning, we used also publicly available data and went through some processing or refining. 
KT collected public data directly or obtained it under legal permission conditions. The korean corpus data from AI-HUB (https://www.aihub.or.kr/) and the National Institute of Korean Language (https://corpus.korean.go.kr/) were used in the pre-training stage.

We did not use any customer data held by KT.


### Evaluation Results

TBA

## Limitations

KT has tried to remove unethical expressions such as profanity, slang, prejudice, and discrimination from Mi:dm's training data.
Nevertheless, the possibility of the model producing such undesirable expressions or inaccurate statements has not been completely eliminated.
It is the user's responsibility to be aware of these limitations before using this model and to take the measures needed for proper use; KT accepts no liability for any risks or damages arising from the use of this model.

Most of Mi:dm's training data consists of Korean and English; understanding and generating other languages is not supported.


## Licence

Mi:dm λͺ¨λΈ (Midm-bitext-S) 은 CC-BY-NC 4.0 λΌμ΄μ„ μŠ€ ν•˜μ— κ³΅κ°œλ˜μ–΄ μžˆμŠ΅λ‹ˆλ‹€. 
μ‚¬μš©μžλŠ” λ³Έ λͺ¨λΈμ˜ 일뢀 ν˜Ήμ€ 전체λ₯Ό μž¬ν•™μŠ΅ν•˜κ±°λ‚˜ μΌλΆ€λ§Œμ„ μ΄μš©ν•˜λŠ” 것이 κ°€λŠ₯ν•©λ‹ˆλ‹€. λ‹€λ§Œ λ°˜λ“œμ‹œ μ €μž‘μžλ₯Ό ν‘œμ‹œν•˜μ—¬μ•Ό ν•˜λ©°, 영리 λͺ©μ μœΌλ‘œ μ΄μš©ν•  수 μ—†μŠ΅λ‹ˆλ‹€. λ˜ν•œ λ³Έ λͺ¨λΈμ„ μž¬λ°°ν¬ν•˜κ±°λ‚˜ λ³Έ λͺ¨λΈμ˜ 2μ°¨μ μ €μž‘λ¬Όμ„ μž‘μ„±ν•˜μ—¬ κ³΅μœ ν•  λ•ŒλŠ” λ³Έ λͺ¨λΈκ³Ό λ™μΌν•œ CC-BY-NC 4.0 λΌμ΄μ„ μŠ€λ₯Ό μ μš©ν•˜μ—¬μ•Ό ν•©λ‹ˆλ‹€. 

Mi:dm (Midm-bitext-S) is released under the CC-BY-NC 4.0 license.
Users can retrain part or all of this model or use only part of it. However, the author must be indicated and cannot be used for commercial purposes. Additionally, when sharing secondary works using this model, they must be distributed under the same CC-BY-NC 4.0 license.

## Citations

Mi:dm을 μ΄μš©ν•œ 2μ°¨ μ €μž‘λ¬Όμ„ 배포할 경우 μ•„λž˜ λ‚΄μš©μ„ μΈμš©ν•˜μ—¬ 좜처λ₯Ό λͺ…μ‹œν•΄μ•Ό ν•©λ‹ˆλ‹€.

When distributing secondary works using Mi:dm, the source must be indicated by citing the content below.


```
@misc{kt-mi:dm,
  title         = {Mi:dm: KT Bilingual (Korean, English) Generative Pre-trained Transformer},
  author        = {KT},
  year          = {2023},
  url           = {https://huggingface.co/KT-AI/midm-bitext-S-7B-inst-v1},
  howpublished  = {\url{https://genielabs.ai}},
}
```


## Contacts

λ³Έ λͺ¨λΈμ˜ λ‹€μ–‘ν•œ 연ꡬ λͺ©μ μ˜ ν™œμš©κ³Ό κ°œμ„  μ˜κ²¬μ„ κΈ°λŒ€ ν•©λ‹ˆλ‹€. dschang@kt.com

We look forward to receiving any suggestions for improvement. dschang@kt.com