---
language:
- en
license: mit
library_name: transformers
tags:
- code
base_model:
- google/gemma-1.1-2b-it
datasets:
- kreimben/leetcode_with_youtube_captions
- kreimben/leetcode_user_submissions
widget:
- text: explain about two sum problem. from brute force approach to the most advanced
    algorithms.
  example_title: two sum example
- text: explain about leetcode 72 edit distance. i don't get even the approach.
  example_title: edit distance example
- text: explain about leetcode 139 Word Break. please give me the approach.
  example_title: word break example
inference:
  parameters:
    max_new_tokens: 250
    temperature: 0.3
pipeline_tag: text-generation
---

# CodeMind

## Introduction
CodeMind is a language model that helps users solve coding-test problems and supports their study. It was fine-tuned on caption transcripts of LeetCode walkthrough videos and on user-posted solutions, so it can produce answers that are more specialized for coding tests.

## ๋ชจ๋ธ ์„ธ๋ถ€ ์ •๋ณด
- **๋ชจ๋ธ ์ด๋ฆ„**: CodeMind
- **๊ธฐ๋ณธ ๋ชจ๋ธ**: google/gemma-1.1-2b-it
- **ํ›ˆ๋ จ ์–ธ์–ด**: ์˜์–ด
- **๋ชจ๋ธ ํฌ๊ธฐ**: 2.51B ํŒŒ๋ผ๋ฏธํ„ฐ

## ํŒ€์› ๊ตฌ์„ฑ
- NLP 3๋ช…
- SRE 2๋ช…

## Key Features
- Explains problem types and solution approaches
- Generates solution code

## ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ
- [**LeetCode ์‚ฌ์šฉ์ž ์ œ์ถœ๋ฌผ**](https://huggingface.co/datasets/kreimben/leetcode_user_submissions): ๋‹ค์–‘ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฌธ์ œ์˜ ํŒŒ์ด์ฌ ์†”๋ฃจ์…˜
- [**์œ ํŠœ๋ธŒ ์บก์…˜**](https://huggingface.co/datasets/kreimben/leetcode_with_youtube_captions): LeetCode ๋ฌธ์ œ์— ๋Œ€ํ•œ ์„ค๋ช… ๋ฐ ๋‹จ๊ณ„๋ณ„ ๊ฐ€์ด๋“œ
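To get a feel for the training distribution, both datasets can be inspected directly from the Hub. A minimal sketch (the column names match those used in the data-preparation step below):

```python
from datasets import load_dataset

# Peek at one caption-transcript example.
captions = load_dataset("kreimben/leetcode_with_youtube_captions", split="train")
print(captions.column_names)  # includes title, question_hints, question_content, cc_content
print(captions[0]["title"])
print(captions[0]["question_content"][:300])
```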

## ์‚ฌ์šฉ๋œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
- [transformers](https://github.com/huggingface/transformers): ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ์„ ์œ„ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
- [datasets](https://github.com/huggingface/datasets): ๋ฐ์ดํ„ฐ์…‹ ์ฒ˜๋ฆฌ ๋ฐ ๊ด€๋ฆฌ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes): ์ตœ์ ํ™”๋œ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
- [peft](https://github.com/huggingface/peft): ํŒŒ์ธ ํŠœ๋‹์„ ์œ„ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
- [trl](https://github.com/huggingface/trl): ์–ธ์–ด ๋ชจ๋ธ ํŠœ๋‹์„ ์œ„ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
- [pandas](https://github.com/pandas-dev/pandas): ๋ฐ์ดํ„ฐ ์กฐ์ž‘์„ ์œ„ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ

## ํŒŒ์ผ ๊ตฌ์กฐ
- **dataset/**: ๋ฐ์ดํ„ฐ์…‹ ํŒŒ์ผ์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.
- **eval/**: ํ‰๊ฐ€ ์Šคํฌ๋ฆฝํŠธ๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.
- **fine-tuning/**: fine tuning ๊ด€๋ จ ๋…ธํŠธ๋ถ ๋ฐ ์Šคํฌ๋ฆฝํŠธ๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.
  - `gemma-1.1-2b-it peft qlora.ipynb`: fine tuning ๊ณผ์ •์— ๋Œ€ํ•œ ์„ธ๋ถ€ ์‚ฌํ•ญ์ด ํฌํ•จ๋œ ๋…ธํŠธ๋ถ์ž…๋‹ˆ๋‹ค.
- **demo.ipynb**: ๋ฐ๋ชจ ๋…ธํŠธ๋ถ์œผ๋กœ ๋ชจ๋ธ ์‚ฌ์šฉ ์˜ˆ์ œ๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
- **requirements.txt**: ํ”„๋กœ์ ํŠธ ์˜์กด์„ฑ ๋ชฉ๋ก์ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
- **utils.py**: ์œ ํ‹ธ๋ฆฌํ‹ฐ ํ•จ์ˆ˜๋“ค์ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

## Usage
The model is available on the Hugging Face Hub and can be integrated into applications through its API. Given a coding problem or a programming-related question, it generates an explanation, a code snippet, or step-by-step guidance.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kreimben/CodeMind-gemma-2b")
model = AutoModelForCausalLM.from_pretrained("kreimben/CodeMind-gemma-2b")

# Enter a coding problem or question as the prompt.
inputs = tokenizer("explain about two sum problem. from brute force approach to the most advanced algorithms.", return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=250)  # matches this card's inference settings
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
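Alternatively, the high-level `pipeline` API wraps tokenization and generation in one call. A minimal sketch, with sampling settings mirroring the inference parameters declared in this card's metadata:

```python
from transformers import pipeline

# Text-generation pipeline over the fine-tuned checkpoint.
generator = pipeline("text-generation", model="kreimben/CodeMind-gemma-2b")

result = generator(
    "explain about leetcode 139 Word Break. please give me the approach.",
    max_new_tokens=250,  # from the card's inference parameters
    do_sample=True,
    temperature=0.3,     # from the card's inference parameters
)
print(result[0]["generated_text"])
```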

## Training Process

### Loading the Model and Tokenizer
```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization (QLoRA) with bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = 'google/gemma-1.1-2b-it'
token = os.getenv('HF_READ')  # Hugging Face read token (Gemma is a gated model)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"": 0}, token=token)
model.config.use_cache = False           # caching is incompatible with gradient checkpointing
model.gradient_checkpointing_enable()    # trade compute for training memory

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right'         # right-pad for causal-LM fine-tuning
tokenizer.pad_token = tokenizer.eos_token
```

### LoRA Configuration and Model Preparation
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import bitsandbytes as bnb

model = prepare_model_for_kbit_training(model)

def find_all_linear_names(model):
    """Collect the names of every 4-bit linear module so LoRA can target them all."""
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    lora_module_names.discard('lm_head')  # never adapt the output head
    return list(lora_module_names)

modules = find_all_linear_names(model)
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
```
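As a sanity check that only the adapter weights are trainable at this point, PEFT models can report their parameter counts; this quick check is not part of the original notebook:

```python
# Only the LoRA adapter parameters should be marked trainable here.
model.print_trainable_parameters()
```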

### ๋ฐ์ดํ„ฐ ์ค€๋น„
```python
import pandas as pd
from datasets import Dataset

submission_dataset = datasets.load_dataset('kreimben/leetcode_user_submissions_only_python', split='train').to_pandas()
submission_dataset = submission_dataset[['title', 'question_hints', 'question_content', 'content']]
captions_dataset = datasets.load_dataset('kreimben/leetcode_with_youtube_captions', split='train').to_pandas()
captions_dataset = captions_dataset[['title', 'question_hints', 'question_content', 'cc_content']]
captions_dataset.rename(columns={'cc_content': 'content'}, inplace=True)

dataset = pd.concat([submission_dataset, captions_dataset])
del submission_dataset, captions_dataset

dataset = Dataset.from_pandas(dataset)
GEMMA_2B_IT_MODEL_PREFIX_TEXT = "Below is an coding test problem. Solve the question."

def generate_prompt(data_point):
    # Gemma-it turn format: the user turn carries the problem, the model turn the answer.
    return f"""<bos><start_of_turn>user {GEMMA_2B_IT_MODEL_PREFIX_TEXT}

I don't know {data_point['title']} problem. give me the insight or appoach.

this is problem's hint.
{data_point['question_hints']}

here are some content of question.
{data_point['question_content']}<end_of_turn>
<start_of_turn>model {data_point['content']}<end_of_turn><eos>"""

text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)
```
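Printing one assembled example is a quick way to verify the prompt template before training (illustrative only):

```python
# Inspect the first assembled training prompt.
print(dataset[0]["prompt"][:400])
```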

### Training
```python
from trl import SFTTrainer
import transformers
import torch

tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="prompt",  # train on the assembled prompt column
    peft_config=lora_config,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM, no masking
    args=transformers.TrainingArguments(
        output_dir='out',
        bf16=True,
        max_steps=100,
        warmup_steps=50,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        optim="paged_adamw_8bit",
        logging_steps=20,
        report_to='wandb',
    ),
)

trainer.train()
```
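After training, the LoRA adapter can be saved separately and, if needed, merged into the base model for standalone inference. A sketch using standard PEFT calls; the output paths are illustrative, and merging a 4-bit-quantized base may require reloading it in full precision first:

```python
# Save only the small LoRA adapter weights.
trainer.model.save_pretrained("out/adapter")

# Optionally merge the adapter into the base weights for deployment.
merged = trainer.model.merge_and_unload()
merged.save_pretrained("out/merged")
tokenizer.save_pretrained("out/merged")
```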

## Evaluation
The model's performance was evaluated as follows:

| Metric       | Value  |
|--------------|--------|
| Average      | 41.62  |
| ARC          | 41.81  |
| HellaSwag    | 59.03  |
| MMLU         | 37.26  |
| TruthfulQA   | 43.45  |
| Winogrande   | 59.91  |
| GSM8K        | 8.26   |

## Limitations and Ethical Considerations
- The model's output is based on its training data and may not always be accurate.
- Always verify the model's output before relying on it for important decisions or real-world problem solving.