---
language:
- en
license: mit
library_name: transformers
tags:
- code
base_model:
- google/gemma-1.1-2b-it
datasets:
- kreimben/leetcode_with_youtube_captions
- kreimben/leetcode_user_submissions
widget:
- text: explain about two sum problem. from brute force approach to the most advanced
    algorithms.
  example_title: two sum example
- text: explain about leetcode 72 edit distance. i don't get even the approach.
  example_title: edit distance example
- text: explain about leetcode 139 Word Break. please give me the approach.
  example_title: word break example
inference:
  parameters:
    max_new_tokens: 250
    temperature: 0.3
pipeline_tag: text-generation
---
# CodeMind
## Introduction
CodeMind is a language model that helps with solving coding test problems and with studying for them. It was fine-tuned on captions from LeetCode walkthrough videos and on posts written by users, so that it can give answers somewhat more specialized for coding tests.
## Model Details
- **Model name**: CodeMind
- **Base model**: google/gemma-1.1-2b-it
- **Training language**: English
- **Model size**: 2.51B parameters
## Team
- NLP: 3 members
- SRE: 2 members
## Key Features
- Explains problem types and approaches
- Generates solution code
## Training Data
- [**LeetCode user submissions**](https://huggingface.co/datasets/kreimben/leetcode_user_submissions): Python solutions to a variety of algorithm problems
- [**YouTube captions**](https://huggingface.co/datasets/kreimben/leetcode_with_youtube_captions): explanations and step-by-step guides for LeetCode problems
## Libraries Used
- [transformers](https://github.com/huggingface/transformers): library for natural language processing models
- [datasets](https://github.com/huggingface/datasets): library for processing and managing datasets
- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes): library for optimized operations (used here for 4-bit quantization)
- [peft](https://github.com/huggingface/peft): library for parameter-efficient fine-tuning
- [trl](https://github.com/huggingface/trl): library for tuning language models
- [pandas](https://github.com/pandas-dev/pandas): library for data manipulation
## File Structure
- **dataset/**: contains the dataset files.
- **eval/**: contains the evaluation scripts.
- **fine-tuning/**: contains the notebooks and scripts related to fine-tuning.
  - `gemma-1.1-2b-it peft qlora.ipynb`: notebook with the details of the fine-tuning process.
- **demo.ipynb**: demo notebook with examples of using the model.
- **requirements.txt**: list of project dependencies.
- **utils.py**: utility functions.
## Usage
The model is available through the Hugging Face model hub and can be integrated into applications through the API. Given a coding problem or a programming question, the model generates a related explanation, code snippet, or guide.
```python
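from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kreimben/CodeMind-gemma-2b")
model = AutoModelForCausalLM.from_pretrained("kreimben/CodeMind-gemma-2b")

# The tokenizer returns both input_ids and attention_mask.
inputs = tokenizer("Enter your coding problem or question here", return_tensors="pt")
# Pass the attention mask along with the ids; 250 new tokens matches the
# widget setting in this card's metadata (the library default is much shorter).
outputs = model.generate(**inputs, max_new_tokens=250)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))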
```
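
The training section of this card wraps each example in Gemma-style chat turns, so a prompt in the same shape may work better at inference time than a bare string. The helper below is a minimal sketch under that assumption: `build_inference_prompt` is a hypothetical name, the exact line breaks of the training template are not recoverable from this card, and `<bos>` is left out on the assumption that the tokenizer prepends it.

```python
# Prefix text copied verbatim from the training section of this card.
PREFIX = "Below is an coding test problem. Solve the question."


def build_inference_prompt(question: str) -> str:
    """Wrap a free-form question in Gemma-style chat turns (hypothetical helper)."""
    return (
        f"<start_of_turn>user {PREFIX}\n"
        f"{question.strip()}<end_of_turn>\n"
        # Leave the model turn open so generation continues from here.
        "<start_of_turn>model "
    )


prompt = build_inference_prompt("explain about leetcode 72 edit distance.")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the plain question.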
## Training Process
### Loading the Model and Tokenizer
```python
import os

import torch  # needed for torch.bfloat16 below
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute (QLoRA-style loading).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = 'google/gemma-1.1-2b-it'
token = os.getenv('HF_READ')  # Hugging Face access token read from the environment

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"": 0}, token=token)
model.config.use_cache = False  # the KV cache is not used with gradient checkpointing
model.gradient_checkpointing_enable()

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
```
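
To see why 4-bit loading matters for a model of this size, the arithmetic can be sketched as below. This is a rough lower bound, not a measurement: it assumes every one of the 2.51B parameters is stored at the given width, whereas in practice only the linear layers are quantized and the quantization constants, activations, and optimizer state add overhead. `weight_gib` is a hypothetical helper.

```python
PARAMS = 2.51e9  # model size stated in this card


def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in GiB, at a given bit width."""
    return num_params * bits_per_param / 8 / 2**30


bf16_gib = weight_gib(PARAMS, 16)
nf4_gib = weight_gib(PARAMS, 4)

print(f"bf16 weights : {bf16_gib:.2f} GiB")
print(f"4-bit weights: {nf4_gib:.2f} GiB")
print(f"reduction    : {bf16_gib / nf4_gib:.0f}x")
```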
### LoRA Configuration and Model Preparation
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import bitsandbytes as bnb

model = prepare_model_for_kbit_training(model)


def find_all_linear_names(model):
    # Collect the leaf names of all 4-bit linear layers to use as LoRA targets.
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    # The output head is excluded from the LoRA targets.
    if 'lm_head' in lora_module_names:
        lora_module_names.remove('lm_head')
    return list(lora_module_names)


modules = find_all_linear_names(model)
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
```
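
The `r` and `lora_alpha` values above determine two things that are easy to work out by hand: how many trainable parameters each adapted layer gains, and how strongly the low-rank update is scaled. The sketch below shows that arithmetic for a single layer. The 2048 x 2048 layer shape is a hypothetical example rather than a value taken from this card, and `lora_param_count` is an illustrative helper.

```python
R, ALPHA = 64, 32  # r and lora_alpha from the LoraConfig above


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Standard LoRA scales the low-rank update B @ A by lora_alpha / r.
scaling = ALPHA / R

# Hypothetical square projection layer, for illustration only.
d_in, d_out = 2048, 2048
added = lora_param_count(d_in, d_out, R)
frozen = d_in * d_out

print(f"scaling factor       : {scaling}")
print(f"added trainable      : {added:,}")
print(f"fraction of the layer: {added / frozen:.4f}")
```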
### Data Preparation
```python
import datasets  # needed for datasets.load_dataset below
import pandas as pd
from datasets import Dataset

submission_dataset = datasets.load_dataset('kreimben/leetcode_user_submissions_only_python', split='train').to_pandas()
submission_dataset = submission_dataset[['title', 'question_hints', 'question_content', 'content']]

captions_dataset = datasets.load_dataset('kreimben/leetcode_with_youtube_captions', split='train').to_pandas()
captions_dataset = captions_dataset[['title', 'question_hints', 'question_content', 'cc_content']]
# Align the caption column name with the submissions schema.
captions_dataset.rename(columns={'cc_content': 'content'}, inplace=True)

dataset = pd.concat([submission_dataset, captions_dataset])
del submission_dataset, captions_dataset
dataset = Dataset.from_pandas(dataset)

GEMMA_2B_IT_MODEL_PREFIX_TEXT = "Below is an coding test problem. Solve the question."


def generate_prompt(data_point):
    # Triple quotes are required for a multi-line f-string.
    # The prompt wording is kept exactly as written in the source.
    return f"""<bos><start_of_turn>user {GEMMA_2B_IT_MODEL_PREFIX_TEXT}
I don't know {data_point['title']} problem. give me the insight or appoach.
this is problem's hint.
{data_point['question_hints']}
here are some content of question.
{data_point['question_content']}<end_of_turn>
<start_of_turn>model {data_point['content']}<end_of_turn><eos>"""


text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)
```
### Training
```python
from trl import SFTTrainer
import transformers
import torch

tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="prompt",
    peft_config=lora_config,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=transformers.TrainingArguments(
        output_dir='out',
        bf16=True,
        max_steps=100,
        warmup_steps=50,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        optim="paged_adamw_8bit",
        logging_steps=20,
        report_to='wandb',
    ),
)
trainer.train()
```
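
With `max_steps=100`, a batch size of 1, and no gradient accumulation, the run above sees 100 training examples on a single device, and half of those steps are warmup. The sketch below reproduces the shape of a linear warmup followed by linear decay in plain Python. It assumes the `TrainingArguments` defaults of a `5e-5` peak learning rate and a linear scheduler, neither of which is set explicitly in this card, and `lr_at` is a hypothetical helper rather than a library function.

```python
PEAK_LR = 5e-5     # assumed: default learning rate, not set in this card
WARMUP_STEPS = 50  # warmup_steps from the TrainingArguments above
MAX_STEPS = 100    # max_steps from the TrainingArguments above


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup then linear decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, MAX_STEPS - step)
    return PEAK_LR * remaining / max(1, MAX_STEPS - WARMUP_STEPS)


# steps x per-device batch size x gradient accumulation steps (single device assumed)
examples_seen = MAX_STEPS * 1 * 1

for step in (0, 25, 50, 75, 100):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")
print(f"examples seen: {examples_seen}")
```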
## Evaluation
The model's performance was evaluated as follows:
| Metric | Value |
|--------------|--------|
| Average | 41.62 |
| ARC | 41.81 |
| HellaSwag | 59.03 |
| MMLU | 37.26 |
| TruthfulQA | 43.45 |
| Winogrande | 59.91 |
| GSM8K | 8.26 |
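
The Average row can be checked against the six benchmark rows. The snippet below is a small sanity check that assumes the average is the unweighted mean of the listed scores, which the table does not state explicitly.

```python
# Benchmark scores copied from the table above.
scores = {
    "ARC": 41.81,
    "HellaSwag": 59.03,
    "MMLU": 37.26,
    "TruthfulQA": 43.45,
    "Winogrande": 59.91,
    "GSM8K": 8.26,
}

# Unweighted mean of the six benchmarks.
average = sum(scores.values()) / len(scores)
print(round(average, 2))
```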
## Limitations and Ethical Considerations
- The model's output is based on its training data and may not always be accurate.
- Model output must be verified before it is used for important decisions or for solving real-world problems.