---

# CodeMind

## Introduction

`CodeMind-gemma-2b` is a language model that helps solve coding test problems and supports programming education.
It draws on LeetCode solution write-ups and related YouTube captions to provide explanations and code examples for problem solving.

## Model Details

- **Model name**: CodeMind
- **Base model**: google/gemma-2b
- **Language**: English
- **Model size**: 2.51B parameters
- **License**: MIT

## Key Features

- Solving coding test problems
- Explaining programming concepts
- Generating related code snippets

## Training Data

- [**LeetCode user submissions**](https://huggingface.co/datasets/kreimben/leetcode_user_submissions): Python solutions to a wide range of algorithm problems
- [**YouTube captions**](https://huggingface.co/datasets/kreimben/leetcode_with_youtube_captions): explanations and step-by-step guides for LeetCode problems

## Libraries Used

- [transformers](https://github.com/huggingface/transformers): library for natural language processing models
- [datasets](https://github.com/huggingface/datasets): library for dataset processing and management
- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes): library for optimized (quantized) operations
- [peft](https://github.com/huggingface/peft): library for parameter-efficient fine-tuning
- [trl](https://github.com/huggingface/trl): library for language model fine-tuning
- [pandas](https://github.com/pandas-dev/pandas): library for data manipulation

## File Structure

- **dataset/**: dataset files.
- **eval/**: evaluation scripts.
- **fine-tuning/**: notebooks and scripts related to fine-tuning.
  - `gemma-1.1-2b-it peft qlora.ipynb`: notebook with the details of the fine-tuning process.
- **demo.ipynb**: demo notebook with model usage examples.
- **requirements.txt**: list of project dependencies.
- **utils.py**: utility functions.

## Usage

The model is available through the Hugging Face model hub and can be integrated into your own applications via the provided API. Given a coding problem or a programming-related question, the model generates relevant explanations, code snippets, or guidance.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kreimben/CodeMind-gemma-2b")
model = AutoModelForCausalLM.from_pretrained("kreimben/CodeMind-gemma-2b")

inputs = tokenizer("Enter your coding problem or question here", return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

## Training Process

### Load the Model and Tokenizer

```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit NF4 quantization to fit on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = 'google/gemma-1.1-2b-it'
token = os.getenv('HF_READ')

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map={"": 0}, token=token
)
model.config.use_cache = False
model.gradient_checkpointing_enable()

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
```

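For intuition on why 4-bit loading matters here, a rough weight-only back-of-envelope sketch (the 2.51B parameter count comes from the model details above; the bytes-per-parameter figures are standard assumptions for fp32, bfloat16, and NF4 storage, ignoring activations, optimizer state, and quantization scale overhead):

```python
# Approximate GPU memory needed just to hold the weights at each precision.
PARAMS = 2.51e9

fp32_gb = PARAMS * 4 / 1024**3    # full precision: 4 bytes per weight
bf16_gb = PARAMS * 2 / 1024**3    # bfloat16: 2 bytes per weight
nf4_gb = PARAMS * 0.5 / 1024**3   # NF4: 4 bits per weight

print(f"fp32 ~{fp32_gb:.1f} GB, bf16 ~{bf16_gb:.1f} GB, nf4 ~{nf4_gb:.1f} GB")
```

At roughly 1.2 GB for NF4 weights versus ~9 GB for fp32, the quantized model leaves room for gradients and optimizer state on a consumer GPU.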
### LoRA Configuration and Model Preparation

```python
import bitsandbytes as bnb
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

# Collect the names of all 4-bit linear layers so LoRA can target them,
# excluding the output head.
def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

modules = find_all_linear_names(model)
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
```

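As a rough sketch of why this is parameter-efficient: for a linear layer with weight shape `d_out × d_in`, LoRA trains two low-rank factors of shapes `d_out × r` and `r × d_in` instead of a dense update. The layer shape below is a hypothetical example, not read from the Gemma config; the rank matches the `LoraConfig` above.

```python
r = 64
d_in, d_out = 2048, 2048  # hypothetical linear-layer shape

full_update = d_in * d_out        # parameters in a dense weight update
lora_update = r * (d_in + d_out)  # parameters LoRA actually trains

print(full_update, lora_update)
```

For this shape LoRA trains 16x fewer parameters per targeted layer, and the frozen base weights stay in 4-bit.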
### Data Preparation

```python
import datasets
import pandas as pd
from datasets import Dataset

submission_dataset = datasets.load_dataset('kreimben/leetcode_user_submissions_only_python', split='train').to_pandas()
submission_dataset = submission_dataset[['title', 'question_hints', 'question_content', 'content']]
captions_dataset = datasets.load_dataset('kreimben/leetcode_with_youtube_captions', split='train').to_pandas()
captions_dataset = captions_dataset[['title', 'question_hints', 'question_content', 'cc_content']]
captions_dataset.rename(columns={'cc_content': 'content'}, inplace=True)

dataset = pd.concat([submission_dataset, captions_dataset])
del submission_dataset, captions_dataset

dataset = Dataset.from_pandas(dataset)
GEMMA_2B_IT_MODEL_PREFIX_TEXT = "Below is an coding test problem. Solve the question."

# Prompt text is kept verbatim as used in training.
def generate_prompt(data_point):
    return f"""<bos><start_of_turn>user {GEMMA_2B_IT_MODEL_PREFIX_TEXT}

I don't know {data_point['title']} problem. give me the insight or appoach.

this is problem's hint.
{data_point['question_hints']}

here are some content of question.
{data_point['question_content']}<end_of_turn>
<start_of_turn>model {data_point['content']}<end_of_turn><eos>"""

text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)
```

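The concatenation step above works because renaming `cc_content` gives both sources the same schema. A miniature standalone sketch with made-up rows (the field values are purely illustrative):

```python
import pandas as pd

# Tiny stand-ins for the two real datasets.
submissions = pd.DataFrame([{
    'title': 'Two Sum', 'question_hints': 'Use a hash map.',
    'question_content': '...', 'content': 'def twoSum(nums, target): ...',
}])
captions = pd.DataFrame([{
    'title': 'Two Sum', 'question_hints': 'Use a hash map.',
    'question_content': '...', 'cc_content': 'First, notice that ...',
}])

# Align the column name, then stack both sources into one table.
captions = captions.rename(columns={'cc_content': 'content'})
combined = pd.concat([submissions, captions], ignore_index=True)

print(list(combined.columns), len(combined))
```

After the rename, every row ends up with the same four columns regardless of which source it came from.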
### Training

```python
import torch
import transformers
from trl import SFTTrainer

tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="prompt",
    peft_config=lora_config,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=transformers.TrainingArguments(
        output_dir='out',
        bf16=True,
        max_steps=100,
        warmup_steps=50,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        optim="paged_adamw_8bit",
        logging_steps=20,
        report_to='wandb',
    ),
)

trainer.train()
```

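The step-based arguments above fix how much data the run actually sees, independent of dataset size (a quick sanity check; the numbers come from the `TrainingArguments` above, and a single GPU is assumed since `device_map` pins everything to device 0):

```python
max_steps = 100
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
num_gpus = 1  # assumption: single-GPU run

# One optimizer step consumes this many samples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
samples_seen = max_steps * effective_batch
print(effective_batch, samples_seen)
```

With an effective batch of 1 and 100 steps, this is a short demonstration run rather than a full pass over the combined dataset.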
## Evaluation

The model's performance was evaluated as follows:

| Metric     | Value |
|------------|-------|
| Average    | 41.62 |
| ARC        | 41.81 |
| HellaSwag  | 59.03 |
| MMLU       | 37.26 |
| TruthfulQA | 43.45 |
| Winogrande | 59.91 |
| GSM8K      |  8.26 |

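The reported Average is the arithmetic mean of the six benchmark scores, which can be checked directly:

```python
scores = {
    'ARC': 41.81, 'HellaSwag': 59.03, 'MMLU': 37.26,
    'TruthfulQA': 43.45, 'Winogrande': 59.91, 'GSM8K': 8.26,
}
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 41.62
```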
## Limitations and Ethical Considerations

- The model's outputs are based on patterns in its training data and may not always be accurate.
- Always validate and test the model's outputs before relying on them for important decisions or real-world problem solving.