Omarrran committed · Commit 0904aca · verified · 1 Parent(s): 37916dd

Update README.md

Files changed (1): README.md (+151, -0)
README.md CHANGED
@@ -15,6 +15,33 @@ library_name: adapter-transformers

# Llama-3.2-3B-Instruct

![License](https://img.shields.io/badge/License-Apache%202.0-blue)
![Python](https://img.shields.io/badge/Python-3.8%2B-green)
![Framework](https://img.shields.io/badge/Framework-Unsloth-ff69b4)
![Model](https://img.shields.io/badge/Model-Llama_3.2_3B-orange)

This repository contains code to fine-tune the **Llama-3.2-3B-Instruct** model using Unsloth for efficient training. The model is optimized for conversational tasks and supports 4-bit quantization, LoRA adapters, and GGUF export.

## Model Overview
- **Base Model**: [`Llama-3.2-3B-Instruct`](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct)
- **Fine-Tuning Dataset**: [FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) (converted to the Llama-3.1 chat format)
- **Features**:
  - 4-bit quantization for reduced memory usage
  - LoRA adapters (only 1-10% of parameters are updated)
  - Sequence length: 2048 (RoPE scaling supported)
  - Optimized for Tesla T4 GPUs

## 🚀 Quick Start

### Load this model as:
```python
from llama_cpp import Llama
@@ -51,6 +78,130 @@ if __name__ == "__main__":
    response = generate_text(prompt)
    print(f"Prompt: {prompt}\nResponse: {response}")
```

### Installation
```bash
pip install unsloth
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
```

### Load Model
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    dtype=None,  # auto-detect (bf16 on Ampere+ GPUs)
    load_in_4bit=True,
)
```

### Run Inference
```python
FastLanguageModel.for_inference(model)  # enable Unsloth's 2x faster inference mode (see Notes)

messages = [{"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model produces a reply
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    inputs,
    max_new_tokens=64,
    temperature=1.5,
    min_p=0.1,
)
print(tokenizer.decode(outputs[0]))
```

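For interactive use, it is often nicer to stream tokens as they are generated. This is not part of the original snippet; a minimal sketch using the `TextStreamer` class from `transformers`, with the same sampling settings:

```python
from transformers import TextStreamer

# Print the reply token by token instead of waiting for the full sequence.
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(inputs, streamer=streamer, max_new_tokens=64, temperature=1.5, min_p=0.1)
```
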
## 🛠️ Training

### Data Preparation
The dataset is standardized to the Llama-3.1 chat format:
```python
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

tokenizer = get_chat_template(tokenizer, "llama-3.1")  # sets the Llama-3.1 chat template
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)  # converts to role/content format
```

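The trainer below reads a plain `text` column (`dataset_text_field="text"`), so the standardized conversations still need to be rendered with the chat template. A minimal sketch of that step, assuming the `conversations` column produced by `standardize_sharegpt`:

```python
# Render each conversation into a single "text" string that SFTTrainer can consume.
def formatting_prompts_func(examples):
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
```
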
### LoRA Configuration
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
)
```

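To sanity-check the "1-10% of parameters" figure from the overview, the wrapped PEFT model can report its trainable parameter count; a one-line sketch (this assumes `get_peft_model` returns a standard PEFT wrapper):

```python
# Prints trainable vs. total parameters and the trainable percentage.
model.print_trainable_parameters()
```
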
### Training Arguments
```python
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        max_steps=60,  # demo: 60 steps; for full training, use num_train_epochs=1 instead
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)
trainer.train()  # start fine-tuning
```

## 💾 Saving & Deployment

### Save LoRA Adapters
```python
model.save_pretrained("llama3_2_3B")
tokenizer.save_pretrained("llama3_2_3B")
```

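To pick the adapters up again later, for more training or for inference, the same loader can point at the saved directory; a minimal sketch, assuming the `llama3_2_3B` folder from the block above:

```python
from unsloth import FastLanguageModel

# Reload the base model together with the saved LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="llama3_2_3B",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
```
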
### Export to GGUF (for llama.cpp)
```python
model.save_pretrained_gguf(
    "model",
    tokenizer,
    quantization_method="q4_k_m",  # recommended quantization
)
```

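The exported file can then be served with `llama-cpp-python`, as in the loading snippet at the top of this README. A minimal sketch; the exact `.gguf` filename written into the `model` directory depends on the export, so check it before running:

```python
from llama_cpp import Llama

# Path is illustrative; point it at the .gguf file produced by save_pretrained_gguf.
llm = Llama(model_path="model/unsloth.Q4_K_M.gguf", n_ctx=2048)
out = llm("Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,", max_tokens=64)
print(out["choices"][0]["text"])
```
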
### Upload to Hugging Face Hub
```python
model.push_to_hub_gguf(
    "your-username/llama3_2_3B",
    tokenizer,
    quantization_method=["q4_k_m", "q8_0"],  # multiple formats
    token="hf_your_token_here",
)
```

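Rather than pasting a token into the notebook, it can be read from the environment; a small sketch (the `HF_TOKEN` variable name is only a convention, not required by the API):

```python
import os

# Keep credentials out of the source: export HF_TOKEN in your shell,
# then pass token=hf_token to push_to_hub_gguf above.
hf_token = os.environ["HF_TOKEN"]
```
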
## 📊 Performance
| Metric                    | Value        |
|---------------------------|--------------|
| Training Time (60 steps)  | ~7.5 minutes |
| Peak VRAM Usage           | 6.5 GB       |
| Quantized Size (Q4_K_M)   | ~1.9 GB      |

## 📜 Notes
- **Knowledge Cutoff**: December 2023 (updated to July 2024 via fine-tuning)
- Use `temperature=1.5` and `min_p=0.1` for best results ([reference](https://x.com/menhguin/status/1826132708508213629))
- For 2x faster inference, enable `FastLanguageModel.for_inference(model)`

## 🤝 Contributing
- Report issues
- Star the repo if you find this useful! ⭐

## License
Apache 2.0. See the license information at the top of this model card.

### Key Fixes Added:
1. **Model Download**: Uses `huggingface_hub` to properly download the GGUF file
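
For reference, that download step typically looks like the following; the repo id and filename here are placeholders, so substitute the actual values for this model:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id and filename: replace with the real GGUF repo and file name.
model_path = hf_hub_download(
    repo_id="your-username/llama3_2_3B",
    filename="unsloth.Q4_K_M.gguf",
)
print(model_path)  # local path to pass to llama_cpp.Llama(model_path=...)
```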