File size: 13,391 Bytes
00f4a0b 992d190 48116da 8c96772 de12b58 8c96772 de12b58 00f4a0b 99fd578 9cc3643 ee1406d 00f4a0b 1876307 1b4702e ee1406d c00ba5e 72d9f1c 24116da 72d9f1c ee1406d 48116da 61c0c24 6cab427 48116da 6cab427 48116da 6cab427 00f4a0b e7bfb1f ee1406d e7bfb1f 7285d71 e7bfb1f 2bf682e 440a2bc 7c9ad36 440a2bc 2bf682e 325b249 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 |
---
base_model: HuggingFaceTB/SmolLM-135M
datasets:
- LDJnr/Capybara
inference:
parameters:
model_file: biggie_groked_int8_q8_0.gguf
temperature: 1
license: mit
---
### TINY Frankenstein of [SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M) upped to 0.18b
Use this frankenbase for training.
Sorry for the mislabelling, the model is a 0.18b 181m parameter, not 0.15.
I did not except this repo to blow up and now all the training scripts depend on it.
* ## CITE WORK FROM THIS HF PAGE AND [@cognitivecompai](https://huggingface.co/ehartford)'s OPTIMIZER ON YOUR FUTURE PAPERS OR I WILL DRAG YOUR ORG ON TWITTER LIKE I DID WITH COHERE LOL (we're cool now btw, visited them :)
* https://github.com/cognitivecomputations/grokadamw
* https://github.com/SakanaAI/evolutionary-model-merge/
* https://huggingface.co/blog/smollm
>>[!TIP]π§ If you're impppatient, get the trained checkpoint file that runs on 1 cpu core:
>>
>>wget https://huggingface.co/nisten/Biggie-SmoLlm-0.15B-Base/resolve/main/biggie_groked_int8_q8_0.gguf
>>
>>make sure to install latest llama.cpp first, it's easy on linux & mac:
>>
>> git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make -j
Now for the magic trained finetune that runs at insane speeds:
The settings are very finicky so be careful with your experimentation
```verilog
./llama-cli -fa -b 512 -ctv q8_0 -ctk q8_0 --min-p 0.3 --top-p 0.85 --keep -1 \
-p "You are a NASA JPL Scientists. Human: I want to bring my cat to mars." \
--in-prefix "<|im_start|>Human:" --reverse-prompt "Human:" \
-m biggie_groked_int8_q8_0.gguf -co -cnv \
-c 1024 -n 700 --temp 1.5 -ngl 0 -t 1
```
Yup, that's no gpu, 1 cpu core.
This base model was built one via semi-automated continuous merging to figure out the recipe.
Model is more coherent.
The temperature settings and min p etc need to be adjusted but even at default temp0 it was coherent for first 100 tokens.
Amazing option for further training. And this is a merge of the base, not the instruct!
## π§ What's Really Going Down Here?
We're talking about a convergence of whole bunch of stuff, more papers will be written about this:
1. **Evolutionary Merging**:
2. **BitNet Integration**:
4. **Experimental GrokAdamW Optimizer**:
## Prior work, from last week
Credits for optimizer go to [@cognitivecompai](https://github.com/cognitivecomputations/grokadamw) for laying the groundwork with the original GrokAdamW optimizer.
## LETS TRY OUT THE EXPERIMENTAL GROKKED FINETUNE:
```bash
wget https://huggingface.co/nisten/Biggie-SmoLlm-0.15B-Base/resolve/main/biggie_groked_int8_q8_0.gguf
```
Yes we will be talking with a 164mb file that runs at 160 tokens per second on a single cpu core
## you read all of that correctly yes, 1 cpu core 160 tps https://x.com/nisten/status/1819752034305970649
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/nTNISjByBkN7bJZzuOvOw.png)
## π run it with NO GPU and only one CPU core it with these settings
```bash
./llama-cli -n -1 -fa -b 512 -ctv q8_0 -ctk q8_0 -fa --min-p 0.3 --top-p 0.85 --keep -1 -p "You are a NASA JPL Scientists. Human: I want to bring my cat to mars." -m biggie_groked_int8_q8_0.gguf -co -cnv --in-prefix "<|im_start|>Human:" --reverse-prompt "Human:" -c 1024 -n 512 --temp 1.5 -ngl 0
```
## ποΈ Training Tutorial, MAKE YOUR OWN BIGGIE_SMOlLM
Clone the repo like you're stealing code from the future:
```bash
git clone https://github.com/nisten/grokadamw
cd grokadamw
```
Fire up the training script and watch the magic happen:
```bash
python smoltrainer.py
```
## π» Do it from scratch yourself
Install the secret sauce (dependencies):
```bash
pip install torch transformers datasets tqdm
```
make a file named meow.py , copy paste in this code, and then run it ```python meow.py```
```python
import torch
import torch.nn as nn
import logging
from datasets import load_dataset, Dataset
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from torch.cuda.amp import autocast
import warnings
from tqdm import tqdm
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
MODEL_NAME = "nisten/Biggie-SmoLlm-0.15B-Base"
MAX_LENGTH = 2048
BATCH_SIZE = 8
LEARNING_RATE = 2e-4
MAX_STEPS = 3000
GRADIENT_ACCUMULATION_STEPS = 2
NUM_WARMUP_STEPS = 30
OUTPUT_DIR = "./capybara_finetuned_results"
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
class GrokAdamW(torch.optim.Optimizer):
def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2,
alpha_init=0.98, lamb=2.0, gamma=0.1, grokking_signal_fns=None,
grokking_signal_decay_rate=0.1, gradient_clipping=1.0):
defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay,
alpha_init=alpha_init, lamb=lamb, gamma=gamma,
grokking_signal_fns=grokking_signal_fns,
grokking_signal_decay_rate=grokking_signal_decay_rate,
gradient_clipping=gradient_clipping)
super(GrokAdamW, self).__init__(params, defaults)
@torch.no_grad()
def step(self, closure=None):
loss = None
if closure is not None:
with torch.enable_grad():
loss = closure()
for group in self.param_groups:
grokking_signal = self._compute_grokking_signal(group)
for i, p in enumerate(group['params']):
if p.grad is None:
continue
grad = p.grad
if group['gradient_clipping'] > 0:
grad = torch.clamp(grad, -group['gradient_clipping'], group['gradient_clipping'])
state = self.state[p]
if len(state) == 0:
state['step'] = 0
state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
state['grok_ema'] = torch.zeros_like(p, memory_format=torch.preserve_format)
exp_avg, exp_avg_sq, grok_ema = state['exp_avg'], state['exp_avg_sq'], state['grok_ema']
beta1, beta2 = group['betas']
state['step'] += 1
layer_beta1 = beta1 * (1 - group['gamma'])**i
alpha = group['alpha_init'] * torch.exp(torch.tensor(-group['grokking_signal_decay_rate'] * grokking_signal))
grok_ema.mul_(alpha).add_(grad, alpha=1 - alpha)
grok_grad = grad + group['lamb'] * grok_ema
exp_avg.mul_(layer_beta1).add_(grok_grad, alpha=1 - layer_beta1)
exp_avg_sq.mul_(beta2).addcmul_(grok_grad, grok_grad, value=1 - beta2)
denom = exp_avg_sq.sqrt().add_(group['eps'])
step_size = group['lr']
if group['weight_decay'] != 0:
p.data.mul_(1 - group['lr'] * group['weight_decay'])
p.addcdiv_(exp_avg, denom, value=-step_size)
return loss
def _compute_grokking_signal(self, group):
if group['grokking_signal_fns'] is None:
return 0.0
signals = []
for fn in group['grokking_signal_fns']:
try:
signal = fn()
if signal is not None:
signals.append(signal)
except Exception as e:
logger.warning(f"Error in grokking_signal_fn: {e}. Ignoring this function.")
if not signals:
return 0.0
return sum(signals) / len(signals)
def format_capybara_prompts(examples):
texts = []
for conversation in examples['conversation']:
formatted_text = ""
for turn in conversation:
if 'input' in turn:
formatted_text += f"Human: {turn['input']}\n\n"
if 'output' in turn:
formatted_text += f"Assistant: {turn['output']}\n\n"
texts.append(formatted_text.strip())
return {"text": texts}
class CustomTrainer(Trainer):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.grokking_signal = 0.0
def compute_loss(self, model, inputs, return_outputs=False):
labels = inputs.pop("labels")
outputs = model(**inputs)
logits = outputs.logits
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss_fct = nn.CrossEntropyLoss()
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
return (loss, outputs) if return_outputs else loss
def training_step(self, model, inputs):
model.train()
inputs = self._prepare_inputs(inputs)
with autocast(dtype=torch.bfloat16):
loss = self.compute_loss(model, inputs)
if self.args.gradient_accumulation_steps > 1:
loss = loss / self.args.gradient_accumulation_steps
loss.backward()
self.grokking_signal = loss.item()
return loss.detach()
def grokking_signal_fn():
return trainer.grokking_signal
def main():
logger.info(f"π Initializing {MODEL_NAME} finetuning with GrokAdamW")
try:
config = AutoConfig.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
except Exception as e:
logger.error(f"β Failed to load model or tokenizer: {str(e)}")
return
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id
logger.info("π Loading Capybara dataset")
try:
capybara_dataset = load_dataset("LDJnr/Capybara", split="train")
capybara_dataset = capybara_dataset.map(format_capybara_prompts, batched=True, remove_columns=capybara_dataset.column_names)
except Exception as e:
logger.error(f"β Failed to load Capybara dataset: {str(e)}")
return
logger.info(f"π Capybara dataset size: {len(capybara_dataset)}")
def tokenize_function(examples):
return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=MAX_LENGTH)
logger.info("π’ Tokenizing dataset")
tokenized_dataset = capybara_dataset.map(tokenize_function, batched=True, remove_columns=capybara_dataset.column_names)
logger.info("ποΈ Setting up the training arguments")
training_args = TrainingArguments(
output_dir=OUTPUT_DIR,
num_train_epochs=3,
per_device_train_batch_size=BATCH_SIZE,
gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
learning_rate=LEARNING_RATE,
weight_decay=0.01,
bf16=True,
logging_steps=10,
save_steps=300,
save_total_limit=10,
dataloader_num_workers=4,
warmup_steps=NUM_WARMUP_STEPS,
gradient_checkpointing=True,
evaluation_strategy="steps",
eval_steps=300,
max_steps=MAX_STEPS,
fp16=False,
optim="adamw_hf",
lr_scheduler_type="cosine",
load_best_model_at_end=True,
metric_for_best_model="loss",
greater_is_better=False,
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
optimizer = GrokAdamW(
model.parameters(),
lr=LEARNING_RATE,
betas=(0.9, 0.999),
eps=1e-8,
weight_decay=0.01,
alpha_init=0.98,
lamb=2.0,
gamma=0.1,
grokking_signal_fns=[grokking_signal_fn],
grokking_signal_decay_rate=0.1,
gradient_clipping=1.0
)
logger.info("πββοΈ Initializing Trainer with GrokAdamW")
global trainer
trainer = CustomTrainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset,
eval_dataset=tokenized_dataset.select(range(min(1000, len(tokenized_dataset)))),
data_collator=data_collator,
optimizers=(optimizer, None),
)
logger.info("π₯ Starting the training with GrokAdamW")
try:
trainer.train()
except Exception as e:
logger.error(f"β Training failed: {str(e)}")
return
logger.info("πΎ Saving the model")
try:
trainer.save_model(OUTPUT_DIR)
except Exception as e:
logger.error(f"β Failed to save model: {str(e)}")
logger.info("π Finetuning with GrokAdamW completed!")
if __name__ == "__main__":
main()
```
π Now go forth and train, accelerate that code!
> **Note:** You'll need about 14GB of VRAM. If you have 8GB, change to batch size 4.
Results will appear in `./capybara_finetuned_results`
---
### Author
**Nisten Tahiraj**
π’ [rakun.ai](https://rakun.ai)
π Toronto, Canada
---
Happy training!
<video controls autoplay muted src="https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/WCLhKzZWbrLo8BETGaKvI.qt"></video> |