Making Gemma 3 think

Published March 13, 2025

Everybody and their dog is fine-tuning Gemma 3 today, so I thought I'd do a longer post on the tips and sharp edges I've found. Let's go!

  1. You have to install everything from main and nightly. This is what I'm working with to get Unsloth and TRL running:
git+https://github.com/huggingface/transformers@main
git+https://github.com/huggingface/trl.git@main
bitsandbytes
peft

plus these two with --no-deps:

git+https://github.com/unslothai/unsloth-zoo.git@nightly
git+https://github.com/unslothai/unsloth.git@nightly
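
For reference, this is how the two lists above translate into pip commands (one invocation for the main/nightly packages, a second with --no-deps for Unsloth):

```sh
pip install git+https://github.com/huggingface/transformers@main \
            git+https://github.com/huggingface/trl.git@main \
            bitsandbytes peft

pip install --no-deps \
            git+https://github.com/unslothai/unsloth-zoo.git@nightly \
            git+https://github.com/unslothai/unsloth.git@nightly
```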
  2. Vision fine-tuning isn't available in TRL's GRPOTrainer, so stick to text datasets. But there's no need to load the model differently in transformers or Unsloth:
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("google/gemma-3-4b-it")
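
On the Unsloth side, a minimal sketch assuming Unsloth's FastModel class for Gemma 3; the model name, sequence length, and 4-bit flag are illustrative choices, not requirements:

```python
from unsloth import FastModel

# Unsloth patches the model for faster fine-tuning; 4-bit loading is optional.
model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-4b-it",
    max_seq_length = 1024,
    load_in_4bit = True,
)
```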
  3. Will Brown's code to turn GSM8K into a reasoning dataset is a nice toy experiment, sketched just below: https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb
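
In case you don't want to click through, here's a minimal sketch of the idea: prompt for XML-tagged reasoning, extract the reference answer from GSM8K's "#### ..." suffix, and reward exact matches. The tag format and reward value are illustrative, not an exact copy of the gist:

```python
import re
from datasets import load_dataset

SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

def get_gsm8k_questions(split = "train"):
    # GSM8K reference answers end with "#### <number>"
    data = load_dataset("openai/gsm8k", "main")[split]
    return data.map(lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["question"]},
        ],
        "answer": x["answer"].split("####")[-1].strip(),
    })

def correctness_reward(prompts, completions, answer, **kwargs):
    # Reward 2.0 when the text inside <answer> tags matches the reference.
    responses = [completion[0]["content"] for completion in completions]
    extracted = []
    for r in responses:
        match = re.search(r"<answer>(.*?)</answer>", r, re.DOTALL)
        extracted.append(match.group(1).strip() if match else "")
    return [2.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
```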

  4. With a learning rate of 5e-6, rewards and loss stayed flat for the first 100 or so steps.

  5. So far, none of my runs have degraded the outputs after 1 epoch. Therefore, I'm mainly experimenting with bigger LoRA adapters (wired up in the sketch after the config below).

from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 1,
    num_generations = 2,  # completions sampled per prompt for GRPO's group comparison
    max_prompt_length = 256,
    max_completion_length = 1024 - 256,  # the rest of the 1024-token budget goes to reasoning
    num_train_epochs = 1,
    max_steps = 250,  # caps the run before the epoch completes
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none",
)
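
And for the bigger-adapter experiments, a sketch of how a larger LoRA plugs into GRPOTrainer. The rank, alpha, and target modules are my illustrative choices, and the dataset and reward function come from the GSM8K sketch above:

```python
from peft import LoraConfig
from trl import GRPOTrainer

# "Bigger" here just means a higher rank than the usual 8-16.
peft_config = LoraConfig(
    r = 64,
    lora_alpha = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    task_type = "CAUSAL_LM",
)

trainer = GRPOTrainer(
    model = model,
    args = training_args,
    train_dataset = get_gsm8k_questions(),
    reward_funcs = [correctness_reward],
    peft_config = peft_config,
)
trainer.train()
```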

If you want an introduction to GRPO, check out the reasoning course. It walks you through the algorithm, theory, and implementation in a smooth way.

https://huggingface.co/reasoning-course
