gemma3-4b-thinking
This model is a fine-tuned version of google/gemma-3-4b-it trained to enhance its reasoning and step-by-step thinking capabilities. It was trained using TRL with GRPO (Group Relative Policy Optimization).
Model Description
This model was specifically tuned to demonstrate step-by-step reasoning when solving problems, particularly mathematical word problems. The training process used reinforcement learning to reward the model for:
- Providing clear reasoning steps
- Using logical deduction
- Arriving at the correct numerical answer
Quick start
from transformers import pipeline, AutoProcessor

# Load the model and processor
model_id = "real-jiakai/gemma3-4b-thinking"
processor = AutoProcessor.from_pretrained(model_id)
generator = pipeline("text-generation", model=model_id, tokenizer=processor.tokenizer)

# Example math problem (expected answer: (56 + 44) / 4 = 25 students per classroom)
question = "The school principal decided that she wanted every class to have an equal number of boys and girls in each first-grade classroom. There are 4 classrooms. There are 56 boys and 44 girls. How many total students are in each classroom?"

# Format the input with the chat template, keeping it as text and appending the assistant turn
input_text = processor.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

# Generate a response with reasoning
output = generator(input_text, max_new_tokens=1024)
print(output[0]["generated_text"])
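If the model follows the structured output it was rewarded for during training, the final answer can be pulled out with a small helper. The <answer> tag below is an assumption for illustration only; adjust the pattern to whatever structure the model actually emits.

import re

def extract_answer(generated_text):
    # Assumes the final result is wrapped in <answer>...</answer> tags (hypothetical format).
    match = re.search(r"<answer>(.*?)</answer>", generated_text, re.DOTALL)
    return match.group(1).strip() if match else None

print(extract_answer(output[0]["generated_text"]))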
Model Performance
The model demonstrates enhanced reasoning capabilities compared to the base model, particularly for:
- Mathematical word problems
- Step-by-step logical deduction
- Breaking complex problems into solvable components
Training Procedure
This model was trained with GRPO, a method introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
Training Details
- Dataset: GSM8K (Grade School Math 8K), a dataset of diverse grade school math word problems
- Fine-tuning Method: GRPO (Group Relative Policy Optimization)
- Training Steps: 100
- Batch Size: 2
- Learning Rate: 5e-6
- Hardware: A100 80GB GPU
- Parameter-Efficient Fine-Tuning: Used LoRA with r=16, alpha=32
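For reference, here is a minimal sketch of how these settings map onto TRL's GRPOTrainer. Only the hyperparameters above come from this card; the dataset preparation, the reward function, the generation budget, and the LoRA target modules are illustrative assumptions.

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a "prompt" column, so map GSM8K's "question" onto it.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

# Illustrative correctness reward; the actual reward functions are described below.
def correctness_reward(prompts, completions, answer, **kwargs):
    return [1.0 if a.split("####")[-1].strip() in c else 0.0
            for c, a in zip(completions, answer)]

training_args = GRPOConfig(
    output_dir="gemma3-4b-thinking",
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    max_steps=100,
    max_completion_length=1024,  # assumed generation budget
)

trainer = GRPOTrainer(
    model="google/gemma-3-4b-it",
    reward_funcs=[correctness_reward],
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    ),
)
trainer.train()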
Reward Functions
The training used multiple reward functions to guide the model:
- Correctness of final answer
- Using proper numerical formats
- Demonstrating clear reasoning steps
- Following structured formats
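As an illustration of the format-oriented rewards, a minimal check might look like the sketch below. The <reasoning>/<answer> tag names are an assumption, since the card does not state the exact structure the model was rewarded for.

import re

def format_reward(completions, **kwargs):
    # Assumed target structure: <reasoning>...</reasoning> followed by <answer>...</answer>.
    pattern = r"^<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$"
    return [0.5 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]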
Framework versions
- TRL: 0.16.0.dev0
- Transformers: 4.50.0.dev0
- PyTorch: 2.6.0
- Datasets: 3.3.2
- Tokenizers: 0.21.1
Limitations
- The model sometimes reverts to its base output format rather than following the structured reasoning format used during training
- Performance may vary across different types of problems
- The model is primarily optimized for mathematical reasoning and may not show the same level of improvement on other tasks
Ethics and Responsible Use
- This model is intended to demonstrate reasoning capabilities and should not be used as a sole solution for educational assessments
- Users should verify mathematical results independently for critical applications
- The model can still make reasoning errors despite showing its work
Citations
@article{gemma_2025,
title={Gemma 3},
url={https://goo.gle/Gemma3Report},
publisher={Kaggle},
author={Gemma Team},
year={2025}
}
@article{shao2024deepseekmath,
title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, YK and Wu, Y and others},
journal={arXiv preprint arXiv:2402.03300},
year={2024}
}