---
library_name: transformers
tags: []
---

# Model Card for Qwen2.5-0.5B-Instruct (Fine-Tuned on OpenR1-Math-220k; 2% Complete, 50% Underway as of Feb 13)

## Model Details

- **Model Name**: Qwen2.5-0.5B-Instruct (GRPO Fine-Tuned)
- **Model ID**: `Qwen2.5-0.5B-R1subset`
- **License**: [Apache 2.0, or whichever applies]
- **Finetuned From**: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
- **Language(s)**: English (mathematical text)
- **Developed By**: Christian H. Cooper
- **Funding**: Self-sponsored
- **Shared By**: Christian H. Cooper

### Model Description

This model is **Qwen2.5-0.5B-Instruct** fine-tuned on a **2% subset** of the [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k) dataset. I used **Group Relative Policy Optimization (GRPO)** from the `trl` library, guiding the model toward producing well-formatted chain-of-thought answers in:

```
<reasoning>
...
</reasoning>
<answer>
...
</answer>
```

It focuses on math reasoning tasks, learning to generate a step-by-step solution (the `<reasoning>` block) and a numeric or final textual answer (the `<answer>` block). The reward functions encourage correct chain-of-thought structure, numeric answers, and correctness.

### Model Sources

- **GitHub or Repo**: *[Pending]*
- **Paper/Demo**: *[Pending]*

## Uses

### Direct Use

- **Math Problem Solving**: The model reasons through math word problems, providing step-by-step reasoning and a final answer.

### Downstream Use

- **Educational Tools**: Potentially useful in tutoring or step-by-step solution generation.
- **Math Chatbots**: A math helper that responds in a structured `<reasoning>`/`<answer>` format.

### Out-of-Scope Use

- **High-Stakes Decisions**: The model is not guaranteed to be correct for advanced or critical math scenarios (finance, medical, engineering safety).
- **Non-English**: The primary training data is English math text, so reliability in other languages is minimal.

## Bias, Risks, and Limitations

- **Bias**: Although this is a math-focused dataset, any language model can exhibit unintended biases.
- **Risks**: The model may produce mathematically incorrect or incomplete solutions. The partial coverage (2% of the dataset) further limits accuracy.
- **Limitations**:
  - Only partially fine-tuned on 2% of the data, so correctness is not guaranteed.
  - The chain-of-thought is provided for interpretability but may still contain flawed reasoning or leaps.

## How to Get Started

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "HarleyCooper/Qwen.5B-OpenR1Math"  # Will keep the same name through all % iterations.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

prompt = """
Question: It is known that in a convex $n$-gon ($n>3$) no three diagonals pass through the same point. Find the number of points (distinct from the vertices) of intersection of pairs of diagonals.
"""

inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=2000)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)
```

## Training Details

### Training Data

- **Dataset**: A 2% subsample (~4.4k problems) of [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k).
- **Data Format**: Each sample has `problem`, `solution`, and `answer` fields, which are transformed into:
  - `"prompt"`: a single string containing the system instructions plus the problem text.
  - `"answer"`: a string with `<reasoning>` + `<answer>` blocks.

A sketch of this transformation is shown below.
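The following is a minimal sketch of that mapping, not the exact training script. The `SYSTEM_PROMPT` wording, the helper name `to_grpo_example`, and the use of split slicing to take a 2% subset are all assumptions; the card does not publish the exact prompt or sampling method.

```python
from datasets import load_dataset

# Hypothetical system prompt; the exact wording used in training is not published.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

def to_grpo_example(sample):
    """Map one OpenR1-Math-220k row into the prompt/answer format described above."""
    return {
        "prompt": f"{SYSTEM_PROMPT}\n\nQuestion: {sample['problem']}",
        "answer": (
            f"<reasoning>\n{sample['solution']}\n</reasoning>\n"
            f"<answer>\n{sample['answer']}\n</answer>"
        ),
    }

# "train[:2%]" is one way to take a 2% slice; how the subset was actually
# sampled is not stated on this card.
dataset = load_dataset("open-r1/OpenR1-Math-220k", split="train[:2%]")
dataset = dataset.map(to_grpo_example)
```

Keeping the reference `answer` in the same tagged format as the target completions makes the format and correctness rewards computable with a single tag-extraction helper (see the appendix at the end of this card).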
### Training Procedure

- **Framework**: [TRL (v0.14+)](https://github.com/huggingface/trl) with Group Relative Policy Optimization (GRPO).
- **Objective**: Reinforcement learning on chain-of-thought format, numeric correctness, and final-answer consistency.
- **Reward Functions** (hedged sketches of two of these appear in the appendix at the end of this card):
  1. **`xmlcount_reward_func`**: Encourages `<reasoning>`/`<answer>` tag structure.
  2. **`soft_format_reward_func`**: Checks for `<reasoning>.*?</reasoning>\s*<answer>.*?</answer>` in any multiline arrangement.
  3. **`strict_format_reward_func`**: Strict multiline regex for exact formatting.
  4. **`int_reward_func`**: Partial reward if the final `<answer>` is purely numeric.
  5. **`correctness_reward_func`**: Binary reward if the final extracted answer exactly matches the known correct answer.

#### Training Hyperparameters

- **Base Model**: Qwen2.5-0.5B-Instruct
- **Learning Rate**: ~5e-6
- **Batch Size**: 1–2 (due to GPU constraints)
- **Optimizer**: AdamW (β1=0.9, β2=0.99)
- **Scheduler**: Cosine with `warmup_ratio=0.1`
- **Num Generations**: 16 (GRPO config)
- **Training Epochs**: 1 epoch on the 2% subset
- **Hardware**: Single A100 40GB on Colab
- **Max Prompt Length**: 256 tokens
- **Max Completion Length**: 200 tokens

(The appendix also sketches how these settings map onto a TRL `GRPOConfig`.)

### Speeds, Sizes, Times

- **Approx. Steps**: ~200–300 steps for the 2% subset
- **Run Time**: ~1–2 hours on a Colab A100

## Evaluation

### Testing Data

- Currently trained and tested on the same 2% subset. The next step is to evaluate on a held-out portion or the full set to measure true correctness.

### Metrics

- **Format Rewards**: `xmlcount`, `soft_format`, `strict_format`
- **Correctness**: Exact match on the final numeric/string answer
- **Partial Numeric**: `int_reward_func`

### Results

- The model shows a strong improvement in output format (70–80% format compliance) but relatively low exact numeric correctness. Additional epochs or a larger training fraction are needed to improve correctness.

## Environmental Impact

- **Hardware**: Single A100 40GB GPU in a Colab environment
- **Train Time**: ~1–2 hours on the 2% subset
- **Carbon Footprint**: Not measured precisely, but minimal compared to large-scale runs

## Model Architecture & Objective

- **Architecture**: Transformer-based causal language model (Qwen2.5-0.5B)
- **Objective**: RL-based chain-of-thought generation for math reasoning

## Citation

```
@misc{cooperQwen2.5-0.5B,
  title={Qwen2.5-0.5B Fine-Tuned on OpenR1 (2% subset)},
  author={Christian H. Cooper},
  howpublished={\url{https://huggingface.co/Christian-cooper-us/Qwen2.5-0.5B-R1subset}},
  year={2025},
}
```

## Contact

- Maintainer: Christian Cooper ([HarleyCooper](https://huggingface.co/HarleyCooper) on Hugging Face)

---

**Disclaimer**: This model is experimental, trained on only 2% of the dataset. It may produce inaccurate math solutions and is not suitable for high-stakes or time-sensitive deployments.
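---

## Appendix: Reward Function Sketches

The exact reward implementations are not published with this card. Below is a minimal sketch of two of the five functions, assuming TRL's plain-text reward interface: each function receives the batch of generated completions plus any extra dataset columns (here `answer`) as keyword arguments, and returns one score per completion. The reward magnitudes (0.5 and 2.0) are illustrative, not the trained values.

```python
import re

def extract_answer(text):
    """Pull the contents of the <answer> block, if present."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def soft_format_reward_func(completions, **kwargs):
    """Small reward when a completion loosely contains the expected tag layout."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def correctness_reward_func(completions, answer, **kwargs):
    """Binary reward when the extracted answer exactly matches the reference."""
    return [
        2.0 if extract_answer(c) == extract_answer(a) else 0.0
        for c, a in zip(completions, answer)
    ]
```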
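For completeness, here is how the hyperparameters listed under Training Details might map onto a TRL `GRPOConfig` (v0.14+). The `output_dir` is hypothetical, and the batch-size handling is an assumption: in recent TRL versions the per-step batch counts completions and must be divisible by `num_generations`, so the card's "batch size 1–2" most likely refers to prompts per step.

```python
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="qwen2.5-0.5b-grpo-openr1",  # hypothetical output path
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    per_device_train_batch_size=16,  # completions per step; divisible by num_generations
    num_generations=16,              # GRPO group size
    max_prompt_length=256,
    max_completion_length=200,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    # The actual run used all five reward functions; only two are sketched here.
    reward_funcs=[soft_format_reward_func, correctness_reward_func],
    args=config,
    train_dataset=dataset,  # the 2% subset prepared in the Training Data section
)
trainer.train()
```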