This model was trained with reinforcement learning on the GSM8K dataset, learning to generate reasoning chains and formatted outputs even though the training prompts provide no intermediate reasoning steps. A reward function guides the model, prioritizing answer correctness and adherence to an XML output format.
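The exact reward code is not reproduced on this card; the sketch below only illustrates the kind of reward functions described above (format adherence plus answer correctness) in the style accepted by TRL's `GRPOTrainer`. The tag names, weights, and function names are assumptions, not the adapter's actual training code.

```python
import re

# Hypothetical reward functions: each receives a batch of completions and
# returns one score per completion. Extra dataset columns (here `answer`)
# are forwarded by GRPOTrainer as keyword arguments.
XML_PATTERN = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(completions, **kwargs):
    """Reward completions that follow the assumed <reasoning>/<answer> XML layout."""
    return [0.5 if XML_PATTERN.search(c) else 0.0 for c in completions]

def correctness_reward(completions, answer, **kwargs):
    """Reward completions whose extracted answer matches the GSM8K label."""
    rewards = []
    for completion, gold in zip(completions, answer):
        match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
        predicted = match.group(1).strip() if match else ""
        rewards.append(2.0 if predicted == str(gold).strip() else 0.0)
    return rewards
```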

Training Details:

  • Dataset: GSM8K
  • Algorithm: GRPO
  • Hardware: Single NVIDIA GeForce RTX 3090 Ti
  • Training Duration: 250 epochs, ~48 minutes
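Based on the details above, a minimal sketch of what the GRPO run might look like with TRL is shown below. Only GRPO itself and the 200-token completion limit come from this card; the hyperparameters, dataset preprocessing, and reward functions (from the sketch above) are assumptions, and the Unsloth speed-up used for the actual run is not shown.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K answers end with "#### <number>"; keep only the final number as the label
# and expose the question as the "prompt" column GRPOTrainer expects.
def to_prompt(example):
    return {
        "prompt": example["question"],
        "answer": example["answer"].split("####")[-1].strip(),
    }

dataset = load_dataset("openai/gsm8k", "main", split="train").map(to_prompt)

config = GRPOConfig(
    output_dir="qwen2.5-7b-reasoning-adapter",
    max_completion_length=200,  # the 200-token output limit noted under Limitations
    num_generations=8,          # assumed number of samples per prompt for GRPO
    learning_rate=5e-6,         # assumed
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    reward_funcs=[format_reward, correctness_reward],  # reward sketch shown above
    args=config,
    train_dataset=dataset,
)
trainer.train()
```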

Limitations:

The output length limit (200 tokens) restricts the model's ability to generate complex reasoning chains and makes it hard to observe output-length growth during training.

Example:

Which one is bigger? 9.11 or 9.8?
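A sketch of how one might run the adapter on a question like the one above, loading it on top of the base model with `peft`. The prompt wording and XML tags are assumptions mirroring the training format described earlier.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the LoRA adapter from this repository.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "Nagi-ovo/Qwen2.5-7B-Reasoning-Adapter")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

# Assumed prompt format: ask for reasoning and answer in the XML tags used during training.
prompt = (
    "Respond in the following format:\n"
    "<reasoning>...</reasoning>\n<answer>...</answer>\n\n"
    "Which one is bigger? 9.11 or 9.8?"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```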

This Qwen2.5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.


Base model: Qwen/Qwen2.5-7B