Model Card for llm-course-hw2-reward-model

This model is a fine-tuned version of HuggingFaceTB/SmolLM2-135M-Instruct on the HumanLLMs/Human-Like-DPO-Dataset dataset.

It is trained using the PPO pipeline to obtain more human-like replies.

Training procedure

The reward model for the PPO was also trained by finetuning the SmolLM2 model, and lies in the same collection.

This model was trained with batch_size=1, grad_accumulation_steps=16, learning_rate=2e-5, kl_coef=0.08 on Human-Like-DPO-Dataset for one epoch.

Mean reward on training set is equal to 6.56

Link to wandb run

Examples

Before:

user

What's your morning routine like?

assistant

My morning routine is always the same, whether I'm in my office, at the gym, or just sipping on a cup of coffee. It usually starts with a quick morning stretch – a gentle 10-15 minutes of stretching to help with muscle relaxation and flexibility. After that, I head out for a quick morning coffee or snack. I usually order breakfast from my bed, which serves as both a mental and physical boost! During the morning, I'm usually in bed, relaxed, getting ready for the day ahead. I like to stay as calm as possible to minimize stress and maintain a high level of productivity. My desk is always a bit cluttered – it's the smell of stale coffee and stale air that always gets me plumb out of sorts!

Once I'm ready to tackle the day, I roll out of bed, grabbing a quick morning coffee or some water, and head out to the workspace. I'm usually working in the cubicle or zone, where I can get my hands in as many positions as possible while being mindful of safety.

Most mornings are focused on the tasks at hand, whether it's reviewing emails, responding to non-urgent messages, working on a deadline, or getting to work

After:

user

What's your morning routine like?

assistant

My mornings are always like a treasure map to the world beyond. Growing up, my dad would always tell me stories of his journey around the world. Little did I know that one day, when I was about eight years old, we would get to experience the very best of what's called "Crocataste" here in Madagascar – a region that's steeped in history, culture, and fascinating natural wonders.

We would spend our morning at the little village where my dad grew up, watching in awe as the animals that lived there laughed and played. It was like watching a show, but everyone was so respectful of their environment and nature. We learned about the incredible biodiversity of this region, and how each species plays its role in keeping the ecosystem in balance.

We also made learning fun and exciting. My dad would regale us with stories about his adventures, from the time he spent with a child who had just learned how to ride a bike to the moments he witnessed a massive thunderstorm during a storm. These stories became the glue that held us as a family, like the characters in a storybook. It was during this time that we started to think about life and our place in the world. We wanted to become a part of this amazing[...]

Framework versions

  • TRL: 0.15.2
  • Transformers: 4.47.0
  • Pytorch: 2.5.1+cu121
  • Datasets: 3.3.1
  • Tokenizers: 0.21.0

Citations

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
Downloads last month
20
Safetensors
Model size
135M params
Tensor type
F32
·
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.

Collection including xiryss/llm-course-hw2-ppo