Enable Language Models to Implicitly Learn Self-Improvement From Data

Published on Oct 2, 2023
· Featured in Daily Papers on Oct 3, 2023
Le Hou ,


Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks. However, the inherent open-ended nature of these tasks implies that there is always room for improvement in the quality of model responses. To address this challenge, various approaches have been proposed to enhance the performance of LLMs. There has been a growing focus on enabling LLMs to self-improve their response quality, thereby reducing the reliance on extensive human annotation efforts for collecting diverse and high-quality training data. Recently, prompting-based methods have been widely explored among self-improvement methods owing to their effectiveness, efficiency, and convenience. However, those methods usually require explicitly and thoroughly written rubrics as inputs to LLMs. It is expensive and challenging to manually derive and provide all necessary rubrics with a real-world complex goal for improvement (e.g., being more helpful and less harmful). To this end, we propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data. PIT only requires preference data that are used to train reward models without extra human efforts. Specifically, we reformulate the training objective of reinforcement learning from human feedback (RLHF) -- instead of maximizing response quality for a given input, we maximize the quality gap of the response conditioned on a reference response. In this way, PIT is implicitly trained with the improvement goal of better aligning with human preferences. Experiments on two real-world datasets and one synthetic dataset show that our method significantly outperforms prompting-based methods.


My summary: LLMs keep getting more capable at generating natural language. But there's always room for improving the quality and alignment of their responses.

Typically this requires lots of human effort to collect more training data. So researchers are exploring ways for models to self-improve without human involvement.

Many methods use prompting - giving the LLM instructions to critique and refine its responses. But coming up with comprehensive prompts is challenging.

The new approach PIT lets models learn self-improvement implicitly from human preference data instead. It reformulates reinforcement learning to maximize the gap between an original response and improved response conditioned on the original.

This taps into the implicit guidance in the preference data on what constitutes better quality, so no manual rubrics are needed. PIT uses curriculum reinforcement learning - first improving easy references, then switching to the LLM's own samples.

Experiments on real and synthetic datasets show PIT significantly outperforms prompting methods like Self-Refine.

It improved response quality 7-34% across conditions without any human involvement.

This demonstrates a promising direction for LLMs to align better with human preferences autonomously as they learn from experience. No need for human bottlenecks when expanding to new domains or underserved use cases. Very cool!

**TLDR: New method PIT enables LLMs to implicitly learn to refine themselves from human feedback, no prompts needed. Big improvement over prompting approaches.

Full Summary

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite in a model to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite in a dataset to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite in a Space to link it from this page.

Collections including this paper 11