LookAhead Tuning: Safer Language Models via Partial Answer Previews
Abstract
Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often undermines their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, which comprises two simple, low-resource, and effective data-driven methods that modify training data by previewing partial answer prefixes. Both methods aim to preserve the model's inherent safety mechanisms by minimizing perturbations to initial token distributions. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs. Code is released at https://github.com/zjunlp/LookAheadTuning.
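The abstract describes the two preview strategies only at a high level. As a minimal illustration, the sketch below shows one plausible way to rewrite training examples so that the prompt previews part of the answer; the function names, prompt wording, prefix length, and the fixed "virtual" prefix are assumptions for exposition, not the authors' released implementation (see the linked repository for that).

```python
# Hypothetical sketch of a "partial answer preview" data transform.
# make_real_preview / make_virtual_preview, the prompt template, and the
# default prefix settings are illustrative assumptions, not the paper's code.

def make_real_preview(instruction: str, answer: str, prefix_tokens: int = 6) -> dict:
    """Real-answer preview: expose the first few answer tokens in the prompt,
    so fine-tuning barely shifts the model's initial token distribution."""
    prefix = " ".join(answer.split()[:prefix_tokens])
    prompt = f"{instruction}\nBegin your answer with: \"{prefix}\""
    return {"prompt": prompt, "response": answer}


def make_virtual_preview(instruction: str, answer: str,
                         virtual_prefix: str = "Let me answer step by step.") -> dict:
    """Virtual preview: prepend a fixed, content-free prefix to every answer,
    so the first tokens the model learns to emit are identical across examples."""
    prompt = f"{instruction}\nBegin your answer with: \"{virtual_prefix}\""
    return {"prompt": prompt, "response": f"{virtual_prefix} {answer}"}


if __name__ == "__main__":
    instruction = "Summarize the key idea of gradient checkpointing."
    answer = ("Gradient checkpointing trades compute for memory by "
              "recomputing activations during the backward pass.")
    print(make_real_preview(instruction, answer))
    print(make_virtual_preview(instruction, answer))
```

Because the previewed tokens are already present in the prompt, the loss on the first response tokens is small, which is how both variants aim to leave the model's initial token distribution (and hence its existing safety behavior) largely undisturbed.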
Community
The following related papers were recommended by the Semantic Scholar API:
- Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning (2025)
- Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models (2025)
- Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation (2025)
- Safety Alignment Depth in Large Language Models: A Markov Chain Perspective (2025)
- Federated Fine-Tuning of Large Language Models: Kahneman-Tversky vs. Direct Preference Optimization (2025)
- How Much Do Code Language Models Remember? An Investigation on Data Extraction Attacks before and after Fine-tuning (2025)
- MetaSC: Test-Time Safety Specification Optimization for Language Models (2025)
- Reasoning with Reinforced Functional Token Tuning (2025)