Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

Community Article · Published February 22, 2024

In this paper [1], the authors introduce Web Rephrase Augmented Pre-training (WRAP), a method that makes language model pre-training more compute- and data-efficient by rephrasing web documents into styles such as Wikipedia-like prose or question-answer format. The approach addresses the difficulty of learning from noisy, unstructured web data, which typically demands large amounts of compute and data.

Method Overview

WRAP uses an instruction-tuned model to rephrase web documents into various styles, creating synthetic data. Here's an overview of the method:

[Figure: WRAP method overview]

This enables efficient learning from a blend of real and synthetic data, reducing the reliance on scarce, naturally high-quality web data. The process prompts an off-the-shelf, instruction-tuned LLM to paraphrase web documents and combines the paraphrases with the real data for model training.

Building on the observation that high-quality data, like Wikipedia, improves language modeling, WRAP employs a strategy to rephrase web documents into four distinct styles:

  • Easy - understandable even by a toddler

  • Medium - similar to Wikipedia articles

  • Hard - in terse and abstruse language

  • Q/A - in question-answering format

The prompts for each style are shown below:

[Figure: Prompt templates for the four rephrasing styles]
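As a rough, hypothetical illustration (the exact prompt wording is given in the paper's figure and differs from this), the style-conditioned rephrasing prompts can be thought of as templates wrapped around each web document:

```python
# Illustrative prompt templates for the four WRAP rephrasing styles.
# These approximate the idea of style-conditioned rephrasing; they are
# NOT the exact prompts used in the paper.
REPHRASE_PROMPTS = {
    "easy": (
        "Rewrite the following paragraph using very simple language, "
        "as if explaining it to a small child:\n\n{document}"
    ),
    "medium": (
        "Paraphrase the following paragraph in clear, high-quality English, "
        "in the style of a Wikipedia article:\n\n{document}"
    ),
    "hard": (
        "Rewrite the following paragraph in terse, abstruse, scholarly "
        "language:\n\n{document}"
    ),
    "qa": (
        "Convert the following paragraph into a series of "
        "question-and-answer pairs:\n\n{document}"
    ),
}

def build_prompt(style: str, document: str) -> str:
    """Fill a style template with the raw web document."""
    return REPHRASE_PROMPTS[style].format(document=document)
```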

WRAP uses an instruction-tuned Mistral-7B model to generate the synthetic rephrasings and then combines them with real web data in a 1:1 ratio. The mixture couples the diversity and realistic messiness of internet content with the quality of the structured rephrasings, so the model learns from both.
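A minimal sketch of this rephrase-and-mix step might look like the following; the model checkpoint, sampling settings, and helper names are illustrative assumptions, not the authors' released code:

```python
import random
from transformers import pipeline

# Assumed rephraser checkpoint: the paper uses an instruction-tuned Mistral-7B;
# the exact model id below is an illustrative placeholder.
rephraser = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.1")

def rephrase(document: str, style_instruction: str) -> str:
    """Generate one synthetic rephrasing of a web document."""
    prompt = f"{style_instruction}\n\n{document}"
    out = rephraser(prompt, max_new_tokens=512, do_sample=True,
                    temperature=0.7, return_full_text=False)
    return out[0]["generated_text"]

def wrap_mix(real_docs, style_instructions):
    """Yield real web documents and synthetic rephrasings in a 1:1 ratio."""
    for doc in real_docs:
        yield doc                                               # real web text
        yield rephrase(doc, random.choice(style_instructions))  # synthetic counterpart
```

In practice the rephrasings would be generated once, ahead of pre-training, and the real/synthetic mixture sampled during training, rather than rephrasing on the fly as in this simplified generator.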

Results

Applying WRAP to the C4 dataset speeds up pre-training by roughly 3x and, at an equal compute budget, improves perplexity by more than 10% on average across different subsets of the Pile.

[Figure: WRAP results on C4 (perplexity across Pile subsets)]
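For context, the perplexity comparison boils down to evaluating each pre-trained model on held-out Pile subsets. A minimal sketch of such an evaluation with a Hugging Face causal LM follows; the model identifier and sequence length are placeholder assumptions, not the authors' setup:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, texts: list[str]) -> float:
    """Compute token-level perplexity of a causal LM over a list of documents."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
            # Passing labels=input_ids makes the model return the mean
            # next-token negative log-likelihood for this sequence.
            out = model(**enc, labels=enc["input_ids"])
            n_pred = enc["input_ids"].numel() - 1  # number of predicted tokens
            total_nll += out.loss.item() * n_pred
            total_tokens += n_pred
    return math.exp(total_nll / total_tokens)
```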

It also improves zero-shot question-answering accuracy across 13 tasks by more than 2%.

[Figure: WRAP results on zero-shot question-answering tasks]

Conclusion

WRAP demonstrates significant improvements in the efficiency and effectiveness of language model training by leveraging synthetic rephrases of web data. For more details, please consult the full paper.

Congrats to the authors for their work!

[1] Maini, Pratyush, et al. "Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling." arXiv preprint arXiv:2401.16380 (2024).