arxiv:2407.05463

Training Task Experts through Retrieval Based Distillation

Published on Jul 7

Submitted by luohy on Jul 9

Authors: Jiaxin Ge, Vijay Viswanathan, Hongyin Luo, Graham Neubig

Abstract

One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, such datasets often do not exist. Existing methods address this by generating such data with large language models (LLMs) and then distilling that knowledge into smaller models. However, these methods are limited by the quality of the LLM's output and tend to generate repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms it into domain-specific data. This method greatly enhances data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on four benchmarks, and the results show that it significantly improves performance by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.
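
To make the "retrieve, then transform" idea concrete, the transformation step might look roughly like the minimal sketch below. Everything in it is an assumption made for illustration: the function names, the prompt wording, the few-shot example fields, and the `llm` callable (any function from prompt string to completion string). The prompts and models actually used in the paper are not given on this page.

```python
def build_transform_prompt(task_instruction, retrieved_row, few_shot_examples):
    """Assemble a prompt asking an LLM to rewrite a retrieved row
    as a new example in the target task's format (hypothetical wording)."""
    lines = [f"Task: {task_instruction}", "", "Examples of the target format:"]
    for ex in few_shot_examples:
        lines.append(f"Input: {ex['input']}")
        lines.append(f"Reasoning: {ex['reasoning']}")
        lines.append(f"Output: {ex['output']}")
        lines.append("")
    lines.append("Retrieved source data:")
    lines.append(retrieved_row)
    lines.append("")
    lines.append(
        "Rewrite the source data as a new example for the task. "
        "Give the input, step-by-step reasoning, and the output."
    )
    return "\n".join(lines)


def transform(task_instruction, retrieved_rows, few_shot_examples, llm):
    """Run every retrieved row through the LLM; `llm` is any callable str -> str."""
    return [
        llm(build_transform_prompt(task_instruction, row, few_shot_examples))
        for row in retrieved_rows
    ]


# Usage with a stub in place of a real model call.
shots = [{"input": "q", "reasoning": "r", "output": "a"}]
rows = ["The Eiffel Tower is in Paris."]
generated = transform(
    "Answer questions about a passage.", rows, shots, llm=lambda prompt: "STUB"
)
```

Passing the model as a plain callable keeps the sketch independent of any particular API; the outputs would then serve as training data for the smaller task-expert model.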


Community

luohy · Paper author · Paper submitter · Jul 9

Need to create high-quality, task-specific datasets but don’t have any existing datasets? Introducing ReBase! Our method retrieves diverse data samples from multiple datasets and transforms them to fit your needs. ReBase allows you to:

Create diverse, high-quality data for your needs.
Own your task-expert models without relying on APIs.
Reduce hallucination by creating your data from grounded information.

ReBase gives you better distilled task-specific models! Compared to existing methods, ReBase has the following advantages:

ReBase enhances data diversity and difficulty! For example, on MCoNaLa (given a Japanese instruction, generate code), ReBase generates harder instructions and programs.
ReBase effectively trains specialized task-expert models. 🔍 It outperforms existing data generation methods by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BBH.
ReBase retrieves individual data points rather than entire datasets, which is more flexible: on each task, it draws from more than 20 distinct datasets!
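
The difference between retrieving whole datasets and retrieving individual data points can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's retriever: the bag-of-words `embed` function stands in for whatever text encoder is actually used, and the dataset names and rows are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a real system would use a learned text encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_rows(task_description, datasets, k=3):
    """Score every row of every dataset against the task and return the top k
    as (score, dataset_name, row) tuples."""
    query = embed(task_description)
    scored = [
        (cosine(query, embed(row)), name, row)
        for name, rows in datasets.items()
        for row in rows
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical miniature corpus: dataset names and rows are made up.
datasets = {
    "qa_pool": ["answer the question about the passage", "who wrote the passage"],
    "code_pool": ["write a python function to sort a list", "reverse a string in python"],
    "nli_pool": ["does the premise entail the hypothesis"],
}

top = retrieve_rows("write python code to sort a list of numbers", datasets, k=3)
sources = {name for _, name, _ in top}  # which datasets contributed rows
```

Because ranking happens per row, the top results can mix rows from several datasets, and a dataset that is mostly irrelevant can still contribute its few useful rows.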
