arxiv:2412.09990

Small Language Model as Data Prospector for Large Language Model

Published on Dec 13, 2024

Abstract

The quality of instruction data directly affects the performance of fine-tuned Large Language Models (LLMs). Previously, Li et al. (2023) proposed NUGGETS, which selects high-quality data from a large dataset by identifying the individual instruction examples that, when learnt as one-shot instances, significantly improve performance across different tasks. In this work, we propose SuperNUGGETS, an improved variant of NUGGETS optimised for efficiency and performance. SuperNUGGETS uses a small language model (SLM) instead of a large language model (LLM) to filter the data for outstanding one-shot instances, and it refines the predefined set of tests. Experimental results show that the performance of SuperNUGGETS decreases by only 1-2% compared to NUGGETS, while efficiency increases by a factor of 58. Compared to the original NUGGETS, SuperNUGGETS is therefore more practical, owing to its significantly lower resource consumption.
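The one-shot selection idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the names `one_shot_score`, `select_top`, and `toy_loss` are hypothetical, and the scoring rule (the fraction of predefined tasks whose loss drops when the candidate is used as a one-shot demonstration) is one plausible reading of "improves the performance of different tasks". In a SuperNUGGETS-style setup, the loss function would presumably be backed by a small language model; here it is a toy word-overlap stand-in so the example runs on its own.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: loss_fn(task, demo) returns a model's loss on `task`,
# with `demo` prepended as a one-shot example (None means zero-shot).
LossFn = Callable[[str, Optional[str]], float]


def one_shot_score(candidate: str, tasks: Sequence[str], loss_fn: LossFn) -> float:
    """Fraction of tasks whose loss drops when `candidate` is the one-shot demo."""
    if not tasks:
        raise ValueError("tasks must be non-empty")
    improved = sum(1 for t in tasks if loss_fn(t, candidate) < loss_fn(t, None))
    return improved / len(tasks)


def select_top(candidates: Sequence[str], tasks: Sequence[str],
               loss_fn: LossFn, k: int) -> List[str]:
    """Return the k candidates with the highest one-shot score."""
    scored = [(one_shot_score(c, tasks, loss_fn), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps input order on ties
    return [c for _, c in scored[:k]]


def toy_loss(task: str, demo: Optional[str]) -> float:
    """Toy stand-in for a language model loss (assumption, illustration only):
    each word shared between the demo and the task lowers the loss by 0.1."""
    if demo is None:
        return 1.0
    overlap = len(set(task.split()) & set(demo.split()))
    return 1.0 - 0.1 * overlap


tasks = ["add numbers 3 and 4", "add 1 to 2"]
candidates = ["zzz", "add two numbers"]
best = select_top(candidates, tasks, toy_loss, k=1)
```

With the toy loss, the candidate that shares vocabulary with the tasks lowers the loss on both of them and is selected, while the unrelated candidate improves none. Passing the loss function as a parameter keeps the selection logic independent of which model, small or large, does the scoring.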
