arXiv:2412.09990

Small Language Model as Data Prospector for Large Language Model

Published on Dec 13, 2024
Abstract

The quality of instruction data directly affects the performance of fine-tuned Large Language Models (LLMs). Li et al. (2023) previously proposed NUGGETS, which identifies and selects high-quality data from a large dataset by finding the individual instruction examples that significantly improve performance on different tasks when learnt as one-shot instances. In this work, we propose SuperNUGGETS, an improved variant of NUGGETS optimised for efficiency and performance. SuperNUGGETS uses a small language model (SLM) instead of a large language model (LLM) to filter the data for outstanding one-shot instances, and it refines the predefined test set. Experimental results show that the performance of SuperNUGGETS decreases by only 1-2% compared to NUGGETS, while efficiency improves by a factor of 58. Given its significantly lower resource consumption, SuperNUGGETS offers far greater practical utility than the original NUGGETS.
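
The abstract only outlines the scoring idea at a high level. The snippet below is a minimal illustrative sketch, not the authors' implementation, of one-shot influence scoring with a small causal language model via Hugging Face transformers: a candidate instruction example scores highly if prepending it as a one-shot demonstration lowers the model's loss on tasks from a predefined test set. The model name "gpt2", the `completion_loss` and `one_shot_score` helpers, the prompt format, and the loss-difference criterion are all assumptions introduced here for illustration.

```python
# Sketch of one-shot influence scoring with a small language model.
# Assumptions: "gpt2" stands in for the SLM, and candidates / test tasks are
# dicts with "instruction" and "output" fields. Not the paper's exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder small model; the paper's SLM is not specified here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def completion_loss(prompt: str, target: str) -> float:
    """Average cross-entropy of the target tokens conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the target span
    with torch.no_grad():
        out = model(input_ids, labels=labels)
    return out.loss.item()


def one_shot_score(candidate: dict, test_set: list) -> float:
    """Fraction of test tasks whose loss drops when `candidate` is a one-shot demo."""
    wins = 0
    for task in test_set:
        zero_shot = completion_loss(task["instruction"] + "\n", task["output"])
        demo = f"{candidate['instruction']}\n{candidate['output']}\n\n"
        one_shot = completion_loss(demo + task["instruction"] + "\n", task["output"])
        if one_shot < zero_shot:
            wins += 1
    return wins / len(test_set)


# Usage sketch: rank a candidate pool and keep the top-scoring examples
# as the high-quality subset for fine-tuning.
# ranked = sorted(candidates, key=lambda c: one_shot_score(c, test_set), reverse=True)
```

Because the scoring model only runs inference, swapping the LLM for an SLM (as SuperNUGGETS does) reduces the cost of scanning the full candidate pool, which is where the reported efficiency gain comes from.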
