Dataset pruning/cleaning/dedup
A collection by stereoplegic
AlpaGasus: Training A Better Alpaca with Fewer Data • arXiv:2307.08701 • 22 upvotes
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset • arXiv:2303.03915 • 6 upvotes
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset • arXiv:2309.04662 • 22 upvotes
SlimPajama-DC: Understanding Data Combinations for LLM Training • arXiv:2309.10818 • 10 upvotes
When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale • arXiv:2309.04564 • 15 upvotes
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages • arXiv:2309.09400 • 84 upvotes
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only • arXiv:2306.01116 • 32 upvotes
Self-Alignment with Instruction Backtranslation • arXiv:2308.06259 • 41 upvotes
NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework • arXiv:2111.04130 • 1 upvote
Magicoder: Source Code Is All You Need • arXiv:2312.02120 • 80 upvotes
LLM360: Towards Fully Transparent Open-Source LLMs • arXiv:2312.06550 • 57 upvotes
Automated Data Curation for Robust Language Model Fine-Tuning • arXiv:2403.12776
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models • arXiv:2405.20541 • 21 upvotes
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training • arXiv:2406.10670 • 4 upvotes
DataComp-LM: In search of the next generation of training sets for language models • arXiv:2406.11794 • 50 upvotes
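To make the collection's theme concrete, here is a minimal sketch of perplexity-based data pruning with a small reference model, in the spirit of "Perplexed by Perplexity" (arXiv:2405.20541): score each document under a small language model and keep only documents whose perplexity falls inside a chosen band. The model name ("gpt2"), the keep-band values, and the truncation length below are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: perplexity-based data pruning with a small reference model.
# All concrete choices (model, band, max_tokens) are placeholder assumptions.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # small reference model; an assumption, not the paper's choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def perplexity(text: str, max_tokens: int = 512) -> float:
    """Perplexity of `text` under the reference model: exp(mean next-token NLL)."""
    ids = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=max_tokens).input_ids
    # Passing labels=input_ids makes the model return the mean cross-entropy loss.
    loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def prune(corpus: list[str], low: float = 20.0, high: float = 200.0) -> list[str]:
    """Keep documents whose perplexity lies in [low, high].
    Very low perplexity often signals boilerplate or duplicates; very high
    often signals noise. The band values here are arbitrary placeholders."""
    return [doc for doc in corpus if low <= perplexity(doc) <= high]

if __name__ == "__main__":
    docs = [
        "The quick brown fox jumps over the lazy dog.",
        "click here click here click here click here",
        "asdf qwerty zxcv 12345 !!!! ~~~~",
    ]
    print(prune(docs))
```

In practice the scoring would be batched over the corpus, and the keep-band would be tuned by sweeping downstream performance rather than fixed a priori, which is the kind of question the papers above study.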