stereoplegic's Collections

Dataset pruning/cleaning/dedup

- AlpaGasus: Training A Better Alpaca with Fewer Data (arXiv:2307.08701)
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset (arXiv:2303.03915)
- MADLAD-400: A Multilingual And Document-Level Large Audited Dataset (arXiv:2309.04662)
- SlimPajama-DC: Understanding Data Combinations for LLM Training (arXiv:2309.10818)
- When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale (arXiv:2309.04564)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages (arXiv:2309.09400)
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (arXiv:2306.01116)
- Self-Alignment with Instruction Backtranslation (arXiv:2308.06259)
- NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework (arXiv:2111.04130)
- Magicoder: Source Code Is All You Need (arXiv:2312.02120)
- LLM360: Towards Fully Transparent Open-Source LLMs (arXiv:2312.06550)