Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks Paper • 2204.07705 • Published Apr 16, 2022 • 1
Knowledge-Driven CoT: Exploring Faithful Reasoning in LLMs for Knowledge-intensive Question Answering Paper • 2308.13259 • Published Aug 25, 2023 • 2
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning Paper • 2309.05653 • Published Sep 11, 2023 • 9
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models Paper • 2309.12284 • Published Sep 21, 2023 • 16
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages Paper • 2309.09400 • Published Sep 17, 2023 • 77
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images Paper • 2310.16825 • Published Oct 25, 2023 • 28
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset Paper • 2303.03915 • Published Mar 7, 2023 • 5
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset Paper • 2309.04662 • Published Sep 9, 2023 • 21
SlimPajama-DC: Understanding Data Combinations for LLM Training Paper • 2309.10818 • Published Sep 19, 2023 • 10
Towards Effective Disambiguation for Machine Translation with Large Language Models Paper • 2309.11668 • Published Sep 20, 2023 • 1
Improving Translation Faithfulness of Large Language Models via Augmenting Instructions Paper • 2308.12674 • Published Aug 24, 2023 • 1
TinyStories: How Small Can Language Models Be and Still Speak Coherent English? Paper • 2305.07759 • Published May 12, 2023 • 28
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only Paper • 2306.01116 • Published Jun 1, 2023 • 28
KoSBi: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Application Paper • 2305.17701 • Published May 28, 2023 • 1
M^3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning Paper • 2306.04387 • Published Jun 7, 2023 • 6
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning Paper • 2301.13688 • Published Jan 31, 2023 • 8
PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance Paper • 2306.05443 • Published Jun 8, 2023 • 3
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages Paper • 2309.10661 • Published Sep 19, 2023 • 1
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation Paper • 2305.06156 • Published May 9, 2023 • 1
Searching by Code: a New SearchBySnippet Dataset and SnippeR Retrieval Model for Searching by Code Snippets Paper • 2305.11625 • Published May 19, 2023 • 1
MVP: Multi-task Supervised Pre-training for Natural Language Generation Paper • 2206.12131 • Published Jun 24, 2022 • 1
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI Paper • 2310.16787 • Published Oct 25, 2023 • 3
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs Paper • 2308.13387 • Published Aug 25, 2023 • 1
Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models Paper • 2307.14430 • Published Jul 26, 2023 • 3
FACT: Learning Governing Abstractions Behind Integer Sequences Paper • 2209.09543 • Published Sep 20, 2022 • 1
HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM Paper • 2311.09528 • Published Nov 16, 2023 • 2
Source Prompt: Coordinated Pre-training of Language Models on Diverse Corpora from Multiple Sources Paper • 2311.09732 • Published Nov 16, 2023 • 1
Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math Paper • 2312.17120 • Published Dec 28, 2023 • 24
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training Paper • 2401.00849 • Published Jan 1, 2024 • 14
Scientific and Creative Analogies in Pretrained Language Models Paper • 2211.15268 • Published Nov 28, 2022 • 1
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research Paper • 2402.00159 • Published Jan 31, 2024 • 55
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning Paper • 2402.06619 • Published Feb 9, 2024 • 49
A Touch, Vision, and Language Dataset for Multimodal Alignment Paper • 2402.13232 • Published Feb 20, 2024 • 11
GECTurk: Grammatical Error Correction and Detection Dataset for Turkish Paper • 2309.11346 • Published Sep 20, 2023
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset Paper • 2402.10176 • Published Feb 15, 2024 • 33
MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels Paper • 2405.07526 • Published May 2024 • 14