From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning Paper • 2308.12032 • Published Aug 23, 2023 • 1
Know thy corpus! Robust methods for digital curation of Web corpora Paper • 2003.06389 • Published Mar 13, 2020 • 1
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation Paper • 2305.06156 • Published May 9, 2023 • 1
End-to-end Knowledge Retrieval with Multi-modal Queries Paper • 2306.00424 • Published Jun 1, 2023 • 1
SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning Paper • 2305.15486 • Published May 24, 2023 • 1
Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression Paper • 2306.15063 • Published Jun 26, 2023 • 1
NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework Paper • 2111.04130 • Published Nov 7, 2021 • 1
Oasis: Data Curation and Assessment System for Pretraining of Large Language Models Paper • 2311.12537 • Published Nov 21, 2023 • 1
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models Paper • 2403.00231 • Published Mar 1, 2024 • 1
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models Paper • 2402.03300 • Published Feb 5, 2024 • 66
Better Synthetic Data by Retrieving and Transforming Existing Datasets Paper • 2404.14361 • Published Apr 22, 2024 • 1
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach Paper • 2405.15613 • Published May 24, 2024 • 12
SemCoder: Training Code Language Models with Comprehensive Semantics Paper • 2406.01006 • Published Jun 2024