Dataset and Data processing - a Testerpce Collection

Dataset and Data processing

updated Mar 31

Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

Paper • 2405.20541 • Published May 30, 2024 • 24

RedPajama: an Open Dataset for Training Large Language Models

Paper • 2411.12372 • Published Nov 19, 2024 • 56

Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback

Paper • 2503.22230 • Published Mar 28, 2025 • 44