Training data - a floom Collection

floom 's Collections

ShowAndTell-2025-01-30

ShowAndTell-2024-12-03

Coding

ICL

RL

Agents

NLU

RAG

Data Efficient Approaches

Personalization

sentence-transformer-models

Tool Use & more

Feedback Analysis

Memory

SSM

Efficient Serving/Inference

Synthetic Data Generation

Frontier research ideas

Training data

updated Jan 14

AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts

Paper • 2402.07625 • Published Feb 12, 2024 • 15
Rethinking Data Selection for Supervised Fine-Tuning

Paper • 2402.06094 • Published Feb 8, 2024 • 1
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Paper • 2402.13064 • Published Feb 20, 2024 • 48
TnT-LLM: Text Mining at Scale with Large Language Models

Paper • 2403.12173 • Published Mar 18, 2024 • 20
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows

Paper • 2402.10379 • Published Feb 16, 2024 • 31
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

Paper • 2402.10176 • Published Feb 15, 2024 • 36
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training

Paper • 2406.10670 • Published Jun 15, 2024 • 4
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

Paper • 2408.02085 • Published Aug 4, 2024 • 19
PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment

Paper • 2410.13785 • Published Oct 17, 2024 • 19
Evaluating Sample Utility for Data Selection by Mimicking Model Weights

Paper • 2501.06708 • Published Jan 12 • 5