mapama247's Collections
Synthetic Data Generation
Textbooks Are All You Need
Paper
•
2306.11644
•
Published
•
142
Textbooks Are All You Need II: phi-1.5 technical report
Paper
•
2309.05463
•
Published
•
87
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Paper
•
2305.07759
•
Published
•
33
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Paper
•
2406.20094
•
Published
•
95
Instruction Pre-Training: Language Models are Supervised Multitask Learners
Paper
•
2406.14491
•
Published
•
86
Improving Text Embeddings with Large Language Models
Paper
•
2401.00368
•
Published
•
79
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
Paper
•
2305.14233
•
Published
•
6
Magicoder: Source Code Is All You Need
Paper
•
2312.02120
•
Published
•
80
Adapting Large Language Models via Reading Comprehension
Paper
•
2309.09530
•
Published
•
77
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Paper
•
2401.01335
•
Published
•
64
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Paper
•
2406.08464
•
Published
•
65
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation
Paper
•
2312.14187
•
Published
•
49
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Paper
•
2401.16380
•
Published
•
48
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
Paper
•
2402.13064
•
Published
•
47
AgentInstruct: Toward Generative Teaching with Agentic Flows
Paper
•
2407.03502
•
Published
•
48
Toward General Instruction-Following Alignment for Retrieval-Augmented Generation
Paper
•
2410.09584
•
Published
•
47
Self-Alignment with Instruction Backtranslation
Paper
•
2308.06259
•
Published
•
41
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Paper
•
2402.10176
•
Published
•
36
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
Paper
•
2402.10379
•
Published
•
30
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper
•
2404.07503
•
Published
•
29
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Paper
•
2312.06585
•
Published
•
28
Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning
Paper
•
2307.03692
•
Published
•
25
AlpaGasus: Training A Better Alpaca with Fewer Data
Paper
•
2307.08701
•
Published
•
22
Simple synthetic data reduces sycophancy in large language models
Paper
•
2308.03958
•
Published
•
21
CodecLM: Aligning Language Models with Tailored Synthetic Data
Paper
•
2404.05875
•
Published
•
16
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Paper
•
2409.08239
•
Published
•
16
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Paper
•
2304.12244
•
Published
•
13
Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation
Paper
•
2402.18334
•
Published
•
12
Synthesizing Text-to-SQL Data from Weak and Strong LLMs
Paper
•
2408.03256
•
Published
•
10
Self-Instruct: Aligning Language Model with Self Generated Instructions
Paper
•
2212.10560
•
Published
•
9
Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs
Paper
•
2310.13961
•
Published
•
4
STaR: Bootstrapping Reasoning With Reasoning
Paper
•
2203.14465
•
Published
•
8
M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Paper
•
2406.16783
•
Published
•
4
Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations
Paper
•
2310.07849
•
Published
•
2
Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration
Paper
•
2310.09168
•
Published
•
2
Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions
Paper
•
2306.04140
•
Published
•
2
SALMON: Self-Alignment with Principle-Following Reward Models
Paper
•
2310.05910
•
Published
•
2
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Paper
•
2404.14361
•
Published
•
1
Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing
Paper
•
2305.16635
•
Published
•
1
Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena
Paper
•
2407.10627
•
Published
•
1
ZeroGen: Efficient Zero-shot Learning via Dataset Generation
Paper
•
2202.07922
•
Published
•
1
West-of-N: Synthetic Preference Generation for Improved Reward Modeling
Paper
•
2401.12086
•
Published
•
1
Automatic Instruction Evolving for Large Language Models
Paper
•
2406.00770
•
Published
•
2
Generative AI for Synthetic Data Generation: Methods, Challenges and the Future
Paper
•
2403.04190
•
Published
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
Paper
•
2406.15126
•
Published
Large Language Models for Data Annotation: A Survey
Paper
•
2402.13446
•
Published
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias
Paper
•
2306.15895
•
Published
A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models
Paper
•
2404.14445
•
Published
TarGEN: Targeted Data Generation with Large Language Models
Paper
•
2310.17876
•
Published
#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models
Paper
•
2308.07074
•
Published
Self-Rewarding Language Models
Paper
•
2401.10020
•
Published
•
145
Orca 2: Teaching Small Language Models How to Reason
Paper
•
2311.11045
•
Published
•
71
Orca: Progressive Learning from Complex Explanation Traces of GPT-4
Paper
•
2306.02707
•
Published
•
46
WizardCoder: Empowering Code Large Language Models with Evol-Instruct
Paper
•
2306.08568
•
Published
•
28
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
Paper
•
2309.11998
•
Published
•
25
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models
Paper
•
2310.13671
•
Published
•
18
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
Paper
•
2406.13542
•
Published
•
16
Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models
Paper
•
2310.13127
•
Published
•
11
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
Paper
•
2308.09583
•
Published
•
7
GenQA: Generating Millions of Instructions from a Handful of Prompts
Paper
•
2406.10323
•
Published
•
5
UltraFeedback: Boosting Language Models with High-quality Feedback
Paper
•
2310.01377
•
Published
•
5
Model Dementia: Generated Data Makes Models Forget
Paper
•
2305.17493
•
Published
•
4
Large Language Model as a User Simulator
Paper
•
2308.11534
•
Published
•
2
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
Paper
•
2212.09689
•
Published
•
1
Aligning Large Language Models through Synthetic Feedback
Paper
•
2305.13735
•
Published
•
1
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
Paper
•
2305.03047
•
Published
•
1
Mixture of Soft Prompts for Controllable Data Generation
Paper
•
2303.01580
•
Published
•
1
Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs
Paper
•
2402.08005
•
Published
•
1
Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models
Paper
•
2308.12711
•
Published
•
1
Generating Training Data with Language Models: Towards Zero-Shot Language Understanding
Paper
•
2202.04538
•
Published
Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models
Paper
•
2311.00287
•
Published
GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation
Paper
•
2104.08826
•
Published
Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models
Paper
•
2302.00618
•
Published
MIND: Math Informed syNthetic Dialogues for Pretraining LLMs
Paper
•
2410.12881
•
Published
•
1
LAB: Large-Scale Alignment for ChatBots
Paper
•
2403.01081
•
Published
Large Language Models Can Self-Improve
Paper
•
2210.11610
•
Published
Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
Paper
•
2305.14327
•
Published
Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity
Paper
•
2405.16579
•
Published
Data Augmentation using Pre-trained Transformer Models
Paper
•
2003.02245
•
Published
Unsupervised Neural Machine Translation with Generative Language Models Only
Paper
•
2110.05448
•
Published
Instruction Tuning with GPT-4
Paper
•
2304.03277
•
Published
Content preserving text generation with attribute controls
Paper
•
1811.01135
•
Published
Large Language Models Are Human-Level Prompt Engineers
Paper
•
2211.01910
•
Published
•
1