-
Phi-4 Technical Report
Paper • 2412.08905 • Published • 119 -
Evaluating and Aligning CodeLLMs on Human Preference
Paper • 2412.05210 • Published • 51 -
Evaluating Language Models as Synthetic Data Generators
Paper • 2412.03679 • Published • 49 -
Yi-Lightning Technical Report
Paper • 2412.01253 • Published • 29
Collections
Discover the best community collections!
Collections including paper arxiv:2404.07503
-
AgentInstruct: Toward Generative Teaching with Agentic Flows
Paper • 2407.03502 • Published • 51 -
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Paper • 2406.08464 • Published • 70 -
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Paper • 2404.14219 • Published • 257 -
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
Paper • 2402.10379 • Published • 32
-
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 32 -
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Paper • 2404.03715 • Published • 62 -
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Paper • 2406.08464 • Published • 70 -
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
Paper • 2402.13064 • Published • 50
-
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Paper • 2401.16380 • Published • 51 -
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 32 -
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Paper • 2304.12244 • Published • 14 -
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
Paper • 2402.13064 • Published • 50
-
Textbooks Are All You Need
Paper • 2306.11644 • Published • 145 -
Textbooks Are All You Need II: phi-1.5 technical report
Paper • 2309.05463 • Published • 87 -
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Paper • 2305.07759 • Published • 36 -
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Paper • 2406.20094 • Published • 102
-
A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models
Paper • 2406.11289 • Published • 5 -
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 32 -
Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models
Paper • 2407.12327 • Published • 80 -
Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges
Paper • 2408.08946 • Published • 12
-
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Paper • 2305.13169 • Published • 3 -
A Survey on Data Selection for Language Models
Paper • 2402.16827 • Published • 4 -
HuggingFaceFW/fineweb-edu
Viewer • Updated • 3.5B • 168k • 715 -
allenai/MADLAD-400
Updated • 41.4k • 144
-
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Paper • 2404.14361 • Published • 2 -
Generative AI for Synthetic Data Generation: Methods, Challenges and the Future
Paper • 2403.04190 • Published • 1 -
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 32 -
A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models
Paper • 2404.14445 • Published
-
Phi-4 Technical Report
Paper • 2412.08905 • Published • 119 -
Evaluating and Aligning CodeLLMs on Human Preference
Paper • 2412.05210 • Published • 51 -
Evaluating Language Models as Synthetic Data Generators
Paper • 2412.03679 • Published • 49 -
Yi-Lightning Technical Report
Paper • 2412.01253 • Published • 29
-
AgentInstruct: Toward Generative Teaching with Agentic Flows
Paper • 2407.03502 • Published • 51 -
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Paper • 2406.08464 • Published • 70 -
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Paper • 2404.14219 • Published • 257 -
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
Paper • 2402.10379 • Published • 32
-
Textbooks Are All You Need
Paper • 2306.11644 • Published • 145 -
Textbooks Are All You Need II: phi-1.5 technical report
Paper • 2309.05463 • Published • 87 -
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Paper • 2305.07759 • Published • 36 -
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Paper • 2406.20094 • Published • 102
-
A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models
Paper • 2406.11289 • Published • 5 -
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 32 -
Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models
Paper • 2407.12327 • Published • 80 -
Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges
Paper • 2408.08946 • Published • 12
-
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 32 -
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Paper • 2404.03715 • Published • 62 -
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Paper • 2406.08464 • Published • 70 -
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
Paper • 2402.13064 • Published • 50
-
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Paper • 2305.13169 • Published • 3 -
A Survey on Data Selection for Language Models
Paper • 2402.16827 • Published • 4 -
HuggingFaceFW/fineweb-edu
Viewer • Updated • 3.5B • 168k • 715 -
allenai/MADLAD-400
Updated • 41.4k • 144
-
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Paper • 2401.16380 • Published • 51 -
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 32 -
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Paper • 2304.12244 • Published • 14 -
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
Paper • 2402.13064 • Published • 50
-
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Paper • 2404.14361 • Published • 2 -
Generative AI for Synthetic Data Generation: Methods, Challenges and the Future
Paper • 2403.04190 • Published • 1 -
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 32 -
A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models
Paper • 2404.14445 • Published