Synthetic dataset generation techniques: generating custom sentence similarity data about 8 hours ago • 7
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 22
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework Paper • 2405.11143 • Published 4 days ago • 27
Phi-3 Collection Phi-3 family of small language and multi-modal models. Language models are available in short- and long-context lengths. • 20 items • Updated 1 day ago • 266
MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels Paper • 2405.07526 • Published 10 days ago • 14
Arabic DPO Datasets Collection Our synthetic DPO datasets for Arabic. • 3 items • Updated 7 days ago • 2
ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata Paper • 2405.09496 • Published 8 days ago • 3
Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning Paper • 2307.03692 • Published Jul 5, 2023 • 24
Optimizing Language Model's Reasoning Abilities with Weak Supervision Paper • 2405.04086 • Published 16 days ago • 1
WizardCoder: Empowering Code Large Language Models with Evol-Instruct Paper • 2306.08568 • Published Jun 14, 2023 • 27
Article 🧑‍⚖️ "Replacing Judges with Juries" using distilabel By alvarobartt • 20 days ago • 14
Domain Specific Data Collection This is a collection of tools for building domain specific datasets using human domain expertise and synthetic data generation. • 3 items • Updated Apr 15 • 2
HistNERo: Historical Named Entity Recognition for the Romanian Language Paper • 2405.00155 • Published 23 days ago • 3
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models Paper • 2404.18796 • Published 24 days ago • 63
Article ⚗️ 🧑🏼‍🌾 Let's grow some Domain Specific Datasets together By burtenshaw • 24 days ago • 26
Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer Paper • 2404.16506 • Published 28 days ago • 1
Can a Multichoice Dataset be Repurposed for Extractive Question Answering? Paper • 2404.17342 • Published 27 days ago • 1
Article Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM By Pclanglais • 27 days ago • 10
Article 🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets By dvilasuero • 27 days ago • 55
Article The Hugging Face Hub for Galleries, Libraries, Archives and Museums Jun 12, 2023 • 1
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22 • 235
Article Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data By Pclanglais • Apr 18 • 20
Article Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B Apr 4 • 20
Article Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset Mar 15 • 5
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models Paper • 2404.03543 • Published Apr 4 • 15
boulderspot Collection find places to climb outside from aerial imagery • 4 items • Updated Apr 1 • 3
ORPO: Monolithic Preference Optimization without Reference Model Paper • 2403.07691 • Published Mar 12 • 57
ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages Paper • 2403.17859 • Published Mar 26 • 2
QuRating: Selecting High-Quality Data for Training Language Models Paper • 2402.09739 • Published Feb 15 • 3
Preference Datasets for KTO Collection This collection contains a list of curated preference datasets for KTO fine-tuning for intent alignment of LLMs through signals. • 5 items • Updated Mar 19 • 10
Common Corpus Collection The largest public domain dataset for training LLMs. • 26 items • Updated Mar 20 • 103
DIBT Prompt Collective Outputs Collection An overview of some of the outputs from the first Data is Better Together task focused on rating the quality of prompts • 5 items • Updated Mar 12 • 1
2024 Paper Reading Sessions Collection Contains some papers that we are reading through internally hosted paper reading sessions. Topics: LLMs, RLHF, synthetic data, benchmarking, etc. • 2 items • Updated Jan 8 • 2
DIBT Prompt collective SPIN Collection This collection contains resources related to the replication of SPIN with the dibt prompt collective dataset • 8 items • Updated Mar 12 • 7
UDOP Collection UDOP is a general multimodal model for document AI • 4 items • Updated 1 day ago • 20
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models Paper • 2403.00231 • Published Mar 1 • 1
NToP: NeRF-Powered Large-scale Dataset Generation for 2D and 3D Human Pose Estimation in Top-View Fisheye Images Paper • 2402.18196 • Published Feb 28 • 1
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning Paper • 2312.01552 • Published Dec 4, 2023 • 26
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement Paper • 2402.14658 • Published Feb 22 • 77
Diversity Measurement and Subset Selection for Instruction Tuning Datasets Paper • 2402.02318 • Published Feb 4 • 2
SelectLLM: Can LLMs Select Important Instructions to Annotate? Paper • 2401.16553 • Published Jan 29 • 3
LESS: Selecting Influential Data for Targeted Instruction Tuning Paper • 2402.04333 • Published Feb 6 • 3
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning Paper • 2402.04833 • Published Feb 7 • 6