Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 24
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 11
Article Training and Finetuning Embedding Models with Sentence Transformers v3 6 days ago • 63
Arabic NoRobots DPO Datasets Collection Our synthetic DPO datasets for Arabic NoRobots. • 4 items • Updated 5 days ago • 3
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers Paper • 2403.02839 • Published Mar 5 • 1
Article ⚗️ 🔥 Building High-Quality Datasets with distilabel and Prometheus 2 By burtenshaw • 5 days ago • 20
sentence-transformers-from-synthetic-data Collection Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model • 3 items • Updated 3 days ago • 15
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework Paper • 2405.11143 • Published 14 days ago • 33
Phi-3 Collection Phi-3 family of small language and multi-modal models. Language models are available in short- and long-context lengths. • 22 items • Updated 3 days ago • 302
MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels Paper • 2405.07526 • Published 21 days ago • 14
Arabic Aya DPO Datasets Collection Our synthetic DPO datasets for Arabic Aya. • 4 items • Updated 5 days ago • 3
ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata Paper • 2405.09496 • Published 18 days ago • 3
Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning Paper • 2307.03692 • Published Jul 5, 2023 • 24
Optimizing Language Model's Reasoning Abilities with Weak Supervision Paper • 2405.04086 • Published 27 days ago • 1
Aloe: A Family of Fine-tuned Open Healthcare LLMs Paper • 2405.01886 • Published about 1 month ago • 2
WizardCoder: Empowering Code Large Language Models with Evol-Instruct Paper • 2306.08568 • Published Jun 14, 2023 • 27
Article 🧑⚖️ "Replacing Judges with Juries" using distilabel By alvarobartt • about 1 month ago • 14
Domain Specific Data Collection This is a collection of tools for building domain specific datasets using human domain expertise and synthetic data generation. • 3 items • Updated Apr 15 • 2
HistNERo: Historical Named Entity Recognition for the Romanian Language Paper • 2405.00155 • Published Apr 30 • 4
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models Paper • 2404.18796 • Published Apr 29 • 66
Article ⚗️ 🧑🏼🌾 Let's grow some Domain Specific Datasets together By burtenshaw • Apr 29 • 27
Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer Paper • 2404.16506 • Published Apr 25 • 1
Can a Multichoice Dataset be Repurposed for Extractive Question Answering? Paper • 2404.17342 • Published Apr 26 • 1
Article Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM By Pclanglais • Apr 26 • 10
Article 🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets By dvilasuero • Apr 26 • 55
Article The Hugging Face Hub for Galleries, Libraries, Archives and Museums Jun 12, 2023 • 1
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22 • 239
Article Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data By Pclanglais • Apr 18 • 20
Article Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B Apr 4 • 20
Article Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset Mar 15 • 5
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models Paper • 2404.03543 • Published Apr 4 • 15
boulderspot Collection find places to climb outside from aerial imagery • 4 items • Updated Apr 1 • 3
ORPO: Monolithic Preference Optimization without Reference Model Paper • 2403.07691 • Published Mar 12 • 58
ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages Paper • 2403.17859 • Published Mar 26 • 2
QuRating: Selecting High-Quality Data for Training Language Models Paper • 2402.09739 • Published Feb 15 • 3
Preference Datasets for KTO Collection This collection contains a list of curated preference datasets for KTO fine-tuning for intent alignment of LLMs through signals. • 5 items • Updated Mar 19 • 10
Common Corpus Collection The largest public domain dataset for training LLMs. • 26 items • Updated Mar 20 • 103
DIBT Prompt Collective Outputs Collection An overview of some of the outputs from the first Data is Better Together task focused on rating the quality of prompts • 5 items • Updated Mar 12 • 1
2024 Paper Reading Sessions Collection Contains some papers that we are reading through internally hosted paper reading sessions. Topics: LLMs, RLHF, synthetic data, benchmarking, etc. • 2 items • Updated Jan 8 • 2
DIBT Prompt collective SPIN Collection This collection contains resources related to the replication of SPIN with the dibt prompt collective dataset • 8 items • Updated Mar 12 • 7
UDOP Collection UDOP is a general multimodal model for document AI • 4 items • Updated 12 days ago • 20
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models Paper • 2403.00231 • Published Mar 1 • 1
NToP: NeRF-Powered Large-scale Dataset Generation for 2D and 3D Human Pose Estimation in Top-View Fisheye Images Paper • 2402.18196 • Published Feb 28 • 1
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning Paper • 2312.01552 • Published Dec 4, 2023 • 26
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement Paper • 2402.14658 • Published Feb 22 • 78
Diversity Measurement and Subset Selection for Instruction Tuning Datasets Paper • 2402.02318 • Published Feb 4 • 2