Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 17
Arabic DPO Datasets Collection Our synthetic DPO datasets for Arabic. • 3 items • Updated 1 day ago • 2
ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata Paper • 2405.09496 • Published 2 days ago • 1
Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning Paper • 2307.03692 • Published Jul 5, 2023 • 24
Optimizing Language Model's Reasoning Abilities with Weak Supervision Paper • 2405.04086 • Published 11 days ago • 1
WizardCoder: Empowering Code Large Language Models with Evol-Instruct Paper • 2306.08568 • Published Jun 14, 2023 • 27
Article 🧑⚖️ "Replacing Judges with Juries" using distilabel By alvarobartt • 15 days ago • 14
Domain Specific Data Collection This is a collection of tools for building domain specific datasets using human domain expertise and synthetic data generation. • 3 items • Updated Apr 15 • 2
HistNERo: Historical Named Entity Recognition for the Romanian Language Paper • 2405.00155 • Published 17 days ago • 2
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models Paper • 2404.18796 • Published 19 days ago • 62
Article ⚗️ 🧑🏼🌾 Let's grow some Domain Specific Datasets together By burtenshaw • 19 days ago • 25
Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer Paper • 2404.16506 • Published 23 days ago • 1
Can a Multichoice Dataset be Repurposed for Extractive Question Answering? Paper • 2404.17342 • Published 22 days ago • 1
Article Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM By Pclanglais • 22 days ago • 10
Article 🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets By dvilasuero • 21 days ago • 54
Article The Hugging Face Hub for Galleries, Libraries, Archives and Museums Jun 12, 2023 • 1
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published 26 days ago • 230
Article Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data By Pclanglais • 30 days ago • 20
Article Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B Apr 4 • 20
Article Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset Mar 15 • 4
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models Paper • 2404.03543 • Published Apr 4 • 15
boulderspot Collection find places to climb outside from aerial imagery • 4 items • Updated Apr 1 • 3
ORPO: Monolithic Preference Optimization without Reference Model Paper • 2403.07691 • Published Mar 12 • 54
ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages Paper • 2403.17859 • Published Mar 26 • 2
QuRating: Selecting High-Quality Data for Training Language Models Paper • 2402.09739 • Published Feb 15 • 3
Preference Datasets for KTO Collection This collection contains a list of curated preference datasets for KTO fine-tuning for intent alignment of LLMs through signals. • 5 items • Updated Mar 19 • 10
Common Corpus Collection The largest public domain dataset for training LLMs. • 26 items • Updated Mar 20 • 102
DIBT Prompt Collective Outputs Collection An overview of some of the outputs from the first Data is Better Together task focused on rating the quality of prompts • 5 items • Updated Mar 12 • 1
2024 Paper Reading Sessions Collection Contains some papers that we are reading in internally hosted paper reading sessions. Topics: LLMs, RLHF, synthetic data, benchmarking, etc. • 2 items • Updated Jan 8 • 2
DIBT Prompt collective SPIN Collection This collection contains resources related to the replication of SPIN with the DIBT prompt collective dataset • 8 items • Updated Mar 12 • 7
UDOP Collection UDOP is a general multimodal model for document AI • 4 items • Updated 9 days ago • 20
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models Paper • 2403.00231 • Published Mar 1 • 1
NToP: NeRF-Powered Large-scale Dataset Generation for 2D and 3D Human Pose Estimation in Top-View Fisheye Images Paper • 2402.18196 • Published Feb 28 • 1
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning Paper • 2312.01552 • Published Dec 4, 2023 • 26
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement Paper • 2402.14658 • Published Feb 22 • 77
Diversity Measurement and Subset Selection for Instruction Tuning Datasets Paper • 2402.02318 • Published Feb 4 • 2
SelectLLM: Can LLMs Select Important Instructions to Annotate? Paper • 2401.16553 • Published Jan 29 • 3
LESS: Selecting Influential Data for Targeted Instruction Tuning Paper • 2402.04333 • Published Feb 6 • 3
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning Paper • 2402.04833 • Published Feb 7 • 6
LLM as a Judge Collection Curated resources that support the use of LLMs to serve as automatic evaluators of other LLM outputs. • 15 items • Updated 1 day ago • 15
Information Extraction Datasets Collection Collection of datasets for various information extraction tasks. • 3 items • Updated Feb 9 • 5
datasets-SPIN Collection Generated synthetic data used to finetune SPIN. • 8 items • Updated Feb 9 • 10
Model Merging Collection Model Merging is a very popular technique nowadays in LLMs. Here is a chronological list of papers in the space that will help you get started with it! • 28 items • Updated Mar 23 • 180
LLM Leaderboard best models ❤️🔥 Collection A daily uploaded list of models with best evaluations on the LLM leaderboard: • 70 items • Updated 1 day ago • 304
Qwen1.5 Collection Qwen1.5 is the improved version of Qwen, the large language model series developed by Alibaba Cloud. • 55 items • Updated 5 days ago • 169