Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models Paper • 2404.18796 • Published 3 days ago • 47
Article ⚗️ 🧑🏼🌾 Let's grow some Domain Specific Datasets together By burtenshaw • 3 days ago • 22
Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer Paper • 2404.16506 • Published 7 days ago • 1
Can a Multichoice Dataset be Repurposed for Extractive Question Answering? Paper • 2404.17342 • Published 6 days ago • 1
Article Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM By Pclanglais • 6 days ago • 9
Article 🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets By dvilasuero • 6 days ago • 45
Article The Hugging Face Hub for Galleries, Libraries, Archives and Museums Jun 12, 2023 • 1
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published 10 days ago • 223
Article Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data By Pclanglais • 14 days ago • 19
Article Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B 28 days ago • 19
Article Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset Mar 15 • 3
Article Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 14
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models Paper • 2404.03543 • Published 28 days ago • 15
boulderspot Collection find places to climb outside from aerial imagery • 4 items • Updated Apr 1 • 3
ORPO: Monolithic Preference Optimization without Reference Model Paper • 2403.07691 • Published Mar 12 • 53
ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages Paper • 2403.17859 • Published Mar 26 • 2
QuRating: Selecting High-Quality Data for Training Language Models Paper • 2402.09739 • Published Feb 15 • 3
Preference Datasets for KTO Collection This collection contains a list of curated preference datasets for KTO fine-tuning for intent alignment of LLMs through signals. • 5 items • Updated Mar 19 • 10
Common Corpus Collection The largest public domain dataset for training LLMs. • 26 items • Updated Mar 20 • 98
DIBT Prompt Collective Outputs Collection An overview of some of the outputs from the first Data is Better Together task focused on rating the quality of prompts • 5 items • Updated Mar 12 • 1
2024 Paper Reading Sessions Collection Contains some papers that we are reading through internally hosted paper reading sessions. Topics: LLMs, RLHF, synthetic data, benchmarking, etc. • 2 items • Updated Jan 8 • 2
DIBT Prompt collective SPIN Collection This collection contains resources related to the replication of SPIN with the dibt prompt collective dataset • 8 items • Updated Mar 12 • 7
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models Paper • 2403.00231 • Published Mar 1 • 1
NToP: NeRF-Powered Large-scale Dataset Generation for 2D and 3D Human Pose Estimation in Top-View Fisheye Images Paper • 2402.18196 • Published Feb 28 • 1
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning Paper • 2312.01552 • Published Dec 4, 2023 • 25
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement Paper • 2402.14658 • Published Feb 22 • 77
Diversity Measurement and Subset Selection for Instruction Tuning Datasets Paper • 2402.02318 • Published Feb 4 • 2
SelectLLM: Can LLMs Select Important Instructions to Annotate? Paper • 2401.16553 • Published Jan 29 • 3
LESS: Selecting Influential Data for Targeted Instruction Tuning Paper • 2402.04333 • Published Feb 6 • 3
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning Paper • 2402.04833 • Published Feb 7 • 6
LLM as a Judge Collection Curated resources that support the use of LLMs to serve as automatic evaluators of other LLM outputs. • 14 items • Updated Feb 19 • 14
Information Extraction Datasets Collection Collection of datasets for various information extraction tasks. • 3 items • Updated Feb 9 • 5
datasets-SPIN Collection Generated synthetic data used to finetune SPIN. • 8 items • Updated Feb 9 • 10
Model Merging Collection Model Merging is a very popular technique nowadays in LLM. Here is a chronological list of papers on the space that will help you get started with it! • 28 items • Updated Mar 23 • 172
LLM Leaderboard best models ❤️🔥 Collection A daily uploaded list of models with best evaluations on the LLM leaderboard: • 65 items • Updated 3 days ago • 289
Qwen1.5 Collection Qwen1.5 is the improved version of Qwen, the large language model series developed by Alibaba Cloud. • 55 items • Updated 3 days ago • 158
Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains Paper • 2402.05140 • Published Feb 6 • 18
UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset Paper • 2402.04588 • Published Feb 7 • 2
SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training? Paper • 2402.01832 • Published Feb 2 • 4
Scaling Laws for Downstream Task Performance of Large Language Models Paper • 2402.04177 • Published Feb 6 • 16
Specialized Language Models with Cheap Inference from Limited Domain Data Paper • 2402.01093 • Published Feb 2 • 45
PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software Paper • 2402.00699 • Published Feb 1 • 2
PanAf20K: A Large Video Dataset for Wild Ape Detection and Behaviour Recognition Paper • 2401.13554 • Published Jan 24 • 2
Symbrain: A large-scale dataset of MRI images for neonatal brain symmetry analysis Paper • 2401.11814 • Published Jan 22 • 4
Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text Paper • 2401.12070 • Published Jan 22 • 40