view article Article Docmatix - a huge dataset for Document Visual Question Answering 5 days ago • 43
AgentInstruct: Toward Generative Teaching with Agentic Flows Paper • 2407.03502 • Published 19 days ago • 34
view article Article Experimenting with Automatic PII Detection on the Hub using Presidio 13 days ago • 19
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild Paper • 2407.04172 • Published 18 days ago • 19
view article Article The Great LLM Showdown: Amy's Quest for the Perfect LLM By wolfram • 13 days ago • 11
🇩🇪German SFT and DPO datasets Collection Datasets that can be used for LLM training with axolotl, trl or llama_factory. • 30 items • Updated May 27 • 9
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published 27 days ago • 75
view article Article Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models 29 days ago • 142
view article Article BM25 for Python: Achieving high performance while simplifying dependencies with *BM25S*⚡ By xhluca • 13 days ago • 29
GenQA: Generating Millions of Instructions from a Handful of Prompts Paper • 2406.10323 • Published Jun 14 • 5
Embedding Model Datasets Collection A curated subset of the datasets that work out of the box with Sentence Transformers: https://huggingface.co/datasets?other=sentence-transformers • 67 items • Updated 19 days ago • 52
GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks Paper • 2406.12925 • Published Jun 14 • 20
Instruction Pre-Training: Language Models are Supervised Multitask Learners Paper • 2406.14491 • Published Jun 20 • 76
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models Paper • 2404.18796 • Published Apr 29 • 67
TabuLa-8B Collection Training, eval suite, and model from the paper "Large Scale Transfer Learning for Tabular Data via Language Modeling" https://arxiv.org/abs/2406.12031 • 4 items • Updated Jun 19 • 9
Depth Anything v2 Release Collection A comprehensive collection on DAv2 • 5 items • Updated Jun 18 • 9
FP8 LLMs for vLLM Collection Accurate FP8 quantized models by Neural Magic, ready for use with vLLM! • 22 items • Updated about 19 hours ago • 32
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing Paper • 2406.08464 • Published Jun 12 • 49
Local Function Calling Gems Collection These are the best function calling LLMs one can run on less than 64GB VRAM/Unified Memory. I use these on a M1 Max Macbook 64GB. • 6 items • Updated 21 days ago • 3
Qwen2 Collection Qwen2 language models, including pretrained and instruction-tuned models of 5 sizes, including 0.5B, 1.5B, 7B, 57B-A14B, and 72B. • 29 items • Updated Jun 6 • 260
DeTikZify Collection Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ • 9 items • Updated Jun 3 • 2
view article Article Releasing Common Corpus: the largest public domain dataset for training LLMs By Pclanglais • Mar 20 • 13
view article Article How to directly access 150k+ Hugging Face Datasets with DuckDB and query using GPT-4o By chilijung • May 31 • 10
view article Article ⚗️ 🔥 Building High-Quality Datasets with distilabel and Prometheus 2 By burtenshaw • Jun 3 • 21
sentence-transformers-from-synthetic-data Collection Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model • 4 items • Updated Jun 21 • 20
view article Article Training and Finetuning Embedding Models with Sentence Transformers v3 May 28 • 124
Granite Code Models: A Family of Open Foundation Models for Code Intelligence Paper • 2405.04324 • Published May 7 • 15
view article Article GPU Poor Savior: Revolutionizing Low-Bit Open Source LLMs and Cost-Effective Edge Computing By NicoNico • May 25 • 9
DiscoLeo 8B: Llama3 for German Collection Continued Pretraining on Llama3 8B to improve German linguistic capabilities. A collection of base and fine-tuned models and variants. • 5 items • Updated May 25 • 14
DiscoLeo 8B quants Collection A collection of different quantizations of the DiscoLeo models. • 3 items • Updated May 25 • 3
C4AI Aya 23 Collection Aya 23 is an open weights research release of an instruction fine-tuned model with highly advanced multilingual capabilities. • 3 items • Updated May 23 • 40
view article Article ⚗️ 🧑🏼🌾 Let's grow some Domain Specific Datasets together By burtenshaw • Apr 29 • 28
C4AI Command R Plus Collection C4AI Command R+ is an open weights research release of a 104B billion parameter model with highly advanced capabilities. • 3 items • Updated May 23 • 27
Phi-3 Collection Phi-3 family of small language and multi-modal models. Language models are available in short- and long-context lengths. • 23 items • Updated 11 days ago • 372
CommonCatalog Collection Common Catalog, a dataset with Creative Commons licensed images and machine-generated caption pairs • 8 items • Updated May 16 • 13