Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 14
Article BM25 for Python: Achieving high performance while simplifying dependencies with *BM25S*⚡ By xhluca • 7 days ago • 25
Article 📚 Training Data Transparency in AI: Tools, Trends, and Policy Recommendations 🗳️ By yjernite • Dec 5, 2023 • 1
Article Ethics and Society Newsletter #6: Building Better AI: The Importance of Data Quality 9 days ago • 23
Article Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation By davanstrien • 13 days ago • 11
Article Open-source embeddings and LLMs outperform Gemini and OpenAI for Web Navigation while being faster and cheaper By dhuynh95 • 12 days ago • 4
Article BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks 15 days ago • 30
Article Unveiling CIVICS: A New Dataset for Examining Cultural Values in Language Models By giadap • 14 days ago • 7
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback Paper • 2406.09279 • Published 19 days ago • 1
Tulu V2.5 Suite Collection A suite of models trained using DPO and PPO across a wide variety (up to 14) of preference datasets. See https://arxiv.org/abs/2406.09279 for more! • 41 items • Updated 19 days ago • 9
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs Paper • 2402.14740 • Published Feb 22 • 6
Article Reports on the Hub: A First Look at Self-governance in Open Source AI Development By frimelle • 21 days ago • 7
Article How to build an interactive HF Space to visualize an Image Dataset By MarkusStoll • Dec 18, 2023 • 2
Hugging Face community’s Wikimedia datasets Collection Wikimedia datasets created by the Hugging Face community, not Wikimedia. Sorted by Wikimedia project. • 17 items • Updated 26 days ago • 6
StarChat2 15B Collection Model, datasets, and demo for StarChat2 15B. For code to train the models, see: https://github.com/huggingface/alignment-handbook • 10 items • Updated Apr 12 • 13
Article How to directly access 150k+ Hugging Face Datasets with DuckDB and query using GPT-4o By chilijung • May 31 • 10
Article Wikipedia's Treasure Trove: Advancing Machine Learning with Diverse Data By frimelle • 30 days ago • 12
📚 FineWeb-Edu Collection FineWeb-Edu datasets, classifier and ablation model • 5 items • Updated 21 days ago • 7
CommonCanvas Collection Collection of models trained on the CommonCatalogue datasets • 8 items • Updated May 16 • 6
CommonCatalog Collection Common Catalog, a dataset with Creative Commons licensed images and machine-generated caption pairs • 8 items • Updated May 16 • 12
Wikimedia Datasets Collection Wikimedia datasets, across languages and modalities, from different Wikimedia projects, on the hub. Not all tested. • 19 items • Updated May 16 • 9
Article ⚗️ 🧑🏼🌾 Let's grow some Domain Specific Datasets together By burtenshaw • Apr 29 • 27
Article Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data By Pclanglais • Apr 18 • 20
Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies Paper • 2404.08197 • Published Apr 12 • 26
Article Policy Questions Blog 1: AI Data Transparency Remarks for NAIAC Panel 📚🔍⚖️ By yjernite • Mar 27 • 2
Creación de corpus en comunidad Collection A collection of collaborative efforts to create high-quality Spanish corpora. Any Spanish speaker can contribute :) • 7 items • Updated May 6 • 6
Common Corpus Collection The largest public domain dataset for training LLMs. • 27 items • Updated 16 days ago • 107
Chronos Models & Datasets Collection Chronos: Pretrained (language) models for time series forecasting based on the T5 architecture. • 8 items • Updated 5 days ago • 26
MetricX-23 Collection A collection of MetricX-23 models (https://aclanthology.org/2023.wmt-1.63/) • 6 items • Updated 6 days ago • 13
Awesome Document AI Collection A collection of open-source document AI 📄 📝 📈 • 27 items • Updated Mar 11 • 41
🇮🇹 Italian NLP Resources Collection Collection of models, datasets and demos relevant to Italian NLP 🇮🇹 • 189 items • Updated 1 day ago • 18
Sora Reference Papers Collection A collection of all papers referenced in OpenAI's "Video generation models as world simulators" technical report • openai.com/sora • 30 items • Updated Feb 20 • 51
GritLM Collection Generative Representational Instruction Tuning (GRIT) • 64 items • Updated Apr 17 • 5
⛔️🔦 Provenance, Watermarking & Deepfake Detection Collection Technical tools for more control over non-consensual synthetic content • 14 items • Updated Apr 1 • 37
Historic Newspaper Datasets Collection Historic Newspaper Datasets on the Hub • 13 items • Updated May 2 • 3
🔍 Daily Picks in Interpretability & Analysis of LMs Collection Outstanding research in interpretability and evaluation of language models, summarized • 59 items • Updated about 12 hours ago • 67
WAVES Collection Benchmarking the Robustness of Image Watermarks. Under development. Data will be released soon. • 2 items • Updated Jan 24 • 2
Zeroshot Classifiers Collection These are my current best zeroshot classifiers. Some of my older models are downloaded more often, but the models in this collection are newer/better. • 11 items • Updated Apr 3 • 84
Korean Datasets I've released so far. Collection A collection of the Korean datasets I have uploaded so far. • 8 items • Updated May 24 • 15
Tulu V2 Suite Collection The set of models associated with the paper "Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2" • 19 items • Updated 22 days ago • 43
Custom Components ✨ Collection Awesome Gradio custom components to get you started building your own! • 7 items • Updated Nov 20, 2023 • 33
Reward models on the hub Collection UNMAINTAINED: See RewardBench... A place to collect reward models, an often not released artifact of RLHF. • 18 items • Updated Apr 13 • 24
Medical QA Datasets Collection A collection of medical question answering (QA) datasets • 19 items • Updated Oct 31, 2023 • 16
Leaderboards and benchmarks ✨ Collection Cool leaderboard spaces collection for models across modalities! Text, vision, audio, ... • 64 items • Updated 22 days ago • 69
Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness Paper • 2302.10893 • Published Feb 7, 2023 • 6
AI Ethics projects in Spanish Collection Datasets, models and spaces related to hate speech detection and bias evaluation in Spanish. • 17 items • Updated Apr 13 • 6
Resources: Bias, Stereotypes, and Representational Harms Collection Linking collected resources for this category that have a dataset, model, or demo on Hugging Face or a paper on arXiv (linked through Hugging Face) • 20 items • Updated Feb 17 • 1