Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 11
CommonCanvas Collection Collection of models trained on the CommonCatalog datasets • 8 items • Updated 15 days ago • 6
CommonCatalog Collection Common Catalog, a dataset with Creative Commons licensed images and machine-generated caption pairs • 8 items • Updated 15 days ago • 7
Wikimedia Datasets Collection Wikimedia datasets, across languages and modalities, from different Wikimedia projects, on the hub. Not all tested. • 19 items • Updated 16 days ago • 9
Article ⚗️ 🧑🏼🌾 Let's grow some Domain Specific Datasets together By burtenshaw • Apr 29 • 27
Article Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data By Pclanglais • Apr 18 • 20
Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies Paper • 2404.08197 • Published Apr 12 • 26
Article Policy Questions Blog 1: AI Data Transparency Remarks for NAIAC Panel 📚🔍⚖️ By yjernite • Mar 27 • 1
Creación de corpus en comunidad Collection A collection of collaborative efforts to build high-quality Spanish corpora. Any Spanish speaker can contribute :) • 7 items • Updated 25 days ago • 6
Common Corpus Collection The largest public domain dataset for training LLMs. • 26 items • Updated Mar 20 • 103
Chronos Models Collection Chronos: Pretrained (language) models for time series forecasting based on the T5 architecture. • 6 items • Updated Mar 18 • 25
MetricX-23 Collection A collection of MetricX-23 models (https://aclanthology.org/2023.wmt-1.63/) • 6 items • Updated 17 days ago • 13
Awesome Document AI Collection A collection of open-source document AI 📄 📝 📈 • 27 items • Updated Mar 11 • 38
🇮🇹 Italian NLP Resources Collection Collection of models, datasets and demos relevant to Italian NLP 🇮🇹 • 182 items • Updated about 17 hours ago • 18
Sora Reference Papers Collection A collection of all papers referenced in OpenAI's "Video generation models as world simulators" technical report • openai.com/sora • 30 items • Updated Feb 20 • 50
GritLM Collection Generative Representational Instruction Tuning (GRIT) • 64 items • Updated Apr 17 • 4
⛔️🔦 Provenance, Watermarking & Deepfake Detection Collection Technical tools for more control over non-consensual synthetic content • 14 items • Updated Apr 1 • 36
Historic Newspaper Datasets Collection Historic Newspaper Datasets on the Hub • 13 items • Updated 30 days ago • 3
🔍 Daily Picks in Interpretability & Analysis of LMs Collection Outstanding research in interpretability and evaluation of language models, summarized • 51 items • Updated about 17 hours ago • 54
WAVES Collection Benchmarking the Robustness of Image Watermarks. Under development. Data will be released soon. • 2 items • Updated Jan 24 • 2
Zeroshot Classifiers Collection These are my current best zeroshot classifiers. Some of my older models are downloaded more often, but the models in this collection are newer/better. • 11 items • Updated Apr 3 • 79
Korean Datasets I've released so far. Collection A collection of the Korean datasets I have uploaded so far. • 8 items • Updated 7 days ago • 14
Tulu V2 Suite Collection The set of models associated with the paper "Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2" • 19 items • Updated Feb 1 • 43
Custom Components ✨ Collection Awesome Gradio custom components to get you started building your own! • 7 items • Updated Nov 20, 2023 • 31
Reward models on the hub Collection UNMAINTAINED: See RewardBench... A place to collect reward models, an often-unreleased artifact of RLHF. • 18 items • Updated Apr 13 • 24
Medical QA Datasets Collection A collection of medical question answering (QA) datasets • 19 items • Updated Oct 31, 2023 • 16
Leaderboards and benchmarks ✨ Collection Cool leaderboard spaces collection for models across modalities! Text, vision, audio, ... • 62 items • Updated 11 days ago • 62
Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness Paper • 2302.10893 • Published Feb 7, 2023 • 5
AI Ethics projects in Spanish Collection Datasets, models and spaces related to hate speech detection and bias evaluation in Spanish. • 17 items • Updated Apr 13 • 6
Resources: Bias, Stereotypes, and Representational Harms Collection Linking collected resources for this category that have a dataset, model, or demo on Hugging Face or a paper on ArXiv (linked through Hugging Face) • 20 items • Updated Feb 17 • 1
Sourced from Wikimedia Collection Wikimedia collections, e.g. Wikipedia, are heavily used in ML research. This collection highlights some prominent examples of these datasets. • 9 items • Updated 3 days ago • 2
DIY AI For Journalists Collection Compiling resources useful for journalists building prototypes with AI • 8 items • Updated Sep 18, 2023 • 10
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks Paper • 2309.17410 • Published Sep 29, 2023 • 4
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models Paper • 2309.01219 • Published Sep 3, 2023 • 2
Domain specific data and model documentation Collection There is a growing number of datasheets or model card frameworks being proposed for particular domains. This collection tries to capture some of these • 5 items • Updated Oct 5, 2023 • 2
Domain specific data and model documentation Collection There is a growing number of datasheets or model card frameworks being proposed for particular domains. This collection tries to capture some of these • 6 items • Updated Oct 5, 2023 • 1
Christopher Collection You can find the best of Christopher's work here • 13 items • Updated Nov 8, 2023 • 1
🗳️ AI for Policymakers Collection AI systems have much to offer to policymakers, both as a tool to support their work and as a technology that can improve access to public services. • 13 items • Updated Mar 8 • 7