Models
Datasets
Spaces
Posts
Docs
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2407.04172

Vision-Language

EVLM: An Efficient Vision-Language Model for Visual Understanding

Paper • 2407.14177 • Published Jul 19 • 42
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

Paper • 2407.04172 • Published Jul 4 • 22
facebook/chameleon-7b

Image-Text-to-Text • Updated Jul 23 • 20.5k • 166
vidore/colpali

Updated Sep 27 • 30.7k • 380

Papers - Image - Charts

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Paper • 2406.18521 • Published Jun 26 • 28
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

Paper • 2407.04172 • Published Jul 4 • 22

Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

Paper • 2406.17294 • Published Jun 25 • 10
TokenPacker: Efficient Visual Projector for Multimodal LLM

Paper • 2407.02392 • Published Jul 2 • 21
Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Paper • 2407.02477 • Published Jul 2 • 21
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Paper • 2407.03320 • Published Jul 3 • 92

Papers - Image - Charts - QA - Reasoning

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Paper • 2406.18521 • Published Jun 26 • 28
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Paper • 2407.01284 • Published Jul 1 • 75
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

Paper • 2407.04172 • Published Jul 4 • 22

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24 • 12
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24 • 53
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27 • 85
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27 • 31

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Paper • 2405.09818 • Published May 16 • 126
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27 • 31
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Paper • 2406.02430 • Published Jun 4 • 30
An Image is Worth 32 Tokens for Reconstruction and Generation

Paper • 2406.07550 • Published Jun 11 • 55

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Paper • 2306.17107 • Published Jun 29, 2023 • 11
On the Hidden Mystery of OCR in Large Multimodal Models

Paper • 2305.07895 • Published May 13, 2023
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Paper • 2308.12966 • Published Aug 24, 2023 • 7
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Paper • 2401.15947 • Published Jan 29 • 49

MoEs papers reading list

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Paper • 1701.06538 • Published Jan 23, 2017 • 5
Sparse Networks from Scratch: Faster Training without Losing Performance

Paper • 1907.04840 • Published Jul 10, 2019 • 3
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Paper • 1910.02054 • Published Oct 4, 2019 • 4
A Mixture of h-1 Heads is Better than h Heads

Paper • 2005.06537 • Published May 13, 2020 • 2

Vision Language Models

UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework

Paper • 2311.10125 • Published Nov 16, 2023 • 4
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

Paper • 2407.04172 • Published Jul 4 • 22

Company

© Hugging Face

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs