Models
Datasets
Spaces
Posts
Docs
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2406.11832

Papers - Attention - Cross

Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

Paper • 2403.12943 • Published Mar 19 • 14
Masked Audio Generation using a Single Non-Autoregressive Transformer

Paper • 2401.04577 • Published Jan 9 • 42
Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models

Paper • 2404.02747 • Published Apr 3 • 11
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation

Paper • 2404.02733 • Published Apr 3 • 20

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6 • 25
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6 • 12
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7 • 38
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7 • 19

(Papers) Multimodal

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 47
Visual Instruction Tuning

Paper • 2304.08485 • Published Apr 17, 2023 • 13
Improved Baselines with Visual Instruction Tuning

Paper • 2310.03744 • Published Oct 5, 2023 • 37
Making Large Multimodal Models Understand Arbitrary Visual Prompts

Paper • 2312.00784 • Published Dec 1, 2023 • 2

Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

Paper • 2406.17294 • Published Jun 25 • 10
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Paper • 2406.19389 • Published Jun 27 • 51
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Paper • 2406.20076 • Published Jun 28 • 8
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

Paper • 2407.02869 • Published Jul 3 • 18

vision language models (VLM)

PaliGemma: A versatile 3B VLM for transfer

Paper • 2407.07726 • Published Jul 10 • 67
Vision language models are blind

Paper • 2407.06581 • Published Jul 9 • 82
CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging

Paper • 2407.07315 • Published Jul 10 • 6
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Paper • 2407.06189 • Published Jul 8 • 24

Papers - Multimodal - Training - LLM Guided Pre-training

Unveiling Encoder-Free Vision-Language Models

Paper • 2406.11832 • Published Jun 17 • 49

Papers - Multimodal - Training - Loss - Cross Entropy

Unveiling Encoder-Free Vision-Language Models

Paper • 2406.11832 • Published Jun 17 • 49

Papers - Attention - Decoder Only

Unveiling Encoder-Free Vision-Language Models

Paper • 2406.11832 • Published Jun 17 • 49

Papers - Multimodal - Training - Patch Aligning Layer

Unveiling Encoder-Free Vision-Language Models

Paper • 2406.11832 • Published Jun 17 • 49

Papers - Multimodal - Training - Decoder Only

Unveiling Encoder-Free Vision-Language Models

Paper • 2406.11832 • Published Jun 17 • 49

Previous
1
2
3
Next

Company

© Hugging Face

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs