che111's Collections: interesting paper
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization (arXiv:2311.06243, 17 upvotes)
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores (arXiv:2311.05908, 11 upvotes)
PolyMaX: General Dense Prediction with Mask Transformer (arXiv:2311.05770, 6 upvotes)
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models (arXiv:2311.07575, 9 upvotes)
GLaMM: Pixel Grounding Large Multimodal Model (arXiv:2311.03356, 31 upvotes)
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding (arXiv:2311.03354, 4 upvotes)
Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency (arXiv:2311.02772, 3 upvotes)
CogVLM: Visual Expert for Pretrained Language Models (arXiv:2311.03079, 18 upvotes)
DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model (arXiv:2306.01736, 1 upvote)
Exploring the Boundaries of GPT-4 in Radiology (arXiv:2310.14573, 7 upvotes)
SelfEval: Leveraging the discriminative nature of generative models for evaluation (arXiv:2311.10708, 14 upvotes)
UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework (arXiv:2311.10125, 4 upvotes)
segmind/SSD-1B (Text-to-Image model; 81.8k downloads, 759 likes)
HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models (arXiv:2312.00079, 14 upvotes)
arXiv:2312.02149 (4 upvotes)
Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis (arXiv:2312.03491, 34 upvotes)
Self-conditioned Image Generation via Generating Representations (arXiv:2312.03701, 6 upvotes)
Language-Informed Visual Concept Learning (arXiv:2312.03587, 4 upvotes)
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want (arXiv:2312.03818, 31 upvotes)
Scaling Laws of Synthetic Images for Model Training ... for Now (arXiv:2312.04567, 7 upvotes)
Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss (arXiv:2401.02677, 21 upvotes)
Denoising Vision Transformers (arXiv:2401.02957, 26 upvotes)
Transformers are Multi-State RNNs (arXiv:2401.06104, 33 upvotes)
TOFU: A Task of Fictitious Unlearning for LLMs (arXiv:2401.06121, 14 upvotes)
Patchscope: A Unifying Framework for Inspecting Hidden Representations of Language Models (arXiv:2401.06102, 18 upvotes)
LEGO: Language Enhanced Multi-modal Grounding Model (arXiv:2401.06071, 10 upvotes)
Distilling Vision-Language Models on Millions of Videos (arXiv:2401.06129, 13 upvotes)
Towards Conversational Diagnostic AI (arXiv:2401.05654, 13 upvotes)
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (arXiv:2401.10774, 50 upvotes)
CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation (arXiv:2401.12208, 20 upvotes)
MM-LLMs: Recent Advances in MultiModal Large Language Models (arXiv:2401.13601, 41 upvotes)
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities (arXiv:2401.14405, 11 upvotes)
Rethinking Patch Dependence for Masked Autoencoders (arXiv:2401.14391, 22 upvotes)
Deconstructing Denoising Diffusion Models for Self-Supervised Learning (arXiv:2401.14404, 16 upvotes)
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models (arXiv:2401.15947, 46 upvotes)
More Agents Is All You Need (arXiv:2402.05120, 46 upvotes)
Memory Consolidation Enables Long-Context Video Understanding (arXiv:2402.05861, 7 upvotes)
InstaGen: Enhancing Object Detection by Training on Synthetic Dataset (arXiv:2402.05937, 8 upvotes)
An Interactive Agent Foundation Model (arXiv:2402.05929, 24 upvotes)
Question Aware Vision Transformer for Multimodal Reasoning (arXiv:2402.05472, 5 upvotes)
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts (arXiv:2402.13220, 12 upvotes)
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis (arXiv:2402.14797, 18 upvotes)
VideoPrism: A Foundational Visual Encoder for Video Understanding (arXiv:2402.13217, 18 upvotes)
TinyLLaVA: A Framework of Small-scale Large Multimodal Models (arXiv:2402.14289, 16 upvotes)
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (arXiv:2403.18814, 37 upvotes)
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding (arXiv:2404.05726, 18 upvotes)
Koala: Key frame-conditioned long video-LLM (arXiv:2404.04346, 5 upvotes)
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (arXiv:2404.05961, 61 upvotes)
BLINK: Multimodal Large Language Models Can See but Not Perceive (arXiv:2404.12390, 23 upvotes)