Collections
Discover the best community collections!
Collections including paper arxiv:2408.12569
-
PDFTriage: Question Answering over Long, Structured Documents
Paper • 2309.08872 • Published • 54 -
Adapting Large Language Models via Reading Comprehension
Paper • 2309.09530 • Published • 77 -
Table-GPT: Table-tuned GPT for Diverse Table Tasks
Paper • 2310.09263 • Published • 40 -
Context-Aware Meta-Learning
Paper • 2310.10971 • Published • 17
-
What matters when building vision-language models?
Paper • 2405.02246 • Published • 102 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 88 -
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Paper • 2407.03320 • Published • 95 -
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 126
-
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 51 -
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Paper • 2406.09406 • Published • 15 -
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Paper • 2406.10227 • Published • 9 -
What If We Recaption Billions of Web Images with LLaMA-3?
Paper • 2406.08478 • Published • 40
-
Depth Anything V2
Paper • 2406.09414 • Published • 97 -
Controllable Text Generation for Large Language Models: A Survey
Paper • 2408.12599 • Published • 65 -
Sapiens: Foundation for Human Vision Models
Paper • 2408.12569 • Published • 90 -
Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise
Paper • 2410.03017 • Published • 27
-
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Paper • 2405.08748 • Published • 24 -
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper • 2405.10300 • Published • 29 -
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper • 2405.09818 • Published • 131 -
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Paper • 2405.11143 • Published • 38
-
LocalMamba: Visual State Space Model with Windowed Selective Scan
Paper • 2403.09338 • Published • 9 -
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Paper • 2403.09394 • Published • 27 -
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Paper • 2402.19479 • Published • 34 -
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper • 2405.10300 • Published • 29