- Prompt me a Dataset: An investigation of text-image prompting for historical image dataset creation using foundation models
  Paper • 2309.01674 • Published • 2
- Segment Anything
  Paper • 2304.02643 • Published • 2
- EgoLifter: Open-world 3D Segmentation for Egocentric Perception
  Paper • 2403.18118 • Published • 9
- A Multimodal Automated Interpretability Agent
  Paper • 2404.14394 • Published • 20
Collections including paper arxiv:2404.14394
- Demystifying CLIP Data
  Paper • 2309.16671 • Published • 19
- Model Stock: All we need is just a few fine-tuned models
  Paper • 2403.19522 • Published • 10
- Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
  Paper • 2404.01367 • Published • 19
- On the Scalability of Diffusion-based Text-to-Image Generation
  Paper • 2404.02883 • Published • 17
- Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology
  Paper • 2203.00585 • Published • 2
- Emerging Properties in Self-Supervised Vision Transformers
  Paper • 2104.14294 • Published • 3
- DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting
  Paper • 2404.06903 • Published • 17
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
  Paper • 2404.07973 • Published • 30
- Wide Residual Networks
  Paper • 1605.07146 • Published • 2
- Characterizing signal propagation to close the performance gap in unnormalized ResNets
  Paper • 2101.08692 • Published • 2
- Pareto-Optimal Quantized ResNet Is Mostly 4-bit
  Paper • 2105.03536 • Published • 2
- When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
  Paper • 2106.01548 • Published • 2
- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 25
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 12
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 36
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 19
- LLM Augmented LLMs: Expanding Capabilities through Composition
  Paper • 2401.02412 • Published • 36
- LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration
  Paper • 2402.11550 • Published • 15
- A Multimodal Automated Interpretability Agent
  Paper • 2404.14394 • Published • 20