matlok's Collections
Papers - Image
FaceChain-SuDe: Building Derived Class to Inherit Category Attributes
for One-shot Subject-Driven Generation
Paper
•
2403.06775
•
Published
•
3
An Image is Worth 16x16 Words: Transformers for Image Recognition at
Scale
Paper
•
2010.11929
•
Published
•
6
Data Incubation -- Synthesizing Missing Data for Handwriting Recognition
Paper
•
2110.07040
•
Published
•
2
A Mixture of Expert Approach for Low-Cost Customization of Deep Neural
Networks
Paper
•
1811.00056
•
Published
•
2
Data Generation for Post-OCR correction of Cyrillic handwriting
Paper
•
2311.15896
•
Published
•
3
Character Queries: A Transformer-based Approach to On-Line Handwritten
Character Segmentation
Paper
•
2309.03072
•
Published
•
2
Densely Connected Convolutional Networks
Paper
•
1608.06993
•
Published
•
3
BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage
Models
Paper
•
2003.11142
•
Published
•
2
U-Net: Convolutional Networks for Biomedical Image Segmentation
Paper
•
1505.04597
•
Published
•
8
Image Segmentation using U-Net Architecture for Powder X-ray Diffraction
Images
Paper
•
2310.16186
•
Published
•
2
RTSeg: Real-time Semantic Segmentation Comparative Study
Paper
•
1803.02758
•
Published
•
2
Generalizability vs. Robustness: Adversarial Examples for Medical
Imaging
Paper
•
1804.00504
•
Published
•
2
Hierarchical multi-class segmentation of glioma images using networks
with multi-level activation function
Paper
•
1810.09488
•
Published
•
2
IVD-Net: Intervertebral disc localization and segmentation in MRI with a
multi-modal UNet
Paper
•
1811.08305
•
Published
•
2
A multi-path 2.5 dimensional convolutional neural network system for
segmenting stroke lesions in brain MRI images
Paper
•
1905.10835
•
Published
•
2
Enforcing temporal consistency in Deep Learning segmentation of brain MR
images
Paper
•
1906.07160
•
Published
•
3
Bias Loss for Mobile Neural Networks
Paper
•
2107.11170
•
Published
•
2
Skip-Connected Neural Networks with Layout Graphs for Floor Plan
Auto-Generation
Paper
•
2309.13881
•
Published
•
2
Inter-Scale Dependency Modeling for Skin Lesion Segmentation with
Transformer-based Networks
Paper
•
2310.13727
•
Published
•
2
Latent Diffusion Model for Medical Image Standardization and Enhancement
Paper
•
2310.05237
•
Published
•
2
3D Medical Image Segmentation based on multi-scale MPU-Net
Paper
•
2307.05799
•
Published
•
2
Self-Supervised U-Net for Segmenting Flat and Sessile Polyps
Paper
•
2110.08776
•
Published
•
2
Enforcing Morphological Information in Fully Convolutional Networks to
Improve Cell Instance Segmentation in Fluorescence Microscopy Images
Paper
•
2106.05843
•
Published
•
2
Saliency-Guided Deep Learning Network for Automatic Tumor Bed Volume
Delineation in Post-operative Breast Irradiation
Paper
•
2105.02771
•
Published
•
2
Qutrit-inspired Fully Self-supervised Shallow Quantum Learning Network
for Brain Tumor Segmentation
Paper
•
2009.06767
•
Published
•
2
The Effects of Image Pre- and Post-Processing, Wavelet Decomposition,
and Local Binary Patterns on U-Nets for Skin Lesion Segmentation
Paper
•
1805.05239
•
Published
•
2
A joint 3D UNet-Graph Neural Network-based method for Airway
Segmentation from chest CTs
Paper
•
1908.08588
•
Published
•
2
Joint Liver and Hepatic Lesion Segmentation in MRI using a Hybrid CNN
with Transformer Layers
Paper
•
2201.10981
•
Published
•
2
CSWin Transformer: A General Vision Transformer Backbone with
Cross-Shaped Windows
Paper
•
2107.00652
•
Published
•
2
2nd Place Solution to Google Landmark Recognition Competition 2021
Paper
•
2110.02638
•
Published
•
2
BOAT: Bilateral Local Attention Vision Transformer
Paper
•
2201.13027
•
Published
•
2
Long-tailed Recognition by Routing Diverse Distribution-Aware Experts
Paper
•
2010.01809
•
Published
•
2
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Paper
•
2103.14030
•
Published
•
4
A Novel Transformer Based Semantic Segmentation Scheme for
Fine-Resolution Remote Sensing Images
Paper
•
2104.12137
•
Published
•
2
Self-Supervised Learning with Swin Transformers
Paper
•
2105.04553
•
Published
•
2
Bootstrap your own latent: A new approach to self-supervised Learning
Paper
•
2006.07733
•
Published
•
2
Evaluating Transformer-based Semantic Segmentation Networks for
Pathological Image Segmentation
Paper
•
2108.11993
•
Published
•
2
Using Multi-scale SwinTransformer-HTC with Data augmentation in CoNIC
Challenge
Paper
•
2202.13588
•
Published
•
2
From Modern CNNs to Vision Transformers: Assessing the Performance,
Robustness, and Classification Strategies of Deep Learning Models in
Histopathology
Paper
•
2204.05044
•
Published
•
2
Emerging Properties in Self-Supervised Vision Transformers
Paper
•
2104.14294
•
Published
•
3
GasHis-Transformer: A Multi-scale Visual Transformer Approach for
Gastric Histopathological Image Detection
Paper
•
2104.14528
•
Published
•
2
CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation
Paper
•
2401.12208
•
Published
•
22
Paper
•
2309.16671
•
Published
•
20
Vision Transformers Need Registers
Paper
•
2309.16588
•
Published
•
77
DAS: A Deformable Attention to Capture Salient Information in CNNs
Paper
•
2311.12091
•
Published
•
2
TANKER: Distributed Architecture for Named Entity Recognition and
Disambiguation
Paper
•
1708.09230
•
Published
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper
•
2403.09611
•
Published
•
124
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering
Paper
•
2403.09622
•
Published
•
16
VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision
Understanding
Paper
•
2403.09530
•
Published
•
8
LocalMamba: Visual State Space Model with Windowed Selective Scan
Paper
•
2403.09338
•
Published
•
7
GiT: Towards Generalist Vision Transformer through Universal Language
Interface
Paper
•
2403.09394
•
Published
•
25
Vision Transformer with Quadrangle Attention
Paper
•
2303.15105
•
Published
•
2
StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based
Semantic Control
Paper
•
2403.09055
•
Published
•
24
Language Grounded QFormer for Efficient Vision Language Understanding
Paper
•
2311.07449
•
Published
•
2
GLIDE: Towards Photorealistic Image Generation and Editing with
Text-Guided Diffusion Models
Paper
•
2112.10741
•
Published
•
3
Synthetic Shifts to Initial Seed Vector Exposes the Brittle Nature of
Latent-Based Diffusion Models
Paper
•
2312.11473
•
Published
•
2
Lightweight Image Inpainting by Stripe Window Transformer with Joint
Attention to CNN
Paper
•
2301.00553
•
Published
•
2
Semi-Supervised Semantic Segmentation using Redesigned Self-Training for
White Blood Cells
Paper
•
2401.07278
•
Published
•
2
Flamingo: a Visual Language Model for Few-Shot Learning
Paper
•
2204.14198
•
Published
•
14
VideoAgent: Long-form Video Understanding with Large Language Model as
Agent
Paper
•
2403.10517
•
Published
•
32
LightIt: Illumination Modeling and Control for Diffusion Models
Paper
•
2403.10615
•
Published
•
16
Generic 3D Diffusion Adapter Using Controlled Multi-View Editing
Paper
•
2403.12032
•
Published
•
14
MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data
Paper
•
2403.11207
•
Published
•
14
AnimateDiff-Lightning: Cross-Model Diffusion Distillation
Paper
•
2403.12706
•
Published
•
17
FouriScale: A Frequency Perspective on Training-Free High-Resolution
Image Synthesis
Paper
•
2403.12963
•
Published
•
7
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper
•
2403.11703
•
Published
•
16
TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation
Paper
•
2403.12906
•
Published
•
5
Towards 3D Molecule-Text Interpretation in Language Models
Paper
•
2401.13923
•
Published
•
9
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal
Large Language Models
Paper
•
2403.13447
•
Published
•
18
MyVLM: Personalizing VLMs for User-Specific Queries
Paper
•
2403.14599
•
Published
•
15
S2LIC: Learned Image Compression with the SwinV2 Block, Adaptive
Channel-wise and Global-inter Attention Context
Paper
•
2403.14471
•
Published
•
2
DepthFM: Fast Monocular Depth Estimation with Flow Matching
Paper
•
2403.13788
•
Published
•
17
SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
Paper
•
2403.16627
•
Published
•
20
FlashFace: Human Image Personalization with High-fidelity Identity
Preservation
Paper
•
2403.17008
•
Published
•
19
Prompt me a Dataset: An investigation of text-image prompting for
historical image dataset creation using foundation models
Paper
•
2309.01674
•
Published
•
2
Paper
•
2304.02643
•
Published
•
3
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for
Vision-Language Few-Shot Prompting
Paper
•
2210.07179
•
Published
•
3
DreamPolisher: Towards High-Quality Text-to-3D Generation via Geometric
Diffusion
Paper
•
2403.17237
•
Published
•
9
One-step Diffusion with Distribution Matching Distillation
Paper
•
2311.18828
•
Published
•
3
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
Paper
•
1801.03924
•
Published
•
2
ViTAR: Vision Transformer with Any Resolution
Paper
•
2403.18361
•
Published
•
52
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper
•
2404.01197
•
Published
•
30
Condition-Aware Neural Network for Controlled Image Generation
Paper
•
2404.01143
•
Published
•
11
Measuring Style Similarity in Diffusion Models
Paper
•
2404.01292
•
Published
•
16
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image
Generation
Paper
•
2404.02733
•
Published
•
20
Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion
Models
Paper
•
2404.02747
•
Published
•
11
Event Camera Demosaicing via Swin Transformer and Pixel-focus Loss
Paper
•
2404.02731
•
Published
•
1
PointInfinity: Resolution-Invariant Point Diffusion Models
Paper
•
2404.03566
•
Published
•
13
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept
Matching
Paper
•
2404.03653
•
Published
•
33
Learning Transferable Visual Models From Natural Language Supervision
Paper
•
2103.00020
•
Published
•
11
Prompt-to-Prompt Image Editing with Cross Attention Control
Paper
•
2208.01626
•
Published
•
2
DeViDe: Faceted medical knowledge for improved medical vision-language
pre-training
Paper
•
2404.03618
•
Published
•
2
OmniFusion Technical Report
Paper
•
2404.06212
•
Published
•
74
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and
Latent Diffusion
Paper
•
2310.03502
•
Published
•
77
Toward a Better Understanding of Fourier Neural Operators: Analysis and
Improvement from a Spectral Perspective
Paper
•
2404.07200
•
Published
•
1
Paper
•
2404.07821
•
Published
•
11
ConsistencyDet: Robust Object Detector with Denoising Paradigm of
Consistency Model
Paper
•
2404.07773
•
Published
•
1
ODA: Observation-Driven Agent for integrating LLMs and Knowledge Graphs
Paper
•
2404.07677
•
Published
•
1
Ferret-v2: An Improved Baseline for Referring and Grounding with Large
Language Models
Paper
•
2404.07973
•
Published
•
30
Text Role Classification in Scientific Charts Using Multimodal
Transformers
Paper
•
2402.14579
•
Published
•
1
Using Explainable AI and Transfer Learning to understand and predict the
maintenance of Atlantic blocking with limited observational data
Paper
•
2404.08613
•
Published
•
1
HSIDMamba: Exploring Bidirectional State-Space Models for Hyperspectral
Denoising
Paper
•
2404.09697
•
Published
•
1
Deformable MRI Sequence Registration for AI-based Prostate Cancer
Diagnosis
Paper
•
2404.09666
•
Published
•
1
Comprehensive Survey of Model Compression and Speed up for Vision
Transformers
Paper
•
2404.10407
•
Published
•
1
Explainable Lung Disease Classification from Chest X-Ray Images
Utilizing Deep Learning and XAI
Paper
•
2404.11428
•
Published
•
1
MoA: Mixture-of-Attention for Subject-Context Disentanglement in
Personalized Image Generation
Paper
•
2404.11565
•
Published
•
14
EdgeFusion: On-Device Text-to-Image Generation
Paper
•
2404.11925
•
Published
•
21
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
Paper
•
2401.18059
•
Published
•
36
GLIGEN: Open-Set Grounded Text-to-Image Generation
Paper
•
2301.07093
•
Published
•
3
Grounded Language-Image Pre-training
Paper
•
2112.03857
•
Published
•
3
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper
•
2404.12803
•
Published
•
29
Groma: Localized Visual Tokenization for Grounding Multimodal Large
Language Models
Paper
•
2404.13013
•
Published
•
30
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image
Synthesis
Paper
•
2404.13686
•
Published
•
27
Scene Coordinate Reconstruction: Posing of Image Collections via
Incremental Learning of a Relocalizer
Paper
•
2404.14351
•
Published
•
5
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Paper
•
2404.14239
•
Published
•
8
All you need is a good init
Paper
•
1511.06422
•
Published
•
1
Efficient Transformer Encoders for Mask2Former-style models
Paper
•
2404.15244
•
Published
•
1
Deep Residual Learning for Image Recognition
Paper
•
1512.03385
•
Published
•
6
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster
Pre-training on Web-scale Image-Text Data
Paper
•
2404.15653
•
Published
•
26
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video
Dense Captioning
Paper
•
2404.16994
•
Published
•
35
HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring
Unconstrained Photo Collections
Paper
•
2404.16845
•
Published
•
6
Stylus: Automatic Adapter Selection for Diffusion Models
Paper
•
2404.18928
•
Published
•
14
DOCCI: Descriptions of Connected and Contrasting Images
Paper
•
2404.19753
•
Published
•
11
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Paper
•
2404.18212
•
Published
•
27
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video
Generation
Paper
•
2405.01434
•
Published
•
52
Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
Paper
•
2405.16759
•
Published
•
7
Neural Autoregressive Distribution Estimation
Paper
•
1605.02226
•
Published
•
1
Autoregressive Model Beats Diffusion: Llama for Scalable Image
Generation
Paper
•
2406.06525
•
Published
•
65
Diffusion Models Beat GANs on Image Synthesis
Paper
•
2105.05233
•
Published
•
2
Zero-shot Image Editing with Reference Imitation
Paper
•
2406.07547
•
Published
•
31
VideoFACT: Detecting Video Forgeries Using Attention, Scene Context, and
Forensic Traces
Paper
•
2211.15775
•
Published
•
1
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Paper
•
2406.06911
•
Published
•
10
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation
in Videos
Paper
•
2406.08407
•
Published
•
24
DataComp: In search of the next generation of multimodal datasets
Paper
•
2304.14108
•
Published
•
2
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal
LLMs
Paper
•
2406.18521
•
Published
•
28
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and
Understanding
Paper
•
2406.19389
•
Published
•
52
Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept
Space
Paper
•
2406.19370
•
Published
•
1
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Paper
•
2406.17720
•
Published
•
7
We-Math: Does Your Large Multimodal Model Achieve Human-like
Mathematical Reasoning?
Paper
•
2407.01284
•
Published
•
75
No Training, No Problem: Rethinking Classifier-Free Guidance for
Diffusion Models
Paper
•
2407.02687
•
Published
•
22
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper
•
2407.02392
•
Published
•
21
InternLM-XComposer-2.5: A Versatile Large Vision Language Model
Supporting Long-Contextual Input and Output
Paper
•
2407.03320
•
Published
•
92
DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents
Paper
•
2407.03300
•
Published
•
11
Florence-2: Advancing a Unified Representation for a Variety of Vision
Tasks
Paper
•
2311.06242
•
Published
•
85
Unveiling Encoder-Free Vision-Language Models
Paper
•
2406.11832
•
Published
•
50
Vision language models are blind
Paper
•
2407.06581
•
Published
•
82
MAVIS: Mathematical Visual Instruction Tuning
Paper
•
2407.08739
•
Published
•
30
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
Instruction Using Language Model
Paper
•
2407.07053
•
Published
•
41
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
Paper
•
2312.04461
•
Published
•
57
Paper
•
2405.15932
•
Published
•
1
SAM 2: Segment Anything in Images and Videos
Paper
•
2408.00714
•
Published
•
108
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Paper
•
2403.03206
•
Published
•
59
LLaVA-OneVision: Easy Visual Task Transfer
Paper
•
2408.03326
•
Published
•
59
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal
Large Language Models
Paper
•
2408.04840
•
Published
•
32
Paper
•
2408.07009
•
Published
•
61
VideoGameBunny: Towards vision assistants for video games
Paper
•
2407.15295
•
Published
•
21
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper
•
2408.08872
•
Published
•
97
Equivariant Transformer Networks
Paper
•
1901.11399
•
Published
•
1
Law of Vision Representation in MLLMs
Paper
•
2408.16357
•
Published
•
92
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse
Autoencoders
Paper
•
2410.22366
•
Published
•
75
OmniGen: Unified Image Generation
Paper
•
2409.11340
•
Published
•
108
Randomized Autoregressive Visual Generation
Paper
•
2411.00776
•
Published
•
17
Analyzing The Language of Visual Tokens
Paper
•
2411.05001
•
Published
•
21
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Paper
•
2411.02327
•
Published
•
11
Smoothed Energy Guidance: Guiding Diffusion Models with Reduced Energy
Curvature of Attention
Paper
•
2408.00760
•
Published
•
6
Generalized Out-of-Distribution Detection and Beyond in Vision Language
Model Era: A Survey
Paper
•
2407.21794
•
Published
•
5
Building and better understanding vision-language models: insights and
future directions
Paper
•
2408.12637
•
Published
•
122
MagicQuill: An Intelligent Interactive Image Editing System
Paper
•
2411.09703
•
Published
•
56
BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed
Dual-Branch Diffusion
Paper
•
2403.06976
•
Published
•
2
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper
•
2411.10440
•
Published
•
107
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper
•
2411.14402
•
Published
•
40
Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
Paper
•
2303.08797
•
Published
•
1
DETRs Beat YOLOs on Real-time Object Detection
Paper
•
2304.08069
•
Published
•
12
RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time
Detection Transformer
Paper
•
2407.17140
•
Published
•
1
HAT: Hybrid Attention Transformer for Image Restoration
Paper
•
2309.05239
•
Published
•
1