M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition Paper • 2401.11649 • Published Jan 22 • 3
MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation Paper • 2404.11565 • Published Apr 17 • 12
Article Introducing Idefics2: A Powerful 8B Vision-Language Model for the community • Apr 15 • 126
SigLIP Collection Contrastive (sigmoid) image-text models from https://arxiv.org/abs/2303.15343 • 8 items • Updated 5 days ago • 24
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD Paper • 2404.06512 • Published Apr 9 • 29
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models Paper • 2401.15947 • Published Jan 29 • 46
MM-LLMs: Recent Advances in MultiModal Large Language Models Paper • 2401.13601 • Published Jan 24 • 41
Mamba: Linear-Time Sequence Modeling with Selective State Spaces Paper • 2312.00752 • Published Dec 1, 2023 • 131
DocLLM: A layout-aware generative language model for multimodal document understanding Paper • 2401.00908 • Published Dec 31, 2023 • 173
FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline Paper • 2311.13073 • Published Nov 22, 2023 • 53