MAOAM: Unified Object and Material Selection with Vision-Language Models

MAOAM (Mask Any Object And Material) is a unified selection framework that enables precise object- and material-level selection across both text- and click-based interactions. It leverages a Vision-Language Model (VLM) with a segmentation head to produce pixel-accurate masks from user prompts.

Project Page | Paper (arXiv) | Code

Features

Unified Selection: Supports both object-centric (e.g., "the chair") and material-centric (e.g., "the wood") selection.
Multi-modal Interaction: Handles user input via natural language descriptions, point clicks, or a combination of both.
High Precision: Pairs a VLM backbone (GLaMM or Sa2VA) with a segmentation head (SAM or SAM 2) to decode visual semantics into pixel-accurate boundaries.
Robust Material Understanding: Trained on data from a novel pipeline that combines real and synthetic images with rich material-specific textual descriptions and VQA tasks.
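
To make the three interaction modes listed above concrete, here is a minimal sketch of how a prompt combining text and clicks could be represented. `SelectionPrompt`, its fields, and the `mode` labels are hypothetical names chosen for illustration; they are not taken from the MAOAM codebase, whose actual input format may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates of a click


@dataclass(frozen=True)
class SelectionPrompt:
    """Hypothetical prompt container; NOT part of the official MAOAM API."""

    text: Optional[str] = None       # e.g. "the chair" or "the wood"
    clicks: Tuple[Point, ...] = ()   # zero or more point clicks

    def __post_init__(self) -> None:
        # A prompt must carry at least one modality.
        if not self._has_text() and not self.clicks:
            raise ValueError("Provide a text description, a click, or both.")

    def _has_text(self) -> bool:
        return bool(self.text and self.text.strip())

    @property
    def mode(self) -> str:
        """Return which interaction mode this prompt uses."""
        if self._has_text() and self.clicks:
            return "text+click"
        return "text" if self._has_text() else "click"
```

For example, `SelectionPrompt(text="the wood")` is a text-only, material-centric prompt, while `SelectionPrompt(text="the chair", clicks=((120, 80),))` combines both modalities.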

Model Variants

The official implementation supports two backbones:

MAOAM-GLaMM: Based on LLaVA-Llama (GLaMM) and SAM ViT-H.
MAOAM-Sa2VA: Based on Qwen2.5-VL-7B and SAM2 Hiera-L.
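
The two variants above can be summarized as a small lookup table, which may be convenient when scripting around them. This is only an illustrative sketch: `Variant`, `VARIANTS`, and `describe` are hypothetical names, and the official repository may organize its configuration differently.

```python
from typing import Dict, NamedTuple


class Variant(NamedTuple):
    backbone: str            # VLM backbone named in this README
    segmentation_head: str   # segmentation model named in this README


# Hypothetical registry; the values are copied from the list above.
VARIANTS: Dict[str, Variant] = {
    "MAOAM-GLaMM": Variant("LLaVA-Llama (GLaMM)", "SAM ViT-H"),
    "MAOAM-Sa2VA": Variant("Qwen2.5-VL-7B", "SAM2 Hiera-L"),
}


def describe(name: str) -> str:
    """Return a one-line summary of a variant, or raise on an unknown name."""
    try:
        variant = VARIANTS[name]
    except KeyError:
        known = ", ".join(sorted(VARIANTS))
        raise ValueError(f"Unknown variant {name!r}; expected one of: {known}") from None
    return f"{name}: {variant.backbone} + {variant.segmentation_head}"
```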

Usage

For detailed instructions on setup, downloading weights, and running the Gradio demo, please refer to the official GitHub repository.
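
As a starting point before following the repository's own instructions, the checkpoint files can likely be fetched with the `huggingface_hub` library (`pip install huggingface_hub`). The sketch below assumes the repository ID shown on this page, `jpark677/maoam_ckpts`; the `download_checkpoints` helper and the `checkpoints/maoam` target folder are hypothetical, and the directory layout the official code expects is not documented here, so check the GitHub instructions before relying on it.

```python
from pathlib import Path

REPO_ID = "jpark677/maoam_ckpts"  # repository ID shown on this model page


def download_checkpoints(local_dir: str = "checkpoints/maoam") -> Path:
    """Download all files in the checkpoint repo and return the local folder.

    The default local_dir is an assumption made for this example only.
    """
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
    return Path(path)
```

Calling `download_checkpoints()` downloads the full repository snapshot, which may be large; `snapshot_download` also accepts an `allow_patterns` argument if only some files are needed.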

Citation

@inproceedings{park2026maoam,
  title     = {MAOAM: Unified Object and Material Selection with Vision-Language Models},
  author    = {Park, Jaden and Deschaintre, Valentin and Kuen, Jason and
               Liu, Kangning and Georgiev, Iliyan and Singh, Krishna Kumar and
               Lee, Yong Jae and Fischer, Michael},
  booktitle = {ACM SIGGRAPH 2026 Conference Papers},
  year      = {2026},
  publisher = {ACM},
  doi       = {10.1145/3799902.3811186},
}

Paper for jpark677/maoam_ckpts: "MAOAM: Unified Object and Material Selection with Vision-Language Models" (arXiv:2606.04880)