mikewang committed
Commit c5af509
1 Parent(s): 55e5797

Update README.md

Files changed (1)
  1. README.md +9 -3
README.md CHANGED
@@ -3,12 +3,13 @@ license: apache-2.0
 datasets:
   - mikewang/PVD-160K
 ---
+
 <h1 align="center"> Text-Based Reasoning About Vector Graphics </h1>
 
 <p align="center">
-<a href="https://mikewangwzhl.github.io/VDLM/">🌐 Homepage</a>
+<a href="https://mikewangwzhl.github.io/VDLM">🌐 Homepage</a>
 
-<a href="">📃 Paper</a>
+<a href="">📃 Paper (Coming Soon)</a>
 
 <a href="https://huggingface.co/datasets/mikewang/PVD-160K" >🤗 Data (PVD-160k)</a>
 
@@ -18,6 +19,11 @@ datasets:
 
 </p>
 
-We propose **VDLM**, a text-based visual reasoning framework for vector graphics. VDLM operates on text-based visual descriptions—specifically, SVG representations and learned Primal Visual Descriptions (PVD), enabling zero-shot reasoning with an off-the-shelf LLM. We demonstrate that VDLM outperforms state-of-the-art large multimodal models, such as GPT-4V, across various multimodal reasoning tasks involving vector graphics. See our [paper]() for more details.
+
+We observe that current *large multimodal models (LMMs)* still struggle with seemingly straightforward reasoning tasks that require precise perception of low-level visual details, such as identifying spatial relations or solving simple mazes. In particular, this failure mode persists in question-answering tasks about vector graphics—images composed purely of 2D objects and shapes.
+
+![Teaser](https://github.com/MikeWangWZHL/VDLM/blob/main/figures/teaser.png?raw=true)
+
+To solve this challenge, we propose **Visually Descriptive Language Model (VDLM)**, a text-based visual reasoning framework for vector graphics. VDLM operates on text-based visual descriptions—specifically, SVG representations and learned Primal Visual Descriptions (PVD), enabling zero-shot reasoning with an off-the-shelf LLM. We demonstrate that VDLM outperforms state-of-the-art large multimodal models, such as GPT-4V, across various multimodal reasoning tasks involving vector graphics. See our [paper (coming soon)]() for more details.
 
 ![Overview](https://github.com/MikeWangWZHL/VDLM/blob/main/figures/overview.png?raw=true)
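
For readers of the updated README, the sketch below illustrates the pipeline it describes: a raster image is vectorized to SVG, a learned model maps the SVG to a Primal Visual Description (PVD), and an off-the-shelf LLM reasons over that text zero-shot. The helper names (`image_to_svg`, `svg_to_pvd`) and the prompt format are illustrative assumptions, not the VDLM codebase's actual API; only the overall flow follows the README.

```python
from typing import Callable


def image_to_svg(image_path: str) -> str:
    """Placeholder (assumed helper): vectorize a raster image into an SVG string."""
    raise NotImplementedError


def svg_to_pvd(svg: str) -> str:
    """Placeholder (assumed helper): map SVG paths to a text-based Primal Visual
    Description (primitive shapes, positions, colors)."""
    raise NotImplementedError


def answer_question(image_path: str, question: str, llm: Callable[[str], str]) -> str:
    """Zero-shot reasoning: the off-the-shelf LLM only ever sees text."""
    svg = image_to_svg(image_path)
    pvd = svg_to_pvd(svg)
    prompt = (
        "You are given a text description of a vector graphics image.\n"
        f"Description:\n{pvd}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm(prompt)
```

Because the LLM receives only the PVD text, any sufficiently capable text-only model can be swapped in without retraining the perception stages.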