PANCAKE-Qwen3-VL-8B

PANCAKE: Purpose And Context Activate Knowledge Efficiently

A fine-tuned vision-language model for geometric problem-solving, trained with a structured Chain-of-Thought methodology on top of Qwen3-VL-8B.

Overview

PANCAKE introduces a geometry-specific, stage-wise reasoning framework that organizes problem-solving into three structured stages:

Purpose — Explicitly identifies the core objective of the problem in a single sentence, guiding the model to establish a clear solution trajectory.
Description — Extracts and articulates essential visual information from the image (numerical values, geometric properties, coordinates).
Think — Performs logical derivation based on Purpose and Description, executing sequential reasoning steps to arrive at the final answer.

This structured CoT approach is trained via a two-stage pipeline:

Stage 1: Supervised Fine-Tuning (SFT) on high-quality PANCAKE-format data generated by Gemini-2.5-Pro
Stage 2: Direct Preference Optimization (DPO), using the SFT model's incorrect outputs as rejected samples
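
For readers unfamiliar with Stage 2, the standard DPO objective for a single preference pair can be sketched as follows. This is the generic textbook form, not code from the PANCAKE training run: the function name, the sequence-level log-probability inputs, and `beta=0.1` are all assumptions made for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (sequence-level log-probs).

    beta=0.1 is an assumed value; the model card does not report it.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls below log 2 once the policy prefers the chosen response more strongly than the reference model does, and rises above it in the opposite case.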

Performance

Geometry3K Benchmark

Method	Accuracy (%)
PANCAKE (DPO) — Ours	70.0
Inter-GPS	57.5
Intern-S1	52.3
Qwen3-VL-8B (Think-only baseline)	53.6

Ablation: PANCAKE Component Contribution (Geometry3K)

Configuration	Accuracy (%)	Improvement
Baseline (Think only)	53.6	—
+ Description	62.5	+8.9 pp
+ Purpose + Description (PANCAKE)	66.7	+13.1 pp
PANCAKE (DPO)	70.0	+16.4 pp
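
The Improvement column is the absolute gain over the Think-only baseline, in percentage points. A minimal sketch for recomputing it from the accuracy column is shown below; the variable and function names are illustrative only.

```python
BASELINE = 53.6  # Think-only baseline accuracy (%) on Geometry3K

accuracies = {
    "+ Description": 62.5,
    "+ Purpose + Description (PANCAKE)": 66.7,
    "PANCAKE (DPO)": 70.0,
}

def improvement_pp(accuracy, baseline=BASELINE):
    """Absolute gain over the baseline, in percentage points."""
    # Round to one decimal to match the precision of the table.
    return round(accuracy - baseline, 1)

improvements = {name: improvement_pp(acc) for name, acc in accuracies.items()}
for name, delta in improvements.items():
    print(f"{name}: +{delta} pp")
```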

Structured Reasoning vs. Token Length (Geometry3K)

Method	Avg Tokens	Accuracy (%)
PANCAKE	~490	66.7
Long-Think (token-matched)	~490	57.5

At a matched budget of roughly 490 tokens, PANCAKE (66.7%) outperforms the unstructured Long-Think baseline (57.5%) by 9.2 percentage points, indicating that the gains stem from the structured reasoning design rather than from token count alone.

UniGeo Generalization

Method	Accuracy (%)
PANCAKE (DPO) — Ours	79.0
GOLD	75.2
GAPS	67.8
PANCAKE (SFT)	78.1

Method

PANCAKE Data Format

Each training sample consists of three reasoning components generated by Gemini-2.5-Pro (Purpose, Description, Think), followed by the final answer:

Purpose: This problem is designed to test the ability to identify ...
Description: The image shows a large triangle ... The vertical side is segmented ...
Think: The goal is to find m∠3 ... subtracting 164° from 180° gives m∠3 = 16°.
Answer: 16
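
A response in this format can be split back into its fields with a small parser. The sketch below assumes each label starts its own line exactly as shown above; `parse_pancake` is a hypothetical helper, and the sample text is invented for illustration rather than taken from the training data.

```python
import re

FIELDS = ("Purpose", "Description", "Think", "Answer")

def parse_pancake(text):
    """Split a PANCAKE-format response into its labeled fields."""
    # Match each label at the start of a line, then capture lazily up to
    # the next label (or the end of the text).
    pattern = re.compile(
        r"^(Purpose|Description|Think|Answer):\s*(.*?)"
        r"(?=^(?:Purpose|Description|Think|Answer):|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    parsed = {label: body.strip() for label, body in pattern.findall(text)}
    missing = [field for field in FIELDS if field not in parsed]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return parsed

# Invented sample, for illustration only.
sample = """Purpose: Find the measure of angle 3.
Description: The image shows a triangle with two known angles.
Think: The known angles sum to 164°, so 180° minus 164° gives 16°.
Answer: 16"""

fields = parse_pancake(sample)
print(fields["Answer"])
```

Raising on a missing field, rather than returning a partial result, makes malformed generations easy to filter out before training or scoring.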

Training Pipeline

Data Synthesis: Gemini-2.5-Pro generates Purpose → Description → Think responses for Geometry3K problems. Samples are iteratively generated until the predicted answer matches the ground truth.
SFT: Qwen3-VL-8B is fine-tuned on PANCAKE data using LoRA on an RTX A6000 GPU.
DPO: Preference pairs are constructed with the PANCAKE data as the chosen responses and the SFT model's incorrect responses as the rejected ones. DPO is then applied to reinforce correct logical pathways.
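
The data-synthesis loop and the preference-pair construction can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the generator callables, the `max_attempts` cap, exact string matching of answers, the input field names, and the `prompt`/`chosen`/`rejected` keys are all hypothetical choices.

```python
def extract_answer(response):
    """Return the text after the last 'Answer:' label, or None if absent."""
    _, sep, tail = response.rpartition("Answer:")
    return tail.strip() if sep else None

def synthesize_until_correct(generate, problem, ground_truth, max_attempts=5):
    """Call `generate` until its answer matches the ground truth, or give up.

    `generate` stands in for a call to the teacher model; the attempt cap
    is an assumption, since the card does not state one.
    """
    for _ in range(max_attempts):
        response = generate(problem)
        if extract_answer(response) == ground_truth.strip():
            return response
    return None

def build_preference_pairs(problems, sft_generate):
    """Pair each reference solution (chosen) with an incorrect SFT output (rejected)."""
    pairs = []
    for p in problems:
        candidate = sft_generate(p["question"])
        # Only incorrect SFT outputs are kept as rejected samples.
        if extract_answer(candidate) != p["answer"].strip():
            pairs.append({
                "prompt": p["question"],
                "chosen": p["pancake_response"],
                "rejected": candidate,
            })
    return pairs
```

Problems that the SFT model already answers correctly contribute no pair in this sketch, so the preference set concentrates on the model's actual failure cases.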

Base Model

Architecture: Qwen3-VL (8B parameters)
Fine-tuning method: LoRA (Low-Rank Adaptation)
Training hardware: NVIDIA RTX A6000
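
To give a sense of why LoRA makes fine-tuning feasible on a single GPU, the sketch below counts the trainable parameters LoRA adds to one square projection matrix. The hidden size of 4096 comes from the Model Details table; the rank of 16 is an assumed value, since the card does not report the LoRA hyperparameters.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns W + B @ A, with A of shape (rank, d_in) and B of shape (d_out, rank).
    return rank * (d_in + d_out)

HIDDEN = 4096  # hidden size from the Model Details table
RANK = 16      # assumed rank; not stated in the model card

full = HIDDEN * HIDDEN
lora = lora_param_count(HIDDEN, HIDDEN, RANK)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```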

Datasets

Geometry3K: 3,002 geometry problems from American high school math textbooks (grades 9–12). Split: 2,101 train / 300 validation / 601 test.
UniGeo (generalization evaluation): A large-scale high school geometry benchmark; only the calculation subset is used (3,499 train / 745 validation / 754 test).

Model Details

Property	Value
Base model	Qwen3-VL-8B
Architecture	Qwen3VLForConditionalGeneration
Parameters	~8B
dtype	float16
Hidden size	4096
Attention heads	32
KV heads	8
Hidden layers	36
Max position embeddings	262,144
Vision encoder hidden size	1,152
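
The attention and KV head counts above imply grouped-query attention, which reduces the key/value cache during generation. A rough per-token estimate for the language decoder is sketched below; the head dimension of 128 is an assumption (it is not listed in the table), and the vision encoder is ignored.

```python
LAYERS = 36          # hidden layers, from the table
ATTN_HEADS = 32      # attention heads, from the table
KV_HEADS = 8         # KV heads, from the table
HEAD_DIM = 128       # assumption: not listed in the table
BYTES_PER_VALUE = 2  # float16

def kv_cache_bytes(num_tokens, kv_heads=KV_HEADS):
    """Approximate KV-cache size in bytes for `num_tokens` cached tokens."""
    # Keys and values (factor 2) are stored per layer, per KV head, per token.
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE * num_tokens

per_token_gqa = kv_cache_bytes(1)
per_token_mha = kv_cache_bytes(1, kv_heads=ATTN_HEADS)
print(f"per token: {per_token_gqa:,} B with 8 KV heads vs {per_token_mha:,} B with 32")
```

Under these assumptions, sharing each KV head across four query heads cuts the cache to a quarter of what full multi-head attention would need.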

Citation

@article{pancake2025,
  title={PANCAKE: Purpose And Context Activate Knowledge Efficiently},
  author={Chae-Yun Jung and Yi Seung},
  year={2025},
  institution={St. Johnsbury Academy, Jeju, Korea; Asia Pacific International School, Seoul, Korea}
}

Authors

Chae-Yun Jung — St. Johnsbury Academy, Jeju, Korea
Yi Seung — Asia Pacific International School, Seoul, Korea
