title: Denali AI
short_description: VLMs for Garment Attribute Extraction

Denali AI: Vision-Language Models for Garment Classification

Advancing structured attribute extraction from garment images through multi-stage reinforcement learning



Abstract

Denali AI develops and benchmarks vision-language models (VLMs) for structured garment attribute extraction, the task of analyzing a garment image and producing a complete JSON object that describes nine key attributes: type, color, pattern, neckline, sleeve length, closure, brand, size, and defect type.

We systematically evaluate the impact of supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO), and Group-relative Trajectory-based Policy Optimization (GTPO) across multiple model architectures (Qwen3-VL, Qwen3.5-VL, InternVL3, Florence-2, Moondream2, Phi-4) and scales (1.6B to 122B parameters). Our best model, Qwen3-VL-8B SFT+GRPO, achieves 91.3% weighted score with 100% JSON parse rate on the eval_hard_3500 benchmark.


Leaderboard

[Figure: Model Leaderboard]

| Rank | Model | Architecture | Params | Training | Weighted | SBERT+NLI | JSON Parse | Throughput |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3-VL-8B SFT+GRPO | Qwen3-VL | 8B | SFT+GRPO | 91.3% | 78.7% | 100% | 7.5/s |
| 2 | Qwen3-VL-2B-SFT-GRPO-v9 | Qwen3-VL | 2B | SFT+GRPO | 89.5% | 78.5% | 100% | 15.9/s |
| 3 | Qwen3-VL-8B SFT+GRPO NVFP4 | Qwen3-VL | 8B | SFT+GRPO | 89.5% | 77.0% | 100% | 12.1/s |
| 4 | Qwen3-VL-8B-Instruct-Base | Qwen3-VL | 8B | Zero-shot | 87.5% | 75.6% | 100% | 5.5/s |
| 5 | Qwen3-VL-8B-Instruct NVFP4 | Qwen3-VL | 8B | Zero-shot | 87.2% | 75.0% | 100% | 8.2/s |
| 6 | Qwen3.5-VL-2B Base | Qwen3.5-VL | 2B | Zero-shot | 84.4% | 73.0% | 100% | 6.6/s |
| 7 | Qwen3-VL-2B SFT+GRPO v9 NVFP4 | Qwen3-VL | 2B | SFT+GRPO | 84.2% | 74.1% | 100% | 17.2/s |
| 8 | Qwen3-VL-2B-Instruct Base | Qwen3-VL | 2B | Zero-shot | 76.4% | 66.7% | 100% | 15.1/s |
| 9 | InternVL3-2B GRPO+GTPO Full | InternVL3 | 2B | GRPO+GTPO | 72.7% | 64.3% | 100% | 11.8/s |
| 10 | InternVL3-2B GRPO+GTPO FP8 | InternVL3 | 2B | GRPO+GTPO | 72.2% | 63.8% | 100% | 14.3/s |
| 11 | InternVL3-2B Base | InternVL3 | 2B | Zero-shot | 71.8% | 63.7% | 100% | 11.8/s |
| 12 | Moondream2 Base | Moondream2 | 1.6B | Zero-shot | 69.8% | 61.8% | 100% | 1.4/s |
| 13 | Qwen3.5-VL-2B SFT+GRPO+GTPO | Qwen3.5-VL | 2B | SFT+GRPO+GTPO | 65.3% | 60.1% | 100% | 11.3/s |
| 14 | Qwen3.5-VL-2B SFT | Qwen3.5-VL | 2B | SFT | 63.7% | 58.9% | 100% | 11.6/s |
| 15 | Qwen3.5-VL-35B GPTQ-Int4 | Qwen3.5-VL MoE | 35B (3B) | Zero-shot | 50.7% | 48.7% | 14% | 1.6/s |
| 16 | Qwen3.5-VL-9B NVFP4 | Qwen3.5-VL | 9B | Zero-shot | 47.0% | 46.0% | 8% | 1.7/s |
| 17 | Qwen3.5-VL-9B SFT NVFP4 | Qwen3.5-VL | 9B | SFT | 46.3% | 45.5% | 8% | 1.7/s |
| 18 | Qwen3.5-VL-2B Base NVFP4 | Qwen3.5-VL | 2B | Zero-shot | 42.9% | 42.9% | 0% | 4.0/s |
| 19 | Qwen3.5-VL-122B NVFP4 | Qwen3.5-VL MoE | 122B (10B) | Zero-shot | 42.9% | 42.9% | 0% | 1.2/s |
| 20 | Qwen3.5-VL-2B SFT NVFP4 | Qwen3.5-VL | 2B | SFT | 42.9% | 42.9% | 0% | 4.0/s |
| 21 | Qwen3.5-VL-2B SFT+GRPO+GTPO NVFP4 | Qwen3.5-VL | 2B | SFT+GRPO+GTPO | 42.9% | 42.9% | 0% | 3.9/s |
| 22 | Phi-4-Multimodal NVFP4 | Phi-4 | 5.6B | Zero-shot | 42.9% | 42.9% | 0% | n/a |

Note: Models ranked 18-22 have a 0% JSON parse rate under NVFP4 quantization, meaning they cannot produce valid structured output. Their weighted scores reflect the 42.9% floor from partial field matches in malformed outputs. Fine-tuning is required to unlock their potential.


Task Definition

Given a single garment image, the model must extract 9 structured attributes as a valid JSON object:

{
  "type": "t-shirt",
  "color": "navy blue",
  "pattern": "solid",
  "neckline": "crew neck",
  "sleeve_length": "short sleeve",
  "closure": "pullover",
  "brand": "Nike",
  "size": "M",
  "defect_type": "small hole on left shoulder"
}
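
As a minimal sketch of how an output could be checked against this schema, the snippet below parses a raw model response and confirms that all nine keys are present. The helper names (`parse_prediction`, `json_parse_rate`) are hypothetical, and requiring every key is an assumption that is stricter than plain JSON validity; the actual evaluation code may differ.

```python
import json

# The nine attribute keys from the task definition above.
REQUIRED_FIELDS = (
    "type", "color", "pattern", "neckline", "sleeve_length",
    "closure", "brand", "size", "defect_type",
)


def parse_prediction(raw_output: str):
    """Return the parsed dict if raw_output is a JSON object with all nine keys, else None."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    # Assumption: a usable prediction must contain every required key.
    if any(key not in obj for key in REQUIRED_FIELDS):
        return None
    return obj


def json_parse_rate(raw_outputs):
    """Fraction of outputs accepted by parse_prediction (0.0 for an empty list)."""
    if not raw_outputs:
        return 0.0
    accepted = sum(parse_prediction(o) is not None for o in raw_outputs)
    return accepted / len(raw_outputs)
```

A null value (for example `"defect_type": null` on an undamaged garment) still passes this check, because only key presence is tested.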

Field Importance Weights

Not all fields are equally important. The weighted score uses domain-specific multipliers:

[Figure: Field Weights]

| Field | Weight | Rationale |
|---|---|---|
| Type | 2.5x | Critical for inventory routing and categorization |
| Defect | 2.0x | Directly impacts quality control and pricing |
| Brand | 1.5x | Essential for authentication and valuation |
| Size | 1.5x | Required for accurate listing and search |
| Color, Pattern, Neckline, Sleeve, Closure | 1.0x | Standard descriptive attributes |

Key Results

Per-Field Performance

[Figure: Radar Comparison]

[Figure: Performance Heatmap]

Accuracy vs Throughput

[Figure: Throughput Analysis]

Key finding: Qwen3-VL-2B v9 NVFP4 is the fastest model on the leaderboard at 17.2 samples/s while holding an 84.2% weighted score, which places it on the accuracy-throughput Pareto front and makes it a strong choice for throughput-bound production deployment. For maximum accuracy, Qwen3-VL-8B SFT+GRPO reaches 91.3% at 7.5 samples/s.
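
One way to check the accuracy-throughput trade-off is to compute which models are not dominated on both axes. The sketch below does this for the top eight leaderboard rows (the remaining rows are lower on both weighted score and throughput than Qwen3-VL-2B-SFT-GRPO-v9, so they cannot appear on the front). The function name `pareto_front` is illustrative, not part of any released tooling.

```python
# (model, weighted score in %, throughput in samples/s), copied from the leaderboard.
RESULTS = [
    ("Qwen3-VL-8B SFT+GRPO", 91.3, 7.5),
    ("Qwen3-VL-2B-SFT-GRPO-v9", 89.5, 15.9),
    ("Qwen3-VL-8B SFT+GRPO NVFP4", 89.5, 12.1),
    ("Qwen3-VL-8B-Instruct-Base", 87.5, 5.5),
    ("Qwen3-VL-8B-Instruct NVFP4", 87.2, 8.2),
    ("Qwen3.5-VL-2B Base", 84.4, 6.6),
    ("Qwen3-VL-2B SFT+GRPO v9 NVFP4", 84.2, 17.2),
    ("Qwen3-VL-2B-Instruct Base", 76.4, 15.1),
]


def pareto_front(results):
    """Return names of models that no other model beats on both score and throughput."""
    front = []
    for name, acc, tput in results:
        dominated = any(
            # Another model is at least as good on both axes and strictly better on one.
            (a >= acc and t >= tput) and (a > acc or t > tput)
            for _, a, t in results
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(RESULTS))
```

On these numbers the front contains three models: Qwen3-VL-8B SFT+GRPO, Qwen3-VL-2B-SFT-GRPO-v9, and Qwen3-VL-2B SFT+GRPO v9 NVFP4. The 8B NVFP4 variant drops out because the 2B v9 model matches its 89.5% score at higher throughput.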

Structured Output Reliability

[Figure: JSON Parse Rates]

Unquantized fine-tuned models achieve a 100% JSON parse rate, while the zero-shot quantized Qwen3.5-VL and Phi-4 baselines (GPTQ, NVFP4) fail to produce valid JSON in 86-100% of cases. For these models, SFT is essential for teaching the structured output format, regardless of model scale.

Impact of Training Stages

[Figure: Training Impact]

Left panel: Adding GRPO+GTPO to the Qwen3.5-VL-2B SFT model improves brand recognition from 15.6% to 24.8% and defect detection from 89.5% to 95.1%, for a +1.6pp gain in overall weighted score.

Right panel: FP8 quantization of InternVL3-2B shows <1% accuracy degradation across all fields while reducing memory footprint, confirming FP8 as a practical deployment optimization.


Model Collections

By Architecture

| Collection | Models | Description |
|---|---|---|
| Qwen3-VL | 7 | Top-performing Qwen3-VL based models (2B and 8B) |
| Qwen3.5-VL | 10 | Qwen3.5-VL models (0.8B to 122B) |
| InternVL3 | 6 | InternVL3 models (1B, 2B) |
| Florence-2 | 3 | Florence-2 encoder-decoder models |
| Benchmarks | 2 | Evaluation and training datasets |

Training Pipeline

All fine-tuned models follow the Denali-AI Multi-Stage RL Pipeline:

                    ┌──────────────────────────────────────────────────┐
                    │           Denali-AI Training Pipeline            │
                    └──────────────────────────────────────────────────┘
                                          │
                    ┌─────────────────────┼─────────────────────┐
                    ▼                     ▼                     ▼
              ┌───────────┐       ┌──────────────┐      ┌──────────────┐
              │  Stage 1  │       │   Stage 2    │      │   Stage 3    │
              │   SFT     │──────▶│    GRPO      │─────▶│    GTPO      │
              │  (LoRA)   │       │  (Rewards)   │      │ (Trajectory) │
              └───────────┘       └──────────────┘      └──────────────┘
                    │                     │                     │
              JSON format          Field accuracy         Coherence &
              acquisition          optimization           regularization

Stage 1: Supervised Fine-Tuning (SFT)

  • Method: LoRA (r=16, alpha=32) on frozen base model
  • Data: train-10k-balanced-v3, 10,000 curated samples
  • Objective: Teach valid JSON output format and basic field extraction
  • Key outcome: 100% JSON parse rate

Stage 2: Group Relative Policy Optimization (GRPO)

  • Method: Reward-based RL without a critic model
  • Reward engine: 3-layer scoring system
    • Layer 1: JSON validity gate (binary)
    • Layer 2: Structural correctness (20% weight)
    • Layer 3: Per-field content accuracy (80% weight)
  • Key outcome: Improved field-level accuracy, especially for challenging fields
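
The three reward layers and the critic-free update can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's reward engine: "structural correctness" is assumed to mean the fraction of expected keys present, `score_field` is a hypothetical per-field scorer supplied by the caller, and the advantage formula is the standard GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
import json
import statistics


def reward(raw_output, gold, score_field):
    """Three-layer reward: JSON gate, then 20% structure + 80% field content."""
    # Layer 1: binary JSON validity gate.
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(pred, dict):
        return 0.0

    keys = list(gold)
    # Layer 2 (assumed definition): fraction of expected keys present.
    structure = sum(k in pred for k in keys) / len(keys)
    # Layer 3: mean per-field content score; missing fields contribute 0.
    content = sum(score_field(pred[k], gold[k]) for k in keys if k in pred) / len(keys)
    return 0.2 * structure + 0.8 * content


def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group, so no critic model is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions scored the same: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

For example, with an exact-match scorer, a completion that returns only one of two expected fields correctly scores 0.2 * 0.5 + 0.8 * 0.5 = 0.5.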

Stage 3: Group-relative Trajectory-based Policy Optimization (GTPO)

  • Method: Conflict-aware gradient optimization with entropy regularization
  • Key outcome: Trajectory-level coherence and reduced field-level conflicts

Evaluation Methodology

Benchmark

All models are evaluated on eval_hard_3500, a curated benchmark of 3,500 challenging garment images selected for diversity in:

  • Garment type (tops, bottoms, dresses, outerwear, accessories)
  • Visual complexity (patterns, prints, multi-color)
  • Edge cases (ambiguous attributes, partially visible labels)

Metrics

We employ a comprehensive multi-metric evaluation framework rather than relying on exact match. Each metric captures a different dimension of prediction quality:

| Metric | Model | Description |
|---|---|---|
| SBERT Cosine | all-MiniLM-L6-v2 | Semantic similarity via sentence embeddings |
| NLI Score | nli-MiniLM2-L6-H768 | Natural language inference entailment |
| Levenshtein Ratio | n/a | Fuzzy string matching distance |
| Token F1 | n/a | Token-level precision and recall |
| SBERT+NLI Combined | n/a | Primary metric: SBERT cosine base score adjusted by NLI (cascaded, see definition below) |
| Weighted Score | n/a | Field-weighted aggregate (see weights above) |

Metric Definitions

SBERT Cosine Similarity

Measures how semantically close the predicted value is to the ground truth by encoding both strings into dense vector embeddings using the all-MiniLM-L6-v2 sentence-transformer model and computing their cosine similarity. A score of 1.0 means the embeddings are identical in direction (semantically equivalent), while 0.0 means they are orthogonal (unrelated). This captures meaning-level similarity: for example, "navy blue" and "dark blue" score high despite being different strings. Values are thresholded: scores of 0.85 and above map to full credit, scores below 0.50 map to zero, and values in between are linearly scaled.

NLI Score (Natural Language Inference)

Uses a cross-encoder NLI model (nli-MiniLM2-L6-H768) to determine whether the predicted value entails, contradicts, or is neutral to the ground truth. The model evaluates the pair as a premise-hypothesis pair (e.g., "the color is navy blue" vs "the color is dark blue"). Entailment probability above 0.6 yields a score of at least 0.8; contradiction probability above 0.6 heavily penalizes the score (scaled down to 30% of base). This metric is particularly valuable for detecting semantic contradictions that string-level metrics would miss: "long sleeve" vs "short sleeve", for instance, are textually similar but semantically opposite.

Levenshtein Ratio

Computes the normalized edit distance between the predicted and ground-truth strings (after lowercasing and stripping). The ratio is 1 - (edit_distance / max_length), ranging from 0.0 (completely different) to 1.0 (identical). This character-level metric catches minor spelling variations and typos: for example, "pullover" vs "pull-over" differ by one edit over nine characters and score about 0.89. It complements the semantic metrics by providing a surface-level similarity signal that is model-free and deterministic.
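
A minimal, dependency-free sketch of this ratio is shown below, using the textbook dynamic-programming edit distance. The function names are illustrative; the evaluation code may rely on a library implementation instead.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def levenshtein_ratio(pred: str, gold: str) -> float:
    """1 - edit_distance / max_length after lowercasing and stripping."""
    a, b = pred.lower().strip(), gold.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

With this definition, `levenshtein_ratio("pullover", "pull-over")` is 1 - 1/9, roughly 0.889.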

Token F1

Computes token-level precision and recall by treating the predicted and ground-truth strings as bags of whitespace-delimited tokens. Precision is the fraction of predicted tokens that appear in the ground truth; recall is the fraction of ground-truth tokens that appear in the prediction. F1 is their harmonic mean. This metric handles multi-word values well: "light blue cotton" vs "blue cotton" shares two tokens and scores F1 = 0.8, whereas exact match would score 0. It is particularly useful for defect descriptions and color fields, where partial matches are meaningful.
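
The computation can be sketched as below. Lowercasing before tokenizing and the handling of empty strings are assumptions made for this example; the source only specifies whitespace-delimited bags of tokens.

```python
from collections import Counter


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall over whitespace tokens."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Assumption: two empty strings match, one empty side scores zero.
        return 1.0 if pred_tokens == gold_tokens else 0.0

    # Multiset intersection counts each shared token at most as often as it
    # occurs on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For "light blue cotton" against "blue cotton", precision is 2/3 and recall is 1, giving F1 = 0.8.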

SBERT+NLI Combined

The primary evaluation metric used for ranking models. It combines SBERT cosine similarity and NLI scoring in a cascaded approach inspired by the training reward engine: first, the SBERT cosine score is mapped to a base score (1.0 if cosine >= 0.85, linearly scaled between 0.50-0.85, 0.0 below 0.50). Then, NLI adjusts this base: if the NLI model detects strong entailment (>0.6), the score is boosted to at least 0.8; if it detects strong contradiction (>0.6), the score is reduced to 30% of the base. This two-stage approach leverages both embedding similarity and logical inference for robust evaluation.
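
The cascade can be written as a small pure function once the cosine similarity and NLI probabilities are available. The sketch below takes those three numbers as inputs rather than calling the models, and the function names are hypothetical; the thresholds are the ones stated above.

```python
def sbert_base_score(cosine: float, low: float = 0.50, high: float = 0.85) -> float:
    """Map cosine similarity to [0, 1]: 0 below `low`, 1 at or above `high`, linear between."""
    if cosine >= high:
        return 1.0
    if cosine < low:
        return 0.0
    return (cosine - low) / (high - low)


def combined_score(cosine: float, p_entail: float, p_contra: float,
                   threshold: float = 0.6) -> float:
    """Adjust the SBERT base score using NLI entailment and contradiction probabilities."""
    base = sbert_base_score(cosine)
    if p_entail > threshold:
        return max(base, 0.8)  # strong entailment lifts the score to at least 0.8
    if p_contra > threshold:
        return 0.3 * base      # strong contradiction keeps only 30% of the base
    return base
```

For instance, a pair with cosine 0.9 but contradiction probability 0.9 (such as "long sleeve" vs "short sleeve" might produce) would drop from a base of 1.0 to 0.3.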

Weighted Score

The headline metric for model comparison. It multiplies each field's SBERT+NLI Combined score by its domain-specific importance weight (type=2.5x, defect=2.0x, brand=1.5x, size=1.5x, others=1.0x) and normalizes by the total weight. This reflects real-world value: correctly identifying garment type and defects matters more than getting the closure style right. A hallucination (predicting a value when ground truth is null) incurs a -0.3 penalty to discourage false positives. The weighted score ranges from 0% to 100%, with our best model achieving 91.3%.
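
As a rough sketch of the aggregation for a single sample, the function below applies the stated weights and normalizes by their sum (12.5). How the -0.3 hallucination penalty enters the sum is not specified above, so this example assumes it replaces the score of the affected field; the names `FIELD_WEIGHTS` and `weighted_score` are illustrative.

```python
# Weights from the Field Importance Weights section, keyed by JSON field name.
FIELD_WEIGHTS = {
    "type": 2.5,
    "defect_type": 2.0,
    "brand": 1.5,
    "size": 1.5,
    "color": 1.0,
    "pattern": 1.0,
    "neckline": 1.0,
    "sleeve_length": 1.0,
    "closure": 1.0,
}
HALLUCINATION_PENALTY = -0.3


def weighted_score(field_scores, hallucinated=()):
    """Weight-normalized average of per-field scores for one sample.

    field_scores: dict of field name -> SBERT+NLI combined score in [0, 1].
    hallucinated: fields where a value was predicted but ground truth is null.
    """
    total_weight = sum(FIELD_WEIGHTS.values())  # 12.5
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        if field in hallucinated:
            # Assumption: the penalty replaces the field's score.
            score = HALLUCINATION_PENALTY
        else:
            score = field_scores.get(field, 0.0)
        total += weight * score
    return total / total_weight
```

Under these assumptions, a sample with only the garment type correct scores 2.5 / 12.5 = 0.2, which illustrates how heavily the type field counts.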

JSON Parse Rate

The percentage of model outputs that are valid, parseable JSON objects. Unquantized fine-tuned models achieve 100%; the zero-shot quantized Qwen3.5-VL and Phi-4 models parse only 0-14% of the time. This is a binary pass/fail gate: if the output cannot be parsed as JSON, all field scores for that sample are 0.

Throughput

End-to-end inference speed measured in samples per second, including network overhead, across 8 concurrent workers hitting a vLLM server. Higher throughput indicates better production viability. Measured on NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM).

This multi-metric approach captures semantic similarity rather than requiring exact string matches, which is critical for fields like color ("navy blue" vs "dark blue") and defect descriptions.

Evaluation Protocol

  • Inference: 8 concurrent workers via OpenAI-compatible API (vLLM)
  • Samples: All 3,500 samples, no subsampling
  • Compute: NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM)
  • Reproducibility: Fixed prompts, deterministic sampling (temperature=0)

Key Findings

  1. Qwen3-VL-8B SFT+GRPO is the new champion at 91.3%. Fine-tuning the 8B model with SFT+GRPO surpasses the previous best (2B v9 at 89.5%) while maintaining 100% JSON parse rate.

  2. Architecture matters more than scale. The 2B Qwen3-VL (89.5%) outperforms the 35B Qwen3.5 MoE (50.7%) by a wide margin, and even the zero-shot Qwen3-VL-8B (87.5%) outperforms all fine-tuned Qwen3.5-VL models.

  3. SFT is non-negotiable for structured output. All unquantized fine-tuned models achieve a 100% JSON parse rate, while the zero-shot NVFP4/GPTQ Qwen3.5-VL and Phi-4 models parse only 0-14% of outputs. No amount of model scale compensates for the lack of format training.

  4. NVFP4 quantization largely preserves accuracy for Qwen3-VL. The 8B NVFP4 variant loses only 1.8pp (91.3% to 89.5%) while gaining 61% throughput (7.5 to 12.1 samples/s). The 2B NVFP4 variant loses 5.3pp (89.5% to 84.2%) and gains 8% throughput (15.9 to 17.2 samples/s).

  5. FP8 quantization is effectively free. InternVL3-2B loses less than 1pp of weighted score with FP8 (72.7% to 72.2%) while gaining 21% throughput (11.8 to 14.3 samples/s).

  6. Qwen3-VL dominates all scales. Seven of the top eight models are Qwen3-VL variants; the one exception is the zero-shot Qwen3.5-VL-2B Base at rank 6. Even zero-shot Qwen3-VL-8B (87.5%) outperforms all fine-tuned InternVL3 and Qwen3.5-VL models.

  7. RL provides meaningful but modest gains. GRPO+GTPO adds +1.6pp weighted score over SFT-only for Qwen3.5-VL-2B (63.7% to 65.3%), with the largest gains on brand (+9.2pp) and defect (+5.6pp).


Research Directions & Future Work

Near-Term Improvements

| Direction | Expected Impact | Effort |
|---|---|---|
| GTPO on Qwen3-VL-8B SFT+GRPO | +1-3pp weighted (add trajectory optimization to the #1 model) | Low |
| GTPO on Qwen3-VL-2B v9 | +2-4pp weighted (currently SFT+GRPO only) | Low |
| SFT on Qwen3-VL-8B from zero-shot | Push past 91.3% with better starting point | Low |
| QLoRA on Qwen3.5-35B GPTQ | JSON parse 14% -> 100%, weighted 50% -> ~80%+ | Low |
| OCR pre-processing pipeline | Fix brand/size for Qwen3.5 models (+30-60pp on those fields) | Medium |
| Higher LoRA rank (r=32/64) | +1-3pp from increased adapter capacity | Low |
| Guided JSON decoding | Force 100% JSON parse on zero-shot models without training | Low |

Architecture Exploration

Models we haven't tested but are strong candidates:

| Model | Parameters | Why Promising |
|---|---|---|
| InternVL3-4B | 4B | Mid-range InternVL; may close gap to Qwen3-VL |
| SmolVLM2-2.2B | 2.2B | HuggingFace's efficient VLM; strong structured output |
| PaliGemma2-3B | 3B | Google VLM with excellent OCR; may solve brand/size |
| MiniCPM-V-2.6 | 2.8B | Strong small VLM with good OCR capabilities |
| Qwen3-VL-32B | 32B | Largest Qwen3-VL; given 8B dominance, could push past 95% |

Long-Term Research

  1. Ensemble routing: Use a lightweight classifier to route each field to the best-performing model (e.g., Qwen3-VL for visual attributes, InternVL3 for brand/size)
  2. Curriculum learning: Progressive difficulty training, with easy garments first and hard edge cases last
  3. Synthetic data generation: Use large VLMs (122B) to generate training labels for unlabeled garment images at scale
  4. Multi-image input: Leverage front + back + tag images simultaneously for higher accuracy
  5. Active learning: Identify samples where models disagree most and prioritize annotation of those

Key Open Questions

  • Why does Qwen3-VL dramatically outperform Qwen3.5-VL at the same scale? Is it the vision encoder, the cross-attention mechanism, or training data?
  • Can RL gains be amplified beyond +1.8pp on the 8B model? Current GRPO hyperparameters may be suboptimal
  • Is there a parameter count sweet spot between 8B and 32B where accuracy saturates?
  • Would instruction-tuned checkpoints (vs. pretrained base models) yield better SFT starting points?

Datasets

| Dataset | Samples | Purpose |
|---|---|---|
| eval_hard_3500 | 3,500 | Evaluation benchmark (hard subset) |
| train_10k_balanced_v3 | 10,000 | Training data (balanced sampling) |

Citation

@misc{denali-ai-2026,
  title={Structured Garment Attribute Extraction via Multi-Stage Reinforcement Learning},
  author={Denali AI},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Denali-AI}
}

License

All models and datasets are released under the Apache 2.0 License.

Contact