Spaces:

Denali-AI
/

README

Configuration error

App Files Files Community

README / README.md

msudharsanan

Update README.md

7371f1c verified 3 days ago

preview code

raw

history blame contribute delete

22.5 kB

metadata

title: Denali AI
short_description: VLMs for Garment Attribute Extraction

Denali AI — Vision-Language Models for Garment Classification

Advancing structured attribute extraction from garment images through multi-stage reinforcement learning

Abstract

Denali AI develops and benchmarks vision-language models (VLMs) for structured garment attribute extraction — the task of analyzing a garment image and producing a complete JSON object describing 9 key attributes: type, color, pattern, neckline, sleeve length, closure, brand, size, and defect type.

We systematically evaluate the impact of supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO), and Group-relative Trajectory-based Policy Optimization (GTPO) across multiple model architectures (Qwen3-VL, Qwen3.5-VL, InternVL3, Florence-2, Moondream2, Phi-4) and scales (1.6B to 122B parameters). Our best model, Qwen3-VL-8B SFT+GRPO, achieves 91.3% weighted score with 100% JSON parse rate on the eval_hard_3500 benchmark.

Leaderboard

Rank	Model	Architecture	Params	Training	Weighted	SBERT+NLI	JSON Parse	Throughput
1	Qwen3-VL-8B SFT+GRPO	Qwen3-VL	8B	SFT+GRPO	91.3%	78.7%	100%	7.5/s
2	Qwen3-VL-2B-SFT-GRPO-v9	Qwen3-VL	2B	SFT+GRPO	89.5%	78.5%	100%	15.9/s
3	Qwen3-VL-8B SFT+GRPO NVFP4	Qwen3-VL	8B	SFT+GRPO	89.5%	77.0%	100%	12.1/s
4	Qwen3-VL-8B-Instruct-Base	Qwen3-VL	8B	Zero-shot	87.5%	75.6%	100%	5.5/s
5	Qwen3-VL-8B-Instruct NVFP4	Qwen3-VL	8B	Zero-shot	87.2%	75.0%	100%	8.2/s
6	Qwen3.5-VL-2B Base	Qwen3.5-VL	2B	Zero-shot	84.4%	73.0%	100%	6.6/s
7	Qwen3-VL-2B SFT+GRPO v9 NVFP4	Qwen3-VL	2B	SFT+GRPO	84.2%	74.1%	100%	17.2/s
8	Qwen3-VL-2B-Instruct Base	Qwen3-VL	2B	Zero-shot	76.4%	66.7%	100%	15.1/s
9	InternVL3-2B GRPO+GTPO Full	InternVL3	2B	GRPO+GTPO	72.7%	64.3%	100%	11.8/s
10	InternVL3-2B GRPO+GTPO FP8	InternVL3	2B	GRPO+GTPO	72.2%	63.8%	100%	14.3/s
11	InternVL3-2B Base	InternVL3	2B	Zero-shot	71.8%	63.7%	100%	11.8/s
12	Moondream2 Base	Moondream2	1.6B	Zero-shot	69.8%	61.8%	100%	1.4/s
13	Qwen3.5-VL-2B SFT+GRPO+GTPO	Qwen3.5-VL	2B	SFT+GRPO+GTPO	65.3%	60.1%	100%	11.3/s
14	Qwen3.5-VL-2B SFT	Qwen3.5-VL	2B	SFT	63.7%	58.9%	100%	11.6/s
15	Qwen3.5-VL-35B GPTQ-Int4	Qwen3.5-VL MoE	35B (3B)	Zero-shot	50.7%	48.7%	14%	1.6/s
16	Qwen3.5-VL-9B NVFP4	Qwen3.5-VL	9B	Zero-shot	47.0%	46.0%	8%	1.7/s
17	Qwen3.5-VL-9B SFT NVFP4	Qwen3.5-VL	9B	SFT	46.3%	45.5%	8%	1.7/s
18	Qwen3.5-VL-2B Base NVFP4	Qwen3.5-VL	2B	Zero-shot	42.9%	42.9%	0%	4.0/s
19	Qwen3.5-VL-122B NVFP4	Qwen3.5-VL MoE	122B (10B)	Zero-shot	42.9%	42.9%	0%	1.2/s
20	Qwen3.5-VL-2B SFT NVFP4	Qwen3.5-VL	2B	SFT	42.9%	42.9%	0%	4.0/s
21	Qwen3.5-VL-2B SFT+GRPO+GTPO NVFP4	Qwen3.5-VL	2B	SFT+GRPO+GTPO	42.9%	42.9%	0%	3.9/s
22	Phi-4-Multimodal NVFP4	Phi-4	5.6B	Zero-shot	42.9%	42.9%	0%	—

Note: Models ranked 18-22 have 0% JSON parse rate under NVFP4 quantization, meaning they cannot produce valid structured output — their weighted scores reflect the 42.9% floor from partial field matches in malformed outputs. Fine-tuning is required to unlock their potential.

Task Definition

Given a single garment image, the model must extract 9 structured attributes as a valid JSON object:

{
  "type": "t-shirt",
  "color": "navy blue",
  "pattern": "solid",
  "neckline": "crew neck",
  "sleeve_length": "short sleeve",
  "closure": "pullover",
  "brand": "Nike",
  "size": "M",
  "defect_type": "small hole on left shoulder"
}

Field Importance Weights

Not all fields are equally important. The weighted score uses domain-specific multipliers:

Field	Weight	Rationale
Type	2.5x	Critical for inventory routing and categorization
Defect	2.0x	Directly impacts quality control and pricing
Brand	1.5x	Essential for authentication and valuation
Size	1.5x	Required for accurate listing and search
Color, Pattern, Neckline, Sleeve, Closure	1.0x	Standard descriptive attributes

Key Results

Per-Field Performance

Accuracy vs Throughput

Key finding: Qwen3-VL-2B v9 NVFP4 achieves the best accuracy-throughput trade-off at 84.2% weighted score and 17.2 samples/s — making it the Pareto-optimal choice for production deployment. For maximum accuracy, the Qwen3-VL-8B SFT+GRPO reaches 91.3% at 7.5 samples/s.

Structured Output Reliability

Fine-tuned models achieve 100% JSON parse rate, while zero-shot baselines (GPTQ, NVFP4) fail to produce valid JSON in 86-100% of cases. This demonstrates that SFT is essential for teaching structured output format, regardless of model scale.

Impact of Training Stages

Left panel: Adding GRPO+GTPO to Qwen3.5-2B improves brand recognition from 15.6% to 24.8% and defect detection from 89.5% to 95.1%, with a +1.6% overall gain.

Right panel: FP8 quantization of InternVL3-2B shows <1% accuracy degradation across all fields while reducing memory footprint, confirming FP8 as a practical deployment optimization.

Model Collections

By Architecture

Collection	Models	Description
Qwen3-VL	7	Top-performing Qwen3-VL based models (2B and 8B)
Qwen3.5-VL	10	Qwen3.5-VL models (0.8B to 122B)
InternVL3	6	InternVL3 models (1B, 2B)
Florence-2	3	Florence-2 encoder-decoder models
Benchmarks	2	Evaluation and training datasets

Training Pipeline

All fine-tuned models follow the Denali-AI Multi-Stage RL Pipeline:

                    ┌─────────────────────────────────────────────────┐
                    │           Denali-AI Training Pipeline            │
                    └─────────────────────────────────────────────────┘
                                          │
                    ┌─────────────────────┼─────────────────────┐
                    ▼                     ▼                     ▼
              ┌──────────┐        ┌──────────────┐      ┌──────────────┐
              │  Stage 1  │        │   Stage 2    │      │   Stage 3    │
              │   SFT     │───────▶│    GRPO      │─────▶│    GTPO      │
              │  (LoRA)   │        │  (Rewards)   │      │ (Trajectory) │
              └──────────┘        └──────────────┘      └──────────────┘
                    │                     │                     │
              JSON format          Field accuracy         Coherence &
              acquisition          optimization           regularization

Stage 1: Supervised Fine-Tuning (SFT)

Method: LoRA (r=16, alpha=32) on frozen base model
Data: train-10k-balanced-v3 — 10,000 curated samples
Objective: Teach valid JSON output format and basic field extraction
Key outcome: 100% JSON parse rate

Stage 2: Group Relative Policy Optimization (GRPO)

Method: Reward-based RL without a critic model
Reward engine: 3-layer scoring system
- Layer 1: JSON validity gate (binary)
- Layer 2: Structural correctness (20% weight)
- Layer 3: Per-field content accuracy (80% weight)
Key outcome: Improved field-level accuracy, especially for challenging fields

Stage 3: Group-relative Trajectory-based Policy Optimization (GTPO)

Method: Conflict-aware gradient optimization with entropy regularization
Key outcome: Trajectory-level coherence and reduced field-level conflicts

Evaluation Methodology

Benchmark

All models are evaluated on eval_hard_3500 — a curated benchmark of 3,500 challenging garment images selected for diversity in:

Garment type (tops, bottoms, dresses, outerwear, accessories)
Visual complexity (patterns, prints, multi-color)
Edge cases (ambiguous attributes, partially visible labels)

Metrics

We employ a comprehensive multi-metric evaluation framework rather than relying on exact match. Each metric captures a different dimension of prediction quality:

Metric	Model	Description
SBERT Cosine	all-MiniLM-L6-v2	Semantic similarity via sentence embeddings
NLI Score	nli-MiniLM2-L6-H768	Natural language inference entailment
Levenshtein Ratio	—	Fuzzy string matching distance
Token F1	—	Token-level precision and recall
SBERT+NLI Combined	—	Primary metric: average of SBERT cosine and NLI
Weighted Score	—	Field-weighted aggregate (see weights above)

Metric Definitions (click to expand)

SBERT Cosine Similarity

Measures how semantically close the predicted value is to the ground truth by encoding both strings into dense vector embeddings using the all-MiniLM-L6-v2 sentence-transformer model and computing their cosine similarity. A score of 1.0 means the embeddings are identical in direction (semantically equivalent), while 0.0 means they are orthogonal (unrelated). This captures meaning-level similarity — for example, "navy blue" and "dark blue" score high despite being different strings. Values are thresholded: scores above 0.85 map to full credit, scores below 0.50 map to zero, and values in between are linearly scaled.

NLI Score (Natural Language Inference)

Uses a cross-encoder NLI model (nli-MiniLM2-L6-H768) to determine whether the predicted value entails, contradicts, or is neutral to the ground truth. The model evaluates the pair as a premise-hypothesis pair (e.g., "the color is navy blue" vs "the color is dark blue"). Entailment probability above 0.6 yields a score of at least 0.8; contradiction probability above 0.6 heavily penalizes the score (scaled down to 30% of base). This metric is particularly valuable for detecting semantic contradictions that string-level metrics would miss — e.g., "long sleeve" vs "short sleeve" are textually similar but semantically opposite.

Levenshtein Ratio

Computes the normalized edit distance between the predicted and ground-truth strings (after lowercasing and stripping). The ratio is 1 - (edit_distance / max_length), ranging from 0.0 (completely different) to 1.0 (identical). This character-level metric catches minor spelling variations and typos — for example, "pullover" vs "pull-over" score nearly 1.0. It complements the semantic metrics by providing a surface-level similarity signal that is model-free and deterministic.

Token F1

Computes token-level precision and recall by treating the predicted and ground-truth strings as bags of whitespace-delimited tokens. Precision is the fraction of predicted tokens that appear in the ground truth; recall is the fraction of ground-truth tokens that appear in the prediction. F1 is their harmonic mean. This metric handles multi-word values well — "light blue cotton" vs "blue cotton" gets partial credit for the overlapping tokens, unlike exact match which would score 0. Particularly useful for defect descriptions and color fields where partial matches are meaningful.

SBERT+NLI Combined

The primary evaluation metric used for ranking models. It combines SBERT cosine similarity and NLI scoring in a cascaded approach inspired by the training reward engine: first, the SBERT cosine score is mapped to a base score (1.0 if cosine >= 0.85, linearly scaled between 0.50-0.85, 0.0 below 0.50). Then, NLI adjusts this base: if the NLI model detects strong entailment (>0.6), the score is boosted to at least 0.8; if it detects strong contradiction (>0.6), the score is reduced to 30% of the base. This two-stage approach leverages both embedding similarity and logical inference for robust evaluation.

Weighted Score

The headline metric for model comparison. It multiplies each field's SBERT+NLI Combined score by its domain-specific importance weight (type=2.5x, defect=2.0x, brand=1.5x, size=1.5x, others=1.0x) and normalizes by the total weight. This reflects real-world value — correctly identifying garment type and defects matters more than getting the closure style right. A hallucination (predicting a value when ground truth is null) incurs a -0.3 penalty to discourage false positives. The weighted score ranges from 0% to 100%, with our best model achieving 91.3%.

JSON Parse Rate

The percentage of model outputs that are valid, parseable JSON objects. Fine-tuned models achieve 100%; zero-shot models often fail at 0-14%. This is a binary pass/fail gate — if the output cannot be parsed as JSON, all field scores for that sample are 0.

Throughput

End-to-end inference speed measured in samples per second, including network overhead, across 8 concurrent workers hitting a vLLM server. Higher throughput indicates better production viability. Measured on NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM).

This multi-metric approach captures semantic similarity rather than requiring exact string matches, which is critical for fields like color ("navy blue" vs "dark blue") and defect descriptions.

Evaluation Protocol

Inference: 8 concurrent workers via OpenAI-compatible API (vLLM)
Samples: All 3,500 samples, no subsampling
Compute: NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM)
Reproducibility: Fixed prompts, deterministic sampling (temperature=0)

Key Findings

Qwen3-VL-8B SFT+GRPO is the new champion at 91.3%. Fine-tuning the 8B model with SFT+GRPO surpasses the previous best (2B v9 at 89.5%) while maintaining 100% JSON parse rate.
Architecture matters more than scale. The 2B Qwen3-VL (89.5%) outperforms the 35B Qwen3.5 MoE (50.7%) by a wide margin, and even the zero-shot Qwen3-VL-8B (87.5%) outperforms all fine-tuned Qwen3.5-VL models.
SFT is non-negotiable for structured output. All fine-tuned models achieve 100% JSON parse rate; all zero-shot NVFP4/GPTQ models fail at 0-14%. No amount of model scale compensates for the lack of format training.
NVFP4 quantization preserves accuracy for Qwen3-VL. The 8B NVFP4 variant loses only 1.8pp (91.3% vs 89.5%) while gaining 61% throughput (7.5 vs 12.1 samples/s). The 2B NVFP4 loses 5.3pp but gains 8% throughput.
FP8 quantization is effectively free. InternVL3-2B loses <1% accuracy with FP8, while gaining 21% throughput improvement (11.8 vs 14.3 samples/s).
Qwen3-VL dominates all scales. The top 8 models are all Qwen3-VL variants. Even zero-shot Qwen3-VL-8B (87.5%) outperforms all fine-tuned InternVL3 and Qwen3.5-VL models.
RL provides meaningful but modest gains. GRPO+GTPO adds +1.6% weighted score over SFT-only for Qwen3.5-2B, with the largest gains on brand (+9.2pp) and defect (+5.6pp).

Research Directions & Future Work

Near-Term Improvements

Direction	Expected Impact	Effort
GTPO on Qwen3-VL-8B SFT+GRPO	+1-3pp weighted (add trajectory optimization to the #1 model)	Low
GTPO on Qwen3-VL-2B v9	+2-4pp weighted (currently SFT+GRPO only)	Low
SFT on Qwen3-VL-8B from zero-shot	Push past 91.3% with better starting point	Low
QLoRA on Qwen3.5-35B GPTQ	JSON parse 14% -> 100%, weighted 50% -> ~80%+	Low
OCR pre-processing pipeline	Fix brand/size for Qwen3.5 models (+30-60pp on those fields)	Medium
Higher LoRA rank (r=32/64)	+1-3pp from increased adapter capacity	Low
Guided JSON decoding	Force 100% JSON parse on zero-shot models without training	Low

Architecture Exploration

Models we haven't tested but are strong candidates:

Model	Parameters	Why Promising
InternVL3-4B	4B	Mid-range InternVL — may close gap to Qwen3-VL
SmolVLM2-2.2B	2.2B	HuggingFace's efficient VLM — strong structured output
PaliGemma2-3B	3B	Google VLM with excellent OCR — may solve brand/size
MiniCPM-V-2.6	2.8B	Strong small VLM with good OCR capabilities
Qwen3-VL-32B	32B	Largest Qwen3-VL — given 8B dominance, could push past 95%

Long-Term Research

Ensemble routing: Use a lightweight classifier to route each field to the best-performing model (e.g., Qwen3-VL for visual attributes, InternVL3 for brand/size)
Curriculum learning: Progressive difficulty training — easy garments first, hard edge cases last
Synthetic data generation: Use large VLMs (122B) to generate training labels for unlabeled garment images at scale
Multi-image input: Leverage front + back + tag images simultaneously for higher accuracy
Active learning: Identify samples where models disagree most and prioritize annotation of those

Key Open Questions

Why does Qwen3-VL dramatically outperform Qwen3.5-VL at the same scale? Is it the vision encoder, the cross-attention mechanism, or training data?
Can RL gains be amplified beyond +1.8pp on the 8B model? Current GRPO hyperparameters may be suboptimal
Is there a parameter count sweet spot between 8B and 32B where accuracy saturates?
Would instruction-tuned base models (vs base models) yield better SFT starting points?

Datasets

Dataset	Samples	Purpose	Link
eval_hard_3500	3,500	Evaluation benchmark (hard subset)	Link
train_10k_balanced_v3	10,000	Training data (balanced sampling)	Link

Citation

@misc{denali-ai-2026,
  title={Structured Garment Attribute Extraction via Multi-Stage Reinforcement Learning},
  author={Denali AI},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Denali-AI}
}

License

All models and datasets are released under the Apache 2.0 License.

Contact

Organization: Denali Advanced Integration
Issues: GitHub
HuggingFace: Denali-AI