Zhihua is the first domain-specific vision-language model dedicated to Traditional Chinese Painting (TCP) appreciation and storytelling. It combines supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) and a threshold-based Retrieval-Augmented Generation (RAG) mechanism to reduce hallucinations and produce culturally resonant narratives. The model is trained on a large-scale TCP dataset containing question-answer pairs, technique annotations, and expert-level textual analyses. Extensive experiments and human evaluations show that Zhihua generates more faithful, coherent, and aesthetically aware descriptions than existing vision-language baselines.
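The threshold-based retrieval idea can be illustrated with a minimal sketch. It is not Zhihua's actual implementation: the function names, the cosine-similarity scoring, the default threshold of 0.75, and the prompt layout are all assumptions made for illustration. The point it shows is that retrieved passages are used only when they are similar enough to the query; otherwise the model receives the bare question, so weakly related text is not injected into the context.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_above_threshold(
    query_vec: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    threshold: float = 0.75,  # hypothetical value, not taken from the model card
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to top_k (text, score) pairs with score >= threshold, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in corpus]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


def build_prompt(question: str, passages: List[Tuple[str, float]]) -> str:
    """Prepend retrieved passages to the question; use the bare question if none passed."""
    if not passages:
        return question
    context = "\n".join(f"- {text}" for text, _ in passages)
    return f"Reference notes:\n{context}\n\nQuestion: {question}"
```

In practice the vectors would come from an embedding model and the threshold would be tuned on held-out data; both are outside the scope of this sketch.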

Model Details

Model Description

Zhihua accepts an image of a traditional Chinese painting, optionally accompanied by a free-form text question, and outputs multilingual narratives that explain artistic techniques, historical context, symbolism, and aesthetic values. A three-stage pipeline (SFT → DPO → RAG) progressively aligns the model with expert preferences while mitigating hallucinations. The authors also propose domain-oriented evaluation metrics (factual correctness, stylistic fidelity, interpretive depth) to benchmark TCP understanding.

  • Developed by: Hangzhou Research Institute of AI and Holographic Technology, Zhejiang University, and Hangzhou City University
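For readers unfamiliar with the DPO stage, the standard per-example DPO objective can be written in a few lines. This is a sketch of the generic loss only; the function name, the `beta` value, and the use of plain floats are illustrative assumptions, and nothing here describes Zhihua's actual training code or hyperparameters. Each argument is the summed log-probability of a full response (the expert-preferred "chosen" one or the "rejected" one) under either the policy being trained or the frozen reference model.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not from the model card
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and the reference agree, and it falls as the policy raises the likelihood of the preferred response relative to the rejected one.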

Uses

Direct Use

  • Upload a TCP image and receive an automatic appreciation paragraph in Chinese or English.
  • Ask open-ended questions about technique, author, dynasty, symbolism, etc.
  • Integrate into museum kiosks or educational apps for AI-guided art tours.

How to Get Started with the Model

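from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load the processor and the model weights from the Hub.
# trust_remote_code=True lets transformers run custom code shipped with the repo.
processor = AutoProcessor.from_pretrained("zjuqx/Zhihua")
model = AutoModelForVision2Seq.from_pretrained("zjuqx/Zhihua", trust_remote_code=True)

# Open the painting and convert it to RGB so the processor receives three channels.
image = Image.open("path/to/tcp.jpg").convert("RGB")
prompt = "请赏析这幅画。"  # "Please give an appreciation of this painting."

# Encode the image and prompt, generate up to 512 new tokens, and decode the result.
inputs = processor(images=image, text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))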

Model size: 33B params (Safetensors, BF16)