Zhihua is the first domain-specific vision-language model dedicated to Traditional Chinese Painting (TCP) appreciation and storytelling. It combines supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) and a threshold-based Retrieval-Augmented Generation (RAG) mechanism to reduce hallucinations and produce culturally resonant narratives. The model is trained on a large-scale TCP dataset containing question-answer pairs, technique annotations, and expert-level textual analyses. Extensive experiments and human evaluations show that Zhihua generates more faithful, coherent, and aesthetically aware descriptions than existing vision-language baselines.
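The card does not say how the threshold-based RAG step is implemented, so the following is only a minimal sketch of the general idea: retrieved passages are added to the prompt only when their similarity to the query clears a cutoff, and otherwise the question is passed through unchanged. The function names, the toy embeddings, the prompt layout, and the default threshold of 0.8 are all assumptions made for illustration, not details of Zhihua.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_prompt(question, query_vec, knowledge_base, threshold=0.8):
    """Attach retrieved passages only when they are similar enough to the query.

    knowledge_base is a list of (embedding, passage) pairs. The threshold value
    and the prompt layout are hypothetical, chosen only for this sketch.
    """
    scored = [(cosine(query_vec, emb), passage) for emb, passage in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    kept = [passage for score, passage in scored if score >= threshold]
    if not kept:
        # Nothing is relevant enough: skip retrieval instead of adding weak context.
        return question
    context = "\n".join(kept)
    return f"Reference notes:\n{context}\n\nQuestion: {question}"


kb = [
    ([1.0, 0.0], "Ink wash uses graded dilutions of ink."),
    ([0.0, 1.0], "Unrelated note."),
]
print(build_prompt("What technique is used?", [1.0, 0.0], kb))
```

Gating on a similarity threshold means weakly related passages never reach the model, which is one plausible way such a mechanism could limit hallucinations that come from irrelevant context.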
Model Details
Model Description
Zhihua accepts an image of a traditional Chinese painting and, optionally, free-form text questions, then outputs multilingual narratives that explain artistic techniques, historical context, symbolism, and aesthetic values. A three-stage training pipeline (SFT → DPO → RAG) iteratively aligns the model with expert preferences while mitigating hallucinations. Domain-oriented evaluation metrics (factual correctness, stylistic fidelity, interpretive depth) are proposed to benchmark TCP understanding.
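For readers unfamiliar with the DPO stage, the sketch below computes the standard DPO objective for a single preference pair from sequence log-probabilities. The card does not give Zhihua's exact loss or hyperparameters, so the function name, the argument names, and beta=0.1 are assumptions. In practice this would be computed over batches of tensors in a training framework rather than over plain floats.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a response under the
    trainable policy or the frozen reference model. beta is a hypothetical
    default, not a value taken from the model card.
    """
    # Implicit rewards: how much more the policy likes each response than the reference does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss falls as the policy favors the expert-preferred answer more than the reference does.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
print(dpo_loss(-14.0, -10.0, -12.0, -12.0))
```

The loss is smaller when the policy raises the likelihood of the preferred description relative to the reference model, which is how preference pairs can steer the model toward expert-style appreciation.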
- Developed by: Hangzhou Research Institute of AI and Holographic Technology / Zhejiang University / Hangzhou City University
Uses
Direct Use
- Upload a TCP image and receive an automatic appreciation paragraph in Chinese or English.
- Ask open-ended questions about technique, author, dynasty, symbolism, etc.
- Integrate into museum kiosks or educational apps for AI-guided art tours.
How to Get Started with the Model
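from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load the processor and model from the Hub (the model ships custom code).
processor = AutoProcessor.from_pretrained("zjuqx/Zhihua")
model = AutoModelForVision2Seq.from_pretrained("zjuqx/Zhihua", trust_remote_code=True)

# Open the painting; converting to RGB also handles grayscale and RGBA files.
image = Image.open("path/to/tcp.jpg").convert("RGB")
prompt = "请赏析这幅画。"  # "Please give an appreciation of this painting."

# Encode the image and prompt, generate a narrative, and decode it to text.
inputs = processor(images=image, text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))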