
[Paper] [GitHub]

Model

We used the same Vision Transformer architecture ViT-L/14@336px as CLIP.
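As a quick sanity check on what ViT-L/14@336px implies for downstream use, the sketch below computes the patch grid and the resulting token sequence length. It assumes square, non-overlapping 14x14 patches on a 336x336 input and a single class token, as in CLIP's ViT; the function name `vit_sequence_length` is illustrative and not part of any library.

```python
def vit_sequence_length(image_size: int = 336, patch_size: int = 14,
                        class_token: bool = True) -> tuple[int, int]:
    """Return (patches_per_side, total_tokens) for a square ViT input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # One extra token if the encoder prepends a class token (assumed here).
    return patches_per_side, num_patches + (1 if class_token else 0)


side, tokens = vit_sequence_length()
print(side, tokens)  # 24 577
```

Under these assumptions, each image yields a 24x24 grid of patch tokens, which is the sequence an MLLM connector would typically consume.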


Data

Our model was trained on publicly available image-caption data from the LAION-400M and COYO-700M datasets.

Performance and Limitations

A. MLLMs Evaluation Results

To evaluate MLCD as a vision tower for Multimodal Large Language Models (MLLMs), we replaced the CLIP model in LLaVA-NeXT with the MLCD model and used Qwen2.5-7B as the language model in both configurations. The MLCD variant scores higher than the CLIP baseline on 15 of the 17 benchmarks listed below, including gains of 4.60 points on InfoVQA_val and 3.83 points on AI2D. It scores slightly lower on POPE and TextVQA_val.

| Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:--|:--:|:--:|
| LLM | Qwen2.5-7B | Qwen2.5-7B |
| AI2D | 76.98 | 73.15 |
| ScienceQA_img | 78.09 | 76.35 |
| GQA | 64.17 | 63.31 |
| InfoVQA_val | 43.48 | 38.88 |
| MMBench_cn_dev | 74.83 | 72.51 |
| MMBench_en_dev | 76.37 | 74.57 |
| MME(cognition) | 432 | 384 |
| MME(perception) | 1598 | 1512 |
| SeedBench | 68.20 | 66.80 |
| SeedBench_img | 73.75 | 72.72 |
| MMStar | 50.98 | 48.98 |
| MMMU | 44.30 | 44.20 |
| OCRBench | 531.00 | 525.00 |
| ChartQA | 67.84 | 66.52 |
| DocVQA_val | 76.46 | 75.21 |
| POPE | 88.69 | 88.83 |
| TextVQA_val | 61.69 | 62.47 |
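
The comparison can be tallied directly from the reported numbers. The following is a small sketch with the scores copied from the table above; the variable names are illustrative only.

```python
# (MLCD, CLIP) scores copied from the MLLM evaluation table.
scores = {
    "AI2D": (76.98, 73.15),
    "ScienceQA_img": (78.09, 76.35),
    "GQA": (64.17, 63.31),
    "InfoVQA_val": (43.48, 38.88),
    "MMBench_cn_dev": (74.83, 72.51),
    "MMBench_en_dev": (76.37, 74.57),
    "MME(cognition)": (432, 384),
    "MME(perception)": (1598, 1512),
    "SeedBench": (68.20, 66.80),
    "SeedBench_img": (73.75, 72.72),
    "MMStar": (50.98, 48.98),
    "MMMU": (44.30, 44.20),
    "OCRBench": (531.00, 525.00),
    "ChartQA": (67.84, 66.52),
    "DocVQA_val": (76.46, 75.21),
    "POPE": (88.69, 88.83),
    "TextVQA_val": (61.69, 62.47),
}

# Benchmarks where each vision tower has the higher score.
mlcd_wins = [name for name, (mlcd, clip) in scores.items() if mlcd > clip]
clip_wins = [name for name, (mlcd, clip) in scores.items() if clip > mlcd]

print(f"MLCD higher on {len(mlcd_wins)} of {len(scores)} benchmarks")
print("CLIP higher on:", clip_wins)
```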

B. Linear Probe Evaluation Results

The table below presents linear probe results comparing the CLIP and MLCD models, both using the ViT_L_14_336px architecture, across 25 datasets. A linear probe freezes the pre-trained model's weights and trains only a linear classifier on top of its features, which measures how well the model's representations generalize to different tasks. MLCD scores higher on 19 of the 25 datasets and on the average; CLIP scores higher on MNIST, STL-10, GTSRB, Country211, Hateful Memes, and SST-2.

| Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:--|:--:|:--:|
| AVG | 87.15 | 85.35 |
| Food101 | 96.21 | 95.90 |
| CIFAR-10 | 99.36 | 97.90 |
| CIFAR-100 | 93.69 | 87.40 |
| Birdsnap | 88.18 | 79.90 |
| SUN397 | 87.96 | 82.20 |
| Stanford Cars | 95.16 | 91.50 |
| FGVC Aircraft | 86.38 | 71.60 |
| Describable Textures Dataset | 86.70 | 83.00 |
| Oxford-IIIT Pets | 96.27 | 95.10 |
| Caltech-101 | 97.92 | 96.00 |
| Flowers102 | 99.58 | 99.20 |
| MNIST | 98.67 | 99.20 |
| STL-10 | 99.28 | 99.70 |
| EuroSAT | 99.06 | 98.10 |
| RESISC45 | 95.48 | 94.90 |
| GTSRB | 92.32 | 92.40 |
| KITTI | 75.39 | 69.20 |
| Country211 | 38.12 | 46.40 |
| PatchCamelyon | 88.00 | 85.60 |
| UCF101 | 92.86 | 92.00 |
| Kinetics-700 | 73.35 | 73.00 |
| CLEVR | 64.40 | 60.30 |
| Hateful Memes | 72.00 | 77.30 |
| SST-2 | 76.33 | 80.50 |
| ImageNet | 86.30 | 85.40 |
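
The linear probe protocol described in this section can be illustrated with a minimal, dependency-free sketch. This is not the evaluation code behind the numbers above: the "encoder" here is a hypothetical stand-in (a fixed random projection with tanh) rather than the MLCD vision tower, the data is a toy two-class problem, and the classifier is a plain logistic regression trained by gradient descent. It only shows the structure of the method: features come from frozen weights, and only the linear layer is trained.

```python
import math
import random

random.seed(0)

IN_DIM, FEAT_DIM = 8, 16

# Hypothetical stand-in for a frozen pre-trained encoder: a fixed random
# projection followed by tanh. These weights are never updated below.
ENCODER_W = [[random.gauss(0, 0.3) for _ in range(IN_DIM)] for _ in range(FEAT_DIM)]


def encode(x):
    """Map a raw input vector to frozen features."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in ENCODER_W]


def make_split(n):
    """Toy binary dataset: two Gaussian blobs centred at -1 and +1."""
    xs, ys = [], []
    for _ in range(n):
        label = random.randint(0, 1)
        centre = 1.0 if label else -1.0
        xs.append([random.gauss(centre, 1.0) for _ in range(IN_DIM)])
        ys.append(label)
    return xs, ys


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Full-batch gradient descent on the linear layer only."""
    w, b, n = [0.0] * FEAT_DIM, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * FEAT_DIM, 0.0
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            for i, fi in enumerate(f):
                grad_w[i] += err * fi
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, features, labels):
    correct = sum(
        (sum(wi * fi for wi, fi in zip(w, f)) + b > 0) == bool(y)
        for f, y in zip(features, labels)
    )
    return correct / len(labels)


train_x, train_y = make_split(200)
test_x, test_y = make_split(100)

# Features are computed once from the frozen encoder and then reused.
train_feats = [encode(x) for x in train_x]
test_feats = [encode(x) for x in test_x]

probe_w, probe_b = train_linear_probe(train_feats, train_y)
test_acc = accuracy(probe_w, probe_b, test_feats, test_y)
print(f"linear probe test accuracy: {test_acc:.2f}")
```

In a real evaluation, `encode` would be replaced by a forward pass through the pre-trained vision model with gradients disabled, and the reported metric would be the probe's accuracy on each dataset's held-out split.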

C. Limitations

Higher input resolution generally benefits OCR-related tasks, so this 336px model may be limited on them. We are currently training higher-resolution models and plan to release them soon.

Acknowledgments

We would like to express our gratitude to Xie Yin and Yumeng Wang for their significant contributions to the experimental validation in MLLMs.

Model: DeepGlint-AI/mlcd-vit-large-patch14-336
Format: Safetensors
Model size: 304M params
Tensor type: F32