---
license: mit
datasets:
- laion/laion2B-en
- laion/laion-coco
- laion/laion2B-multi
- kakaobrain/coyo-700m
- conceptual_captions
- wanng/wukong100m
pipeline_tag: image-feature-extraction
---

# InternViT-6B-224px

[\[GitHub\]](https://github.com/OpenGVLab/InternVL) [\[InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[Mini-InternVL\]](https://arxiv.org/abs/2410.16261) [\[InternVL 2.5\]](https://huggingface.co/papers/2412.05271)

[\[Blog\]](https://internvl.github.io/blog/) [\[Chat Demo\]](https://internvl.opengvlab.com/) [\[HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[Quick Start\]](#quick-start) [\[Documents\]](https://internvl.readthedocs.io/en/latest/)

<div align="center">
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
</div>

## Model Details

- **Model Type:** vision foundation model, feature backbone
- **Model Stats:**
  - Params (M): 5903
  - Image size: 224 x 224
- **Pretrain Dataset:** LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, LAION-multi
- **Note:** This model has 48 blocks. We found that the output of the fourth-to-last block works best when building a vision large language model (VLLM). Therefore, when building a VLLM with this model, **please use the features from the fourth-to-last layer.**
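
The layer choice in the note can be illustrated with a minimal sketch. It assumes the model follows the usual `transformers` convention for `output_hidden_states=True`: a tuple holding the embedding output followed by one entry per block. This card does not confirm that convention for this model, and `select_block_output` and the placeholder strings below are hypothetical stand-ins rather than part of the InternViT code.

```python
def select_block_output(hidden_states, blocks_from_end=4):
    """Pick the output of the block that is `blocks_from_end` from the end.

    Assumption: hidden_states[0] is the embedding output and
    hidden_states[i] is the output of block i (1-based).
    """
    num_blocks = len(hidden_states) - 1
    if not 1 <= blocks_from_end <= num_blocks:
        raise ValueError("blocks_from_end must be between 1 and the number of blocks")
    # Negative indexing counts back from the final block's output.
    return hidden_states[-blocks_from_end]


# Placeholder tuple standing in for real tensors: 1 embedding entry + 48 blocks.
fake_hidden_states = ("embeddings",) + tuple(f"block_{i}" for i in range(1, 49))

selected = select_block_output(fake_hidden_states)
print(selected)
```

Under that assumption, index `-4` in a 49-entry tuple corresponds to the output after block 45, which is the fourth-to-last of the 48 blocks.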

## Linear Probing Performance

See this [document](https://github.com/OpenGVLab/InternVL/tree/main/classification#-evaluation) for more details about the linear probing evaluation.

| IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| :---: | :-----: | :---: | :--: | :--: | :-------: |
| 88.2  |  90.4   | 79.9  | 77.5 | 89.8 |   69.1    |

## Model Usage (Image Embeddings)

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the backbone in bfloat16 and move it to the GPU in inference mode.
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

# Load an example image and make sure it has three channels.
image = Image.open('./examples/image1.jpg').convert('RGB')

# The preprocessing configuration is stored alongside the model weights.
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')

# Preprocess, then match the model's dtype and device.
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Gradients are not needed for feature extraction.
with torch.no_grad():
    outputs = model(pixel_values)
```
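
Once embeddings are extracted, a common next step is to compare two images by cosine similarity. The sketch below is only illustrative: it uses short plain Python lists as stand-ins for real embedding vectors, and `cosine_similarity` is a hypothetical helper, not part of the model code. How a single vector per image is obtained from `outputs` is not specified in this card and is left out here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for image embeddings (real ones would be much longer).
emb_a = [1.0, 0.0, 2.0]
emb_b = [2.0, 0.0, 4.0]  # same direction as emb_a
emb_c = [0.0, 3.0, 0.0]  # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)  # close to 1 for parallel vectors
sim_ac = cosine_similarity(emb_a, emb_c)  # 0 for orthogonal vectors
print(sim_ab, sim_ac)
```

With real tensors, the same computation is usually done in a vectorized way, for example by L2-normalizing the embeddings and taking a dot product.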

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{gao2024mini,
  title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
  journal={arXiv preprint arXiv:2410.16261},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}
```