Update README.md
README.md CHANGED
@@ -11,7 +11,9 @@ pipeline_tag: image-text-to-text
## Model

llava-siglip-internlm2-1_8b-v1 is a LLaVA checkpoint finetuned from [internlm2-chat-1_8b](https://huggingface.co/internlm/internlm2-chat-1_8b) and [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) on [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and [LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) with [Xtuner](https://github.com/InternLM/xtuner). The pretraining phase took 5.5 hours on 4 Nvidia RTX 4090 GPUs (see this [intermediate checkpoint](https://huggingface.co/StarCycle/llava-siglip-internlm2-1_8b-pretrain-v1)). The finetuning phase took 16 hours on 4 Nvidia RTX 4090 GPUs.

The total size of the model is around 2.2B parameters, which makes it suitable for embedded applications like robotics. It performs slightly better than [llava-clip-internlm2-1_8b-v1](https://huggingface.co/StarCycle/llava-clip-internlm2-1_8b-v1).

#### By the way, it is also stronger than MiniCPM-V on the MMBench test split.

I have not carefully tuned the hyperparameters during training. If you have any idea to improve the model, please open an issue or send an email to zhuohengli@foxmail.com. You are welcome to contribute!
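For a quick qualitative check of the checkpoint described above, Xtuner's LLaVA chat entry point can be used; a minimal sketch, assuming `xtuner` is installed (`pip install -U xtuner`) and that `example.jpg` is a placeholder for any local image:

```
# Sketch: chat with this LLaVA checkpoint via Xtuner's CLI.
# The LLM, visual encoder, and LLaVA repo ids below match the links
# in the model description; adjust them if your local copies differ.
xtuner chat internlm/internlm2-chat-1_8b \
  --visual-encoder google/siglip-so400m-patch14-384 \
  --llava StarCycle/llava-siglip-internlm2-1_8b-v1 \
  --prompt-template internlm2_chat \
  --image example.jpg
```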
@@ -29,7 +31,13 @@ LLaVA-InternLM2-7B | 73.3 | 74.6 | 71.7 | 72.0 | 42.5
Bunny-3B | 69.2 | 68.6 | - | - | -
MiniCPM-V | 64.1 | 67.9 | 62.6 | 65.3 | 41.4
llava-clip-internlm2-1_8b-v1 | 63.3 | 63.1 | 63.6 | 61.7 | 35.3
llava-siglip-internlm2-1_8b-v1 | 65.7 | 63.5 | 64.5 | 62.9 | 36.3

For the performance on MMBench Test (EN):

![image/png](https://cdn-uploads.huggingface.co/production/uploads/642a298ae5f33939cf3ee600/BYxaG48KXrTXuSKgmoAnS.png)

For the performance on MMBench Test (CN):

![image/png](https://cdn-uploads.huggingface.co/production/uploads/642a298ae5f33939cf3ee600/hGi4bpmEm3l1dJM557yAh.png)
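MMBench numbers like the ones above can in principle be reproduced with Xtuner's MMBench evaluation entry point; a minimal sketch, assuming `xtuner` is installed and `MMBench_DEV_EN.tsv` is a placeholder for a locally downloaded MMBench split:

```
# Sketch: evaluate this checkpoint on an MMBench split with Xtuner.
# --data-path points at the downloaded tsv; results are written
# under --work-dir (a submission file is produced for test splits).
xtuner mmbench internlm/internlm2-chat-1_8b \
  --visual-encoder google/siglip-so400m-patch14-384 \
  --llava StarCycle/llava-siglip-internlm2-1_8b-v1 \
  --prompt-template internlm2_chat \
  --data-path MMBench_DEV_EN.tsv \
  --work-dir ./work_dirs/mmbench
```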
## Installation

```