---
language:
- zh
- en
tags:
- chatglm
- blip2
---

# Model Card for blip2zh-chatglm-6b

## Model Details

### Model Description

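blip2zh-chatglm-6b is a Chinese multimodal chat model trained with blip2, and it has basic image-understanding ability.
Because blip2 training does not fine-tune the language model, the model's behavior in text-only conversations stays consistent with the original chatglm.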

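Note: the model has so far only gone through blip2's two-stage image-text alignment pretraining. It has not been trained on specific downstream tasks such as VQA or instruction tuning, so it can still easily generate unexpected content.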

- **blip2 base model**: [bert-base-chinese](https://huggingface.co/bert-base-chinese)
- **Vision encoder**: eva-clip-vit-g
- **Language model**: [chatglm-6b](https://github.com/THUDM/ChatGLM-6B) at [commit](https://huggingface.co/THUDM/chatglm-6b/commit/9324de70a93207c9a310cf99d5d6261791489691)

### Model Sources

- [**Training Code**](https://github.com/XiPotatonium/LAVIS): blip2 training code, based on [LAVIS](https://github.com/salesforce/LAVIS)
- [**webui**](https://github.com/XiPotatonium/chatbot-webui): a web UI implemented with gradio
- [**api**](https://github.com/XiPotatonium/chatbot-api): an API service implemented with fastapi that can be deployed locally; it also supports several other kinds of locally deployable language models.

## Uses

The model weights include the image encoder, the blip2 module, and chatglm-6b.
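As a rough illustration of how the three components fit together, the sketch below traces `(sequence, feature)` shapes through the pipeline. Every number is an assumption borrowed from the original BLIP-2 defaults (224×224 input, ViT patch size 14, EVA-CLIP ViT-g width 1408, 32 query tokens, Q-Former hidden size 768) plus the chatglm-6b hidden size of 4096. This model's actual configuration may differ, and `PipelineConfig` and `trace_shapes` are hypothetical helpers, not part of the released code.

```python
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    # Hypothetical values mirroring BLIP-2 defaults; NOT read from this model's config.
    image_size: int = 224
    patch_size: int = 14
    vit_width: int = 1408        # assumed EVA-CLIP ViT-g feature width
    num_query_tokens: int = 32   # assumed number of learned Q-Former queries
    qformer_hidden: int = 768    # bert-base hidden size
    lm_hidden: int = 4096        # chatglm-6b hidden size


def trace_shapes(cfg: PipelineConfig, num_text_tokens: int) -> dict:
    """Return the (sequence, feature) shape produced at each stage."""
    num_patches = (cfg.image_size // cfg.patch_size) ** 2
    return {
        # Frozen image encoder: one token per patch plus a [CLS] token.
        "vision_features": (num_patches + 1, cfg.vit_width),
        # Q-Former compresses the image into a fixed number of query outputs.
        "qformer_output": (cfg.num_query_tokens, cfg.qformer_hidden),
        # A linear projection maps the query outputs into the LM embedding space.
        "lm_visual_prefix": (cfg.num_query_tokens, cfg.lm_hidden),
        # Assumed: the visual prefix is prepended to the text embeddings,
        # as in BLIP-2's decoder-only setup.
        "lm_input": (cfg.num_query_tokens + num_text_tokens, cfg.lm_hidden),
    }


if __name__ == "__main__":
    for stage, shape in trace_shapes(PipelineConfig(), num_text_tokens=20).items():
        print(f"{stage}: {shape}")
```

The point of the sketch is that, regardless of image resolution, the language model only ever sees a fixed number of projected query tokens in front of the text.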

For loading the model and running inference, refer to the implementation in [api](https://github.com/XiPotatonium/chatbot-api/blob/main/src/model/blip2chatglm/__init__.py).

See also some [examples](https://github.com/XiPotatonium/chatbot-api/blob/main/examples.ipynb).

## Limitations

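Because the available Chinese datasets are limited, the model's image understanding is still limited, and it may produce irrelevant or incorrect content.
Multi-turn dialogue training and instruction tuning have not been introduced yet, so multi-turn conversations may be disrupted by earlier context.
The model is also bounded by the conversational quality of chatglm-6b itself.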

## Training Details

### Training Data

* [laion-2b-chinese](https://huggingface.co/datasets/IDEA-CCNL/laion2B-multi-chinese-subset): we kept only the 670k image-text pairs with the highest CLIP scores and trained on a sampled subset of them.
* [coco-zh](https://github.com/li-xirong/coco-cn)
* [flickr8k-zh](http://lixirong.net/datasets/flickr8kcn)
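The laion-2b-chinese selection step (keep the highest-CLIP-score pairs, then sample from them) can be sketched as follows. The record layout, the `clip_score` field, the helper name `select_pairs`, and the sampling fraction are all hypothetical; the card states only the 670k figure, not the exact sampling ratio or the tooling used.

```python
import heapq
import random
from typing import Iterable, List, Optional


def select_pairs(
    records: Iterable[dict],
    top_k: int,
    sample_fraction: float,
    seed: Optional[int] = 0,
) -> List[dict]:
    """Keep the top_k records by CLIP score, then randomly sample a fraction of them.

    Each record is assumed (hypothetically) to look like
    {"url": ..., "caption": ..., "clip_score": float}.
    """
    if not 0.0 < sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be in (0, 1]")
    # Step 1: keep the top_k pairs with the highest image-text similarity.
    best = heapq.nlargest(top_k, records, key=lambda r: r["clip_score"])
    # Step 2: draw a reproducible random subset for training.
    rng = random.Random(seed)
    n = round(len(best) * sample_fraction)
    return rng.sample(best, n)


if __name__ == "__main__":
    data = [{"caption": f"caption {i}", "clip_score": i / 10} for i in range(10)]
    subset = select_pairs(data, top_k=6, sample_fraction=0.5)
    print(sorted(r["clip_score"] for r in subset))
```

With the real data, `top_k` would be 670_000; the sample fraction is not given in this card, so any value used here is a placeholder.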

### Training Procedure

Two-stage training following the blip2 method.
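For reference, the two stages of the original BLIP-2 recipe differ mainly in which modules receive gradients and which losses are used. The sketch below encodes that recipe as a Python dictionary; the module names are illustrative labels, and whether this model's training code matches every detail is an assumption. The part consistent with this card is that chatglm-6b is never updated, which is why text-only behavior matches the original model.

```python
# Trainable/frozen modules and losses per BLIP-2 stage.
# Module names are illustrative labels, not identifiers from the training code.
STAGES = {
    "stage1_representation": {
        "trainable": {"qformer"},
        "frozen": {"image_encoder"},
        # Image-text contrastive, image-text matching, image-grounded text generation.
        "losses": ["itc", "itm", "itg"],
    },
    "stage2_generative": {
        "trainable": {"qformer", "lm_projection"},
        "frozen": {"image_encoder", "language_model"},
        # Language-modeling loss on the caption, conditioned on the visual prefix.
        "losses": ["lm"],
    },
}


def frozen_in_all_stages(stages: dict) -> set:
    """Modules that are never updated in any stage (they keep their pretrained weights)."""
    trainable = set().union(*(s["trainable"] for s in stages.values()))
    frozen = set().union(*(s["frozen"] for s in stages.values()))
    return frozen - trainable


if __name__ == "__main__":
    print(sorted(frozen_in_all_stages(STAGES)))  # ['image_encoder', 'language_model']
```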

## Demos

![](imgs/demo1.png)
![](imgs/demo2.png)
![](imgs/demo3.png)