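Model Description

M2-Encoder is a powerful Chinese-English bilingual multimodal model. It is trained on BM-6B, a dataset we built that contains 6 billion image-text pairs (3 billion Chinese + 3 billion English). It supports zero-shot image-text cross-modal retrieval (text-to-image and image-to-text search) as well as zero-shot image classification.

Model performance is as follows: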

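Intended Use and Scope

This model is mainly used for:

Image-to-text or text-to-image retrieval: taking text-to-image retrieval as an example, use M2-Encoder to extract features for every image in the gallery ahead of time. Given a text query, use M2-Encoder to extract the query's features, then compute the similarity between the query features and the stored gallery features.
Zero-shot open-set image classification: given an image and a list of candidate labels, output the label that best matches the image according to the image-label similarity.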
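
The two use cases above both reduce to comparing embedding vectors. The following is a minimal sketch of that comparison step only, assuming the image and text features have already been extracted. The toy 3-dimensional vectors, the helper names (`retrieve_top_k`, `zero_shot_classify`), the use of cosine similarity, and the `temperature` value are illustrative assumptions and are not taken from the M2-Encoder code base.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_scores(query, gallery):
    """Cosine similarity between one query vector and every gallery vector."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(g))) for g in gallery]


def retrieve_top_k(query, gallery, k=1):
    """Return the indices of the k gallery items most similar to the query."""
    scores = cosine_scores(query, gallery)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def zero_shot_classify(image_feat, label_feats, labels, temperature=0.01):
    """Pick the label whose text embedding best matches the image embedding."""
    scores = cosine_scores(image_feat, label_feats)
    # Softmax over temperature-scaled similarities; subtracting the max
    # keeps the exponentials numerically stable.
    logits = [s / temperature for s in scores]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Placeholder embeddings standing in for precomputed encoder outputs
# (hypothetical values, chosen only to make the example easy to follow).
image_gallery = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 0.95]]
text_query = [0.05, 0.9, 0.2]
top = retrieve_top_k(text_query, image_gallery, k=2)

labels = ["cat", "dog", "car"]
label_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_feat = [0.2, 0.1, 0.9]
predicted, probs = zero_shot_classify(image_feat, label_feats, labels)
```

In a real deployment the gallery features would be computed once and stored, so that only the query needs to be encoded at search time. For large galleries, the exhaustive scan in `cosine_scores` would typically be replaced by a vectorized or approximate nearest-neighbor search.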

How to Use

Code Example

# Create a new environment (Python 3.8)
conda create -n m2-encoder python=3.8
conda activate m2-encoder

# Clone the project repository
cd /YourPath/
git clone https://github.com/alipay/Ant-Multi-Modal-Framework

# Install dependencies
cd ./Ant-Multi-Modal-Framework/prj/M2_Encoder/
pip install -r requirements.txt

# Run the demo; the model weights are downloaded automatically via ModelScope
python run.py

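Model Limitations and Possible Biases

The model is trained on a dataset and may therefore exhibit some biases. Users should evaluate it themselves before deciding how to use it.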

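Training Data

BM-6B dataset: contains 6 billion cleaned, high-quality Chinese-English bilingual image-text pairs, with Chinese and English data roughly balanced at 3 billion pairs each. See the technical report for details on how the dataset was collected and constructed.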

Model Training Procedure

Training through the ModelScope interface is not yet supported. Stay tuned.

Training

Not yet supported.

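Evaluation and Results

The model reaches state-of-the-art (SOTA) results on both zero-shot image-text cross-modal retrieval and zero-shot classification.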

Related Paper and Citation

If you find this model helpful, please consider citing the following paper:

@misc{guo2024m2encoder,
      title={M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining}, 
      author={Qingpei Guo and Furong Xu and Hanxiao Zhang and Wang Ren and Ziping Ma and Lin Ju and Jian Wang and Jingdong Chen and Ming Yang},
      year={2024},
      url={https://arxiv.org/abs/2401.15896},
}

