pyx9913 committed on
Commit
aa60bbf
•
1 Parent(s): 4f1e38f

feat: 🎸 add chat model code

Browse files
README.md CHANGED
@@ -1,58 +1,45 @@
1
- ---
2
- language:
3
- - en
4
- - zh
5
- ---
6
- <div align="center">
7
-
8
- **VisCPM**
9
-
10
- **Chinese-English bilingual multi-modal large model series based on CPM (Chinese Pretrained Models) basic model**
11
 
12
  <p align="center">
13
- <a href="https://github.com/OpenBMB/VisCPM">Github</a> โ€ข
14
- <a href="https://huggingface.co/openbmb/VisCPM-Paint">VisCPM-Paint</a>
 
15
  </p>
16
 
17
- </div>
18
 
19
- `VisCPM` is a family of open-source large multimodal models, which support multimodal conversational capabilities (`VisCPM-Chat` model) and text-to-image generation capabilities (`VisCPM-Paint` model) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. VisCPM is trained based on the large language model [CPM-Bee](https://github.com/OpenBMB/CPM-Bee) with 10B parameters, fusing visual encoder (Q-Former) and visual decoder (Diffusion-UNet) to support visual inputs and outputs. Thanks to the good bilingual capability of CPM-Bee, `VisCPM` can be pre-trained with English multimodal data only and generalize well to achieve promising Chinese multimodal capabilities.
20
-
21
- - **๐Ÿ‘ Open-source Usage**: VisCPM is free to be used for personal and research purposes. By open-sourcing the VisCPM model family, we hope to promote the development of the open-source community of large multimodal models and related research.
22
- - **๐ŸŒŸ Image and text generation coverage**: VisCPM models provide relatively comprehensive support for image and text multimodal capabilities, covering both multimodal conversation (image-to-text generation) capabilities and text-to-image generation capabilities.
23
- - **๐Ÿ’ซ Excellent bilingual performance**: Thanks to the excellent bilingual capability of the base language model CPM-Bee, VisCPM achieves outstanding results in both bilingual multimodal conversation and text-to-image generation.
24
 
25
  ## VisCPM-Chat
26
- `VisCPM-Chat` supports bilingual multimodal conversations involving images in both Chinese and English. The model utilizes `Q-Former` as the visual encoder and `CPM-Bee` (10B) as the base LLM. It combines visual and language models and is optimized with the language modeling training objective. The model training consists of two stages: pretraining and instruction tuning.
27
 
28
- * Pretraining: `VisCPM-Chat` is pretrained using approximately 100M high-quality English text-image pairs. The data sources include CC3M, CC12M, COCO, Visual Genome, Laion, etc. In this stage, the language model parameters remain fixed, and only the parameters of the `Q-Former` are updated to enable efficient alignment of vision and language representations.
29
 
30
- * Instruction Tuning: We utilize the [LLaVA-150K](https://llava-vl.github.io/) dataset that contains English multimodal instruction-following data. We mix this data with corresponding translated Chinese data to fine-tune the model and align its multimodal capabilities with user intents. In this stage, we update all model parameters to improve the data efficiency of instruction tuning. Interestingly, we observe that even when using only English instruction data for fine-tuning, the model can well comprehend Chinese questions but can only respond in English. This indicates that the model has achieved good generalization in terms of its multilingual and multimodal capabilities. By incorporating a small amount of translated Chinese data during the instruction tuning stage, we can align the model's response language with the user's question language.
31
 
32
- We evaluate the model on the standard [LLaVA English test set](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and the translated [Chinese test set](data/translated_LLaVA_qa90) from the standard English test set. The evaluation benchmark examines the model's performance in conversation, detailed description, and complex reasoning, and uses GPT-4 for scoring. It can be observed that `VisCPM-Chat` achieves the best average performance in Chinese multimodal capabilities, excelling in conversation and complex reasoning, while also demonstrating good English multimodal capabilities. We provide two versions of the model, namely `VisCPM-Chat-balance` and `VisCPM-Chat-zhplus`. The former has a balanced ability in both English and Chinese, while the latter has a stronger emphasis on Chinese proficiency. Both models use the same data during the instruction tuning stage. `VisCPM-Chat-zhplus` additionally incorporates 20M cleaned native Chinese text-image pairs and 120M translated text-image pairs in Chinese during the pretraining stage.
33
 
34
  <table>
35
  <tr>
36
- <td align="center" rowspan="2" colspan="2">Model</td>
37
- <td align="center" rowspan="2">LLM Backbone</td>
38
- <td align="center" colspan="4">English</td>
39
- <td align="center" colspan="4">Chinese</td>
40
  </tr>
41
  <tr>
42
- <td align="center">Conversation</td>
43
- <td align="center">Detailed Description</td>
44
- <td align="center">Complex Reasoning</td>
45
- <td align="center">Avg</td>
46
- <td align="center">Conversation</td>
47
- <td align="center">Detailed Description</td>
48
- <td align="center">Complex Reasoning</td>
49
- <td align="center">Avg</td>
50
  </tr>
51
  <tr>
52
- <td align="center" rowspan="3">English Model</td>
53
  <td align="center">MiniGPT4</td>
54
- <td align="center">Vicuna-13B</td>
55
- <td align="center">65.0</td>
56
  <td align="center">67.3</td>
57
  <td align="center">76.6</td>
58
  <td align="center">69.7</td>
@@ -63,9 +50,8 @@ We evaluate the model on the standard [LLaVA English test set](https://huggingfa
63
  </tr>
64
  <tr>
65
  <td align="center">InstructBLIP</td>
66
- <td align="center">Vicuna-13B</td>
67
  <td align="center">81.9</td>
68
- <td align="center">68.0</td>
69
  <td align="center">91.2</td>
70
  <td align="center">80.5</td>
71
  <td align="center">-</td>
@@ -75,20 +61,18 @@ We evaluate the model on the standard [LLaVA English test set](https://huggingfa
75
  </tr>
76
  <tr>
77
  <td align="center">LLaVA</td>
78
- <td align="center">Vicuna-13B</td>
79
- <td align="center"><b>89.5</b></td>
80
- <td align="center"><b>70.4</b></td>
81
- <td align="center"><b>96.2</b></td>
82
- <td align="center"><b>85.6</b></td>
83
  <td align="center">-</td>
84
  <td align="center">-</td>
85
  <td align="center">-</td>
86
  <td align="center">-</td>
87
  </tr>
88
  <tr>
89
- <td align="center" rowspan="5">En-Zh Bilingual Model</td>
90
  <td align="center">mPLUG-Owl </td>
91
- <td align="center">LLaMA-7B</td>
92
  <td align="center">64.6</td>
93
  <td align="center">47.7</td>
94
  <td align="center">80.1</td>
@@ -96,61 +80,132 @@ We evaluate the model on the standard [LLaVA English test set](https://huggingfa
96
  <td align="center">76.3</td>
97
  <td align="center">61.2</td>
98
  <td align="center">77.8</td>
99
- <td align="center">72.0</td>
100
  </tr>
101
  <tr>
102
  <td align="center">VisualGLM</td>
103
- <td align="center">ChatGLM-6B</td>
104
  <td align="center">62.4</td>
105
- <td align="center">63.0</td>
106
  <td align="center">80.6</td>
107
  <td align="center">68.7</td>
108
  <td align="center">76.6</td>
109
- <td align="center"><b>87.8</b></td>
110
  <td align="center">83.6</td>
111
  <td align="center">82.7</td>
112
  </tr>
113
  <tr>
114
- <td align="center">Ziya-Visual </td>
115
- <td align="center">Ziya-LLaMA-13B-v1</td>
116
  <td align="center">82.7</td>
117
  <td align="center">69.9</td>
118
  <td align="center">92.1</td>
119
  <td align="center">81.7</td>
120
- <td align="center">85.0</td>
121
  <td align="center">74.7</td>
122
  <td align="center">82.4</td>
123
  <td align="center">80.8</td>
124
  </tr>
125
  <tr>
126
- <td align="center">VisCPM-Chat-balance</td>
127
- <td align="center">CPMBee-10B</td>
128
  <td align="center">83.3</td>
129
  <td align="center">68.9</td>
130
  <td align="center">90.5</td>
131
  <td align="center">81.1</td>
132
- <td align="center"><b>92.7</b></td>
133
  <td align="center">76.1</td>
134
  <td align="center">89.2</td>
135
  <td align="center">86.3</td>
136
  </tr>
137
  <tr>
138
- <td align="center">VisCPM-Chat-zhplus</td>
139
- <td align="center">CPMBee-10B</td>
140
- <td align="center">80.1</td>
141
- <td align="center">65.7</td>
142
- <td align="center">92.5</td>
143
- <td align="center">79.6</td>
144
- <td align="center">90.3</td>
145
- <td align="center">81.4</td>
146
- <td align="center"><b>92.1</b></td>
147
- <td align="center"><b>88.2</b></td>
148
  </tr>
149
  </table>
150
151
 
152
- ## 📝 License
 
 
153
 
154
- VisCPM is governed by the [GML License](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E9%9D%9E%E5%95%86%E4%B8%9A%E5%8C%96.md), and permits individual and research usages. If you intend to utilize the model for commercial purposes, please reach out to cpm@modelbest.cn to negotiate commercial licensing.
 
 
155
 
156
- The CPM-Bee base, governed by the [General Model License (GML)](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E5%95%86%E4%B8%9A%E6%8E%88%E6%9D%83.md), permits commercial usage. If you intend to utilize the model for commercial purposes, please reach out to cpm@modelbest.cn to obtain the certificate of authorization.
1
+ # VisCPM
2
+ 简体中文 | [English](README_en.md)
3
 
4
  <p align="center">
5
+ <p align="left">
6
+ <a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-dfd.svg"></a>
7
+ <a href=""><img src="https://img.shields.io/badge/python-3.8+-aff.svg"></a>
8
  </p>
9
 
10
+ `VisCPM` is a family of open-source large multimodal models, which support multimodal conversational capabilities (`VisCPM-Chat` model) and text-to-image generation capabilities (`VisCPM-Paint` model) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. `VisCPM` is trained based on the large language model [CPM-Bee](https://github.com/OpenBMB/CPM-Bee) with 10B parameters, fusing visual encoder (Q-Former) and visual decoder (Diffusion-UNet) to support visual inputs and outputs. Thanks to the good bilingual capability of CPM-Bee, `VisCPM` can be pre-trained with English multimodal data only and generalize well to achieve promising Chinese multimodal capabilities.
11
 
12
+ `VisCPM`是一个开源的多模态大模型系列，支持中英双语的多模态对话能力（`VisCPM-Chat`模型）和文到图生成能力（`VisCPM-Paint`模型），在中文多模态开源模型中达到最佳水平。`VisCPM`基于百亿参数量语言大模型[CPM-Bee](https://github.com/OpenBMB/CPM-Bee)（10B）训练，融合视觉编码器（`Q-Former`）和视觉解码器（`Diffusion-UNet`）以支持视觉信号的输入和输出。得益于`CPM-Bee`底座优秀的双语能力，`VisCPM`可以仅通过英文多模态数据预训练，泛化实现优秀的中文多模态能力。
 
 
 
 
13
 
14
  ## VisCPM-Chat
15
+ `VisCPM-Chat`支持面向图像进行中英双语多模态对话。该模型使用`Q-Former`作为视觉编码器，使用CPM-Bee（10B）作为语言交互基底模型，并通过语言建模训练目标融合视觉和语言模型。模型训练包括预训练和指令精调两阶段：
16
 
17
+ * 预训练：我们使用约100M高质量英文图文对数据对`VisCPM-Chat`进行了预训练，数据包括CC3M、CC12M、COCO、Visual Genome、Laion等。在预训练阶段，语言模型参数保持固定，仅更新`Q-Former`部分参数，以支持大规模视觉-语言表示的高效对齐。
18
 
19
+ * 指令精调：我们采用[LLaVA-150K](https://llava-vl.github.io/)英文指令精调数据，并混合相应翻译后的中文数据对模型进行指令精调，以对齐模型多模态基础能力和用户使用意图。在指令精调阶段，我们更新全部模型参数，以提升指令精调数据的利用效率。有趣的是，我们发现即使仅采用英文指令数据进行指令精调，模型也可以理解中文问题，但仅能用英文回答。这表明模型的多语言多模态能力已经得到良好的泛化。在指令精调阶段进一步加入少量中文翻译数据，可以将模型回复语言和用户问题语言对齐。
20
 
21
+ 我们在LLaVA英文测试集和翻译的中文测试集对模型进行了评测，该评测基准考察模型在开放域对话、图像细节描述、复杂推理方面的表现，并使用GPT-4进行打分。可以观察到，`VisCPM-Chat`在中文多模态能力方面取得了最佳的平均性能，在通用域对话和复杂推理表现出色，同时也表现出了不错的英文多模态能力。
22
 
23
  <table>
24
  <tr>
25
+ <td align="center" rowspan="2" colspan="2">ๆจกๅž‹</td>
26
+ <td align="center" colspan="4">่‹ฑๆ–‡</td>
27
+ <td align="center" colspan="4">ไธญๆ–‡</td>
 
28
  </tr>
29
  <tr>
30
+ <td align="center">ๅคšๆจกๆ€ๅฏน่ฏ</td>
31
+ <td align="center">็ป†่Š‚ๆ่ฟฐ</td>
32
+ <td align="center">ๅคๆ‚ๆŽจ็†</td>
33
+ <td align="center">ๅนณๅ‡</td>
34
+ <td align="center">ๅคšๆจกๆ€ๅฏน่ฏ</td>
35
+ <td align="center">็ป†่Š‚ๆ่ฟฐ</td>
36
+ <td align="center">ๅคๆ‚ๆŽจ็†</td>
37
+ <td align="center">ๅนณๅ‡</td>
38
  </tr>
39
  <tr>
40
+ <td align="center" rowspan="3">่‹ฑๆ–‡ๆจกๅž‹</td>
41
  <td align="center">MiniGPT4</td>
42
+ <td align="center">65</td>
 
43
  <td align="center">67.3</td>
44
  <td align="center">76.6</td>
45
  <td align="center">69.7</td>
 
50
  </tr>
51
  <tr>
52
  <td align="center">InstructBLIP</td>
 
53
  <td align="center">81.9</td>
54
+ <td align="center">68</td>
55
  <td align="center">91.2</td>
56
  <td align="center">80.5</td>
57
  <td align="center">-</td>
 
61
  </tr>
62
  <tr>
63
  <td align="center">LLaVA</td>
64
+ <td align="center">89.5</td>
65
+ <td align="center">70.4</td>
66
+ <td align="center">96.2</td>
67
+ <td align="center">85.6</td>
 
68
  <td align="center">-</td>
69
  <td align="center">-</td>
70
  <td align="center">-</td>
71
  <td align="center">-</td>
72
  </tr>
73
  <tr>
74
+ <td align="center" rowspan="4">ไธญ่‹ฑๅŒ่ฏญ</td>
75
  <td align="center">mPLUG-Owl </td>
 
76
  <td align="center">64.6</td>
77
  <td align="center">47.7</td>
78
  <td align="center">80.1</td>
 
80
  <td align="center">76.3</td>
81
  <td align="center">61.2</td>
82
  <td align="center">77.8</td>
83
+ <td align="center">72</td>
84
  </tr>
85
  <tr>
86
  <td align="center">VisualGLM</td>
 
87
  <td align="center">62.4</td>
88
+ <td align="center">63</td>
89
  <td align="center">80.6</td>
90
  <td align="center">68.7</td>
91
  <td align="center">76.6</td>
92
+ <td align="center">87.8</td>
93
  <td align="center">83.6</td>
94
  <td align="center">82.7</td>
95
  </tr>
96
  <tr>
97
+ <td align="center">Ziya (LLaMA 13B)</td>
 
98
  <td align="center">82.7</td>
99
  <td align="center">69.9</td>
100
  <td align="center">92.1</td>
101
  <td align="center">81.7</td>
102
+ <td align="center">85</td>
103
  <td align="center">74.7</td>
104
  <td align="center">82.4</td>
105
  <td align="center">80.8</td>
106
  </tr>
107
  <tr>
108
+ <td align="center">VisCPM-Chat</td>
 
109
  <td align="center">83.3</td>
110
  <td align="center">68.9</td>
111
  <td align="center">90.5</td>
112
  <td align="center">81.1</td>
113
+ <td align="center">92.7</td>
114
  <td align="center">76.1</td>
115
  <td align="center">89.2</td>
116
  <td align="center">86.3</td>
117
  </tr>
118
+ </table>
119
+
120
+ ## VisCPM-Paint
121
+ `VisCPM-Paint`支持中英双语的文到图生成。该模型使用CPM-Bee（10B）作为文本编码器，使用`UNet`作为图像解码器，并通过扩散模型训练目标融合语言和视觉模型。在训练过程中，语言模型参数始终保持固定。我们使用[Stable Diffusion 2.1](https://github.com/Stability-AI/stablediffusion)的UNet参数初始化视觉解码器，并通过逐步解冻其中关键的桥接参数将其与语言模型融合：首先训练文本表示映射到视觉模型的线性层，然后进一步解冻`UNet`的交叉注意力层。该模型在[LAION 2B](https://laion.ai/)英文图文对数据上进行了训练。
122
+
123
+ 与`VisCPM-Chat`类似，我们发现得益于CPM-Bee的双语能力，`VisCPM-Paint`可以仅通过英文图文对训练，泛化实现良好的中文文到图生成能力，达到中文开源模型的最佳效果。通过进一步加入20M清洗后的原生中文图文对数据，以及120M翻译到中文的图文对数据，模型的中文文到图生成能力可以获得进一步提升。我们在MSCOCO上采样了3万张图片，计算了FID（Fréchet Inception Distance）和CLIP Score，前者用于评估生成图片的质量，后者用于评估生成图片与输入文本的匹配程度。
124
+
125
+ <table>
126
+ <tr>
127
+ <td align="center" rowspan="2">ๆจกๅž‹</td>
128
+ <td align="center" colspan="2">่‹ฑๆ–‡</td>
129
+ <td align="center" colspan="2">ไธญๆ–‡</td>
130
+ </tr>
131
  <tr>
132
+ <td align="center">FIDโ†“</td>
133
+ <td align="center">CLIP Scoreโ†‘</td>
134
+ <td align="center">FIDโ†“</td>
135
+ <td align="center">CLIP Scoreโ†‘</td>
136
+ </tr>
137
+ <tr>
138
+ <td align="center">AltDiffusion</td>
139
+ <td align="center">17.16</td>
140
+ <td align="center">25.24</td>
141
+ <td align="center">16.09</td>
142
+ <td align="center">24.05</td>
143
+ </tr>
144
+ <tr>
145
+ <td align="center">TaiyiDiffusion</td>
146
+ <td align="center">-</td>
147
+ <td align="center">-</td>
148
+ <td align="center">15.58</td>
149
+ <td align="center">22.69</td>
150
+ </tr>
151
+ <tr>
152
+ <td align="center">Stable Diffusion</td>
153
+ <td align="center">9.08</td>
154
+ <td align="center">26.22</td>
155
+ <td align="center">-</td>
156
+ <td align="center">-</td>
157
+ </tr>
158
+ <tr>
159
+ <td align="center">VisCPM-Paint-en</td>
160
+ <td align="center">9.51</td>
161
+ <td align="center">25.35</td>
162
+ <td align="center">10.86</td>
163
+ <td align="center">23.38</td>
164
+ </tr>
165
+ <tr>
166
+ <td align="center">VisCPM-Paint-zh</td>
167
+ <td align="center">9.98</td>
168
+ <td align="center">25.04</td>
169
+ <td align="center">9.65</td>
170
+ <td align="center">24.17</td>
171
  </tr>
172
  </table>
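
FID is typically computed with a standard Inception-based pipeline; for CLIP Score, the following is a minimal sketch of the usual recipe (100 × cosine similarity between CLIP image and text embeddings) using the Hugging Face CLIP model. It is not code from this repository, and the CLIP checkpoint name and scaling convention are assumptions for illustration.

```python
# Minimal CLIP Score sketch (not part of this repo; checkpoint name is an assumption).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = clip_processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # CLIP Score is commonly reported as 100 * cosine similarity.
    return float(100 * (image_emb * text_emb).sum())

# Example: clip_score(Image.open("generated.png"), "a photo of a hot air balloon")
```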
173
 
174
+ # 安装
175
+
176
+ ```Shell
177
+ conda create -n viscpm python=3.10 -y
178
+ conda activate viscpm
179
+ pip install setuptools
180
+ pip install diffusers jieba matplotlib numpy opencv_python
181
+ pip install pandas Pillow psutil pydantic scipy
182
+ pip install torch==1.13.1 torchscale==0.2.0 torchvision==0.14.1 timm
183
+ pip install transformers==4.28.0
184
+ pip install tqdm typing_extensions
185
+ pip install git+https://github.com/thunlp/OpenDelta.git
186
+ pip install "git+https://github.com/OpenBMB/CPM-Bee.git#egg=cpm-live&subdirectory=src"
187
+ ```
188
+
189
+ VisCPM需要单卡40GB以上显存的GPU运行，我们会尽快更新更加节省显存的推理方式。
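
Until a more memory-friendly path ships, a common stop-gap is to keep the 10B weights in half precision and run inference under `torch.no_grad()`. This is only a sketch under the assumption that the remote-code model behaves like a regular PyTorch module; it is not an officially supported configuration.

```python
# Hedged sketch: reduce GPU memory by keeping weights in fp16 and skipping autograd.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('viscpm', trust_remote_code=True)
model = model.half().to('cuda').eval()   # fp16 weights take roughly half the memory of fp32

# Run inference without autograd bookkeeping; call model.generate(...) inside
# this block exactly as in the usage example below.
with torch.no_grad():
    ...
```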
190
+
191
+ ## 使用
192
 
193
+ ```python
194
+ >>> from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
195
+ >>> from PIL import Image
196
 
197
+ >>> tokenizer = AutoTokenizer.from_pretrained('viscpm', trust_remote_code=True)
198
+ >>> processor = AutoImageProcessor.from_pretrained('viscpm', trust_remote_code=True)
199
+ >>> model = AutoModel.from_pretrained('viscpm', trust_remote_code=True).to('cuda')
200
 
201
+ >>> data = [{
202
+ ...     'context': '',
203
+ ...     'question': 'describe this image in detail.',
204
+ ...     'image': tokenizer.unk_token * model.query_num,
205
+ ...     '<ans>': ''
206
+ ... }]
207
+ >>> image = Image.open('case.jpg')
208
+ >>> result = model.generate(data, tokenizer, processor, image)
209
+ >>> print(result[0]['<ans>'])
210
+ 这幅图片显示了一群热气球在天空中飞行。这些热气球漂浮在不同的地方，包括山脉、城市和乡村地区。
211
+ ```
README_en.md ADDED
@@ -0,0 +1,162 @@
1
+ # VisCPM
2
+ [简体中文](README.md) | English
3
+
4
+ <p align="center">
5
+ <p align="left">
6
+ <a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-dfd.svg"></a>
7
+ <a href=""><img src="https://img.shields.io/badge/python-3.8+-aff.svg"></a>
8
+ </p>
9
+
10
+ `VisCPM` is a family of open-source large multimodal models, which support multimodal conversational capabilities (`VisCPM-Chat` model) and text-to-image generation capabilities (`VisCPM-Paint` model) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. `VisCPM` is trained based on the large language model [CPM-Bee](https://github.com/OpenBMB/CPM-Bee) with 10B parameters, fusing visual encoder (`Q-Former`) and visual decoder (`Diffusion-UNet`) to support visual inputs and outputs. Thanks to the good bilingual capability of `CPM-Bee`, `VisCPM` can be pre-trained with English multimodal data only and generalize well to achieve promising Chinese multimodal capabilities.
11
+
12
+ ## VisCPM-Chat
13
+ `VisCPM-Chat` supports bilingual multimodal conversations involving images in both Chinese and English. The model utilizes `Q-Former` as the visual encoder and CPM-Bee (10B) as the base LLM. It combines visual and language models through language modeling training objectives. The model training consists of two stages: pretraining and instruction fine-tuning.
14
+
15
+ * Pretrain: `VisCPM-Chat` was pretrained using approximately 100 million high-quality English multimodal data pairs. The data sources include CC3M, CC12M, COCO, Visual Genome, Laion, and others. In this stage, the language model parameters remain fixed, and only the parameters of the `Q-Former` are updated to enable efficient alignment of large-scale visual-language representations.
16
+
17
+ * Instruction fine-tuning: We utilized the [LLaVA-150K](https://llava-vl.github.io/) dataset, which consists of English multimodal instruction-following data. We mixed this data with corresponding translated Chinese data to fine-tune the model and align its multimodal capabilities with user intents. In this phase, we updated all model parameters to improve the utilization efficiency of the instruction fine-tuning data. Interestingly, we observed that even when using only English instruction data for fine-tuning, the model can comprehend Chinese questions but can only respond in English. This indicates that the model has achieved good generalization in terms of its multilingual and multimodal capabilities. By incorporating a small amount of translated Chinese data during the instruction fine-tuning phase, we can align the model's response language with the user's question language.
18
+
19
+ We evaluated the model on the LLaVA English test set and the translated Chinese test set. The evaluation benchmark examined the model's performance in open-domain conversations, image detail descriptions, and complex reasoning tasks, using GPT-4 for scoring. It is evident that `VisCPM-Chat` achieved the best average performance in Chinese multimodal capabilities, excelling in general-domain conversations and complex reasoning. It also demonstrated commendable English multimodal abilities.
20
+
21
+ <table>
22
+ <tr>
23
+ <td align="center" rowspan="2" colspan="2">Model</td>
24
+ <td align="center" colspan="4">English</td>
25
+ <td align="center" colspan="4">Chinese</td>
26
+ </tr>
27
+ <tr>
28
+ <td align="center">Conversation</td>
29
+ <td align="center">Detailed Description</td>
30
+ <td align="center">Complex Reasoning</td>
31
+ <td align="center">All</td>
32
+ <td align="center">Conversation</td>
33
+ <td align="center">Detailed Description</td>
34
+ <td align="center">Complex Reasoning</td>
35
+ <td align="center">All</td>
36
+ </tr>
37
+ <tr>
38
+ <td align="center" rowspan="3">English Model</td>
39
+ <td align="center">MiniGPT4</td>
40
+ <td align="center">65</td>
41
+ <td align="center">67.3</td>
42
+ <td align="center">76.6</td>
43
+ <td align="center">69.7</td>
44
+ <td align="center">-</td>
45
+ <td align="center">-</td>
46
+ <td align="center">-</td>
47
+ <td align="center">-</td>
48
+ </tr>
49
+ <tr>
50
+ <td align="center">InstructBLIP</td>
51
+ <td align="center">81.9</td>
52
+ <td align="center">68</td>
53
+ <td align="center">91.2</td>
54
+ <td align="center">80.5</td>
55
+ <td align="center">-</td>
56
+ <td align="center">-</td>
57
+ <td align="center">-</td>
58
+ <td align="center">-</td>
59
+ </tr>
60
+ <tr>
61
+ <td align="center">LLaVA</td>
62
+ <td align="center">89.5</td>
63
+ <td align="center">70.4</td>
64
+ <td align="center">96.2</td>
65
+ <td align="center">85.6</td>
66
+ <td align="center">-</td>
67
+ <td align="center">-</td>
68
+ <td align="center">-</td>
69
+ <td align="center">-</td>
70
+ </tr>
71
+ <tr>
72
+ <td align="center" rowspan="4">En-Zh Bilingual Model</td>
73
+ <td align="center">mPLUG-Owl </td>
74
+ <td align="center">64.6</td>
75
+ <td align="center">47.7</td>
76
+ <td align="center">80.1</td>
77
+ <td align="center">64.2</td>
78
+ <td align="center">76.3</td>
79
+ <td align="center">61.2</td>
80
+ <td align="center">77.8</td>
81
+ <td align="center">72</td>
82
+ </tr>
83
+ <tr>
84
+ <td align="center">VisualGLM</td>
85
+ <td align="center">62.4</td>
86
+ <td align="center">63</td>
87
+ <td align="center">80.6</td>
88
+ <td align="center">68.7</td>
89
+ <td align="center">76.6</td>
90
+ <td align="center">87.8</td>
91
+ <td align="center">83.6</td>
92
+ <td align="center">82.7</td>
93
+ </tr>
94
+ <tr>
95
+ <td align="center">Ziya (LLaMA 13B)</td>
96
+ <td align="center">82.7</td>
97
+ <td align="center">69.9</td>
98
+ <td align="center">92.1</td>
99
+ <td align="center">81.7</td>
100
+ <td align="center">85</td>
101
+ <td align="center">74.7</td>
102
+ <td align="center">82.4</td>
103
+ <td align="center">80.8</td>
104
+ </tr>
105
+ <tr>
106
+ <td align="center">VisCPM-Chat</td>
107
+ <td align="center">83.3</td>
108
+ <td align="center">68.9</td>
109
+ <td align="center">90.5</td>
110
+ <td align="center">81.1</td>
111
+ <td align="center">92.7</td>
112
+ <td align="center">76.1</td>
113
+ <td align="center">89.2</td>
114
+ <td align="center">86.3</td>
115
+ </tr>
116
+ </table>
117
+
118
+ # Install
119
+
120
+ 1. Clone this repository and navigate to source folder
121
+ ```bash
122
+ git clone <github repo URL>
123
+ cd viscpm
124
+ ```
125
+
126
+ 2. Install Package
127
+ ```Shell
128
+ conda create -n viscpm python=3.10 -y
129
+ conda activate viscpm
130
+ pip install setuptools
131
+ pip install diffusers jieba matplotlib numpy opencv_python
132
+ pip install pandas Pillow psutil pydantic scipy
133
+ pip install torch==1.13.1 torchscale==0.2.0 torchvision==0.14.1 timm
134
+ pip install transformers==4.28.0
135
+ pip install tqdm typing_extensions
136
+ pip install git+https://github.com/thunlp/OpenDelta.git
137
+ pip install git+https://github.com/OpenBMB/CPM-Bee.git#egg=cpm-live&subdirectory=src
138
+ ```
139
+
140
+ `VisCPM` requires a GPU with more than 40GB of memory. We will soon release more memory-friendly inference methods.
141
+
142
+ ## How to use
143
+
144
+ ```python
145
+ >>> from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
146
+ >>> from PIL import Image
147
+
148
+ >>> tokenizer = AutoTokenizer.from_pretrained('viscpm', trust_remote_code=True)
149
+ >>> processor = AutoImageProcessor.from_pretrained('viscpm', trust_remote_code=True)
150
+ >>> model = AutoModel.from_pretrained('viscpm', trust_remote_code=True).to('cuda')
151
+
152
+ >>> data = [{
153
+ >>> 'context': '',
154
+ >>> 'question': 'describe this image in detail.',
155
+ >>> 'image': tokenizer.unk_token * model.query_num,
156
+ >>> '<ans>': ''
157
+ >>> }]
158
+ >>> image = Image.open('case.jpg')
159
+ >>> result = model.generate(data, tokenizer, processor, image)
160
+ >>> print(result[0]['<ans>'])
161
+ ่ฟ™ๅน…ๅ›พ็‰‡ๆ˜พ็คบไบ†ไธ€็พค็ƒญๆฐ”็ƒๅœจๅคฉ็ฉบไธญ้ฃž่กŒใ€‚่ฟ™ไบ›็ƒญๆฐ”็ƒๆผ‚ๆตฎๅœจไธๅŒ็š„ๅœฐๆ–น๏ผŒๅŒ…ๆ‹ฌๅฑฑ่„‰ใ€ๅŸŽๅธ‚ๅ’Œไนกๆ‘ๅœฐๅŒบใ€‚
162
+ ```
beit3.py ADDED
@@ -0,0 +1,108 @@
1
+ # --------------------------------------------------------
2
+ # Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks (https://arxiv.org/abs/2208.10442)
3
+ # Github source: https://github.com/microsoft/unilm/tree/master/beit3
4
+ # Copyright (c) 2023 Microsoft
5
+ # Licensed under The MIT License [see LICENSE for details]
6
+ # --------------------------------------------------------'
7
+
8
+ import math
9
+ import torch
10
+ import torch.nn as nn
11
+ from timm.models.layers import trunc_normal_ as __call_trunc_normal_
12
+ from timm.models.registry import register_model
13
+
14
+ from torchscale.model.BEiT3 import BEiT3
15
+ from torchscale.architecture.config import EncoderConfig
16
+
17
+
18
+ def trunc_normal_(tensor, mean=0., std=1.):
19
+ __call_trunc_normal_(tensor, mean=mean, std=std, a=-std, b=std)
20
+
21
+
22
+ def _get_base_config(
23
+ img_size=224, patch_size=16, drop_path_rate=0,
24
+ checkpoint_activations=None, mlp_ratio=4, vocab_size=64010, **kwargs
25
+ ):
26
+ return EncoderConfig(
27
+ img_size=img_size, patch_size=patch_size, vocab_size=vocab_size, multiway=True,
28
+ layernorm_embedding=False, normalize_output=True, no_output_layer=True,
29
+ drop_path_rate=drop_path_rate, encoder_embed_dim=768, encoder_attention_heads=12,
30
+ encoder_ffn_embed_dim=int(768 * mlp_ratio), encoder_layers=12,
31
+ checkpoint_activations=checkpoint_activations,
32
+ )
33
+
34
+
35
+ def _get_large_config(
36
+ img_size=224, patch_size=16, drop_path_rate=0,
37
+ checkpoint_activations=None, mlp_ratio=4, vocab_size=64010, **kwargs
38
+ ):
39
+ return EncoderConfig(
40
+ img_size=img_size, patch_size=patch_size, vocab_size=vocab_size, multiway=True,
41
+ layernorm_embedding=False, normalize_output=True, no_output_layer=True,
42
+ drop_path_rate=drop_path_rate, encoder_embed_dim=1024, encoder_attention_heads=16,
43
+ encoder_ffn_embed_dim=int(1024 * mlp_ratio), encoder_layers=24,
44
+ checkpoint_activations=checkpoint_activations,
45
+ )
46
+
47
+
48
+ class BEiT3Wrapper(nn.Module):
49
+ def __init__(self, args, **kwargs):
50
+ super().__init__()
51
+ self.args = args
52
+ self.beit3 = BEiT3(args)
53
+ self.apply(self._init_weights)
54
+ self.mim_head = nn.Linear(1024, 8192)
55
+ self.num_img_patches = self.beit3.vision_embed.num_position_embeddings()
56
+ self.hidden_size = args.encoder_embed_dim
57
+
58
+ def fix_init_weight(self):
59
+ def rescale(param, layer_id):
60
+ param.div_(math.sqrt(2.0 * layer_id))
61
+
62
+ for layer_id, layer in enumerate(self.blocks):
63
+ rescale(layer.attn.proj.weight.data, layer_id + 1)
64
+ rescale(layer.mlp.fc2.weight.data, layer_id + 1)
65
+
66
+ def get_num_layers(self):
67
+ return self.beit3.encoder.num_layers
68
+
69
+ @torch.jit.ignore
70
+ def no_weight_decay(self):
71
+ return {'pos_embed', 'cls_token', 'beit3.encoder.embed_positions.A.weight', 'beit3.vision_embed.cls_token', 'logit_scale'}
72
+
73
+ def _init_weights(self, m):
74
+ if isinstance(m, nn.Linear):
75
+ trunc_normal_(m.weight, std=.02)
76
+ if isinstance(m, nn.Linear) and m.bias is not None:
77
+ nn.init.constant_(m.bias, 0)
78
+ elif isinstance(m, nn.LayerNorm):
79
+ nn.init.constant_(m.bias, 0)
80
+ nn.init.constant_(m.weight, 1.0)
81
+
82
+ def forward(self, pixel_values, query_embed=None):
83
+ B = pixel_values.size(0)
84
+ dtype = self.beit3.vision_embed.proj.weight.dtype
85
+ pixel_values = pixel_values.to(dtype)
86
+ token_embeddings = self.beit3.vision_embed(pixel_values)
87
+ multiway_split_position = -1
88
+ if query_embed is not None:
89
+ query_embed = torch.stack([query_embed] * B)
90
+ multiway_split_position = token_embeddings.size(1)
91
+ token_embeddings = torch.cat([token_embeddings, query_embed], dim=1)
92
+
93
+ outputs = self.beit3.encoder(
94
+ src_tokens=None,
95
+ token_embeddings=token_embeddings,
96
+ multiway_split_position=multiway_split_position
97
+ )
98
+ vision_hidden_states = outputs["encoder_out"]
99
+ if query_embed is not None:
100
+ vision_hidden_states = vision_hidden_states[:, self.num_img_patches:]
101
+ return vision_hidden_states
102
+
103
+
104
+ @register_model
105
+ def beit3_large_patch16_224(pretrained=False, **kwargs):
106
+ args = _get_large_config(img_size=224, **kwargs)
107
+ model = BEiT3Wrapper(args, **kwargs)
108
+ return model
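
For orientation, the sketch below exercises `BEiT3Wrapper` as defined in this file: an image batch goes in as `pixel_values`, and an optional set of query embeddings is appended so that only the query positions come back as visual features. The tensor shapes, the module name used in the import, and the zero-initialised query tensor are illustrative assumptions, not values used by VisCPM.

```python
# Illustrative sketch only: driving the wrapper defined above (shapes are assumptions).
import torch
from beit3 import beit3_large_patch16_224  # assumes this file is importable as `beit3`

model = beit3_large_patch16_224().eval()    # large config: encoder_embed_dim = 1024

pixel_values = torch.randn(1, 3, 224, 224)  # one 224x224 RGB image
queries = torch.zeros(64, 1024)             # 64 query embeddings, dim = encoder_embed_dim

with torch.no_grad():
    feats = model(pixel_values, query_embed=queries)

# With query_embed given, forward() slices off the image-patch positions and
# returns only the query positions, i.e. roughly (batch, num_queries, hidden).
print(feats.shape)
```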
config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "_name_or_path": "openbmb/viscpmchat-bee-10b",
4
+ "architectures": [
5
+ "VisCpmBeeForCausalLM"
6
+ ],
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_viscpmchatbee.VisCpmChatBeeConfig",
9
+ "AutoModel": "modeling_cpmbee.VisCpmBeeForCausalLM",
10
+ "AutoModelForCausalLM": "modeling_cpmbee.VisCpmBeeForCausalLM"
11
+ },
12
+ "vocab_size": 86583,
13
+ "hidden_size": 4096,
14
+ "dim_ff" : 10240,
15
+ "num_hidden_layers" : 48,
16
+ "num_attention_heads": 32,
17
+ "dim_head" : 128,
18
+ "dropout_p" : 0.0,
19
+ "position_bias_num_buckets" : 256,
20
+ "position_bias_num_segment_buckets": 256,
21
+ "position_bias_max_distance" : 2048,
22
+ "vision_dim": 1024,
23
+ "query_num": 64,
24
+ "eps" : 1e-6,
25
+ "half" : true,
26
+ "model_type": "viscpmchatbee"
27
+ }
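
As a cross-reference between this config and the usage example in the README: the image is fed to the language model as `query_num` placeholder tokens, which the Q-Former replaces with visual query embeddings. The sketch below only reads those fields back; it assumes `viscpm` is the local checkpoint directory, as in the README.

```python
# Sketch: read query_num / vision_dim from the remote-code config above.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained('viscpm', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('viscpm', trust_remote_code=True)

# The README's `'image': tokenizer.unk_token * model.query_num` builds exactly
# this placeholder string: query_num (64) slots later filled with visual queries.
image_placeholder = tokenizer.unk_token * config.query_num
print(config.query_num, config.vision_dim, len(image_placeholder))
```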
configuration_viscpmchatbee.py ADDED
@@ -0,0 +1,133 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 The OpenBMB Team and The HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """ CpmBee model configuration"""
16
+
17
+ from typing import List, Optional, Tuple, Union
18
+
19
+ from transformers.configuration_utils import PretrainedConfig
20
+ from transformers.utils import logging
21
+
22
+
23
+ logger = logging.get_logger(__name__)
24
+
25
+ CPMBEE_PRETRAINED_CONFIG_ARCHIVE_MAP = {
26
+ "openbmb/viscpmchat-bee-10b": "https://huggingface.co/openbmb/VisCPM-Chat/resolve/main/config.json",
27
+ # See all VisCpmBee models at https://huggingface.co/models?filter=viscpmbee
28
+ }
29
+
30
+
31
+ class VisCpmChatBeeConfig(PretrainedConfig):
32
+ r"""
33
+ This is the configuration class to store the configuration of a [`CpmBeeModel`]. It is used to instantiate a
34
+ CPMBee model according to the specified arguments, defining the model architecture. Instantiating a configuration
35
+ with the defaults will yield a similar configuration to that of the CPMBee
36
+ [openbmb/cpm-bee-10b](https://huggingface.co/openbmb/cpm-bee-10b) architecture.
37
+
38
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
39
+ documentation from [`PretrainedConfig`] for more information.
40
+
41
+ Args:
42
+ vocab_size (`int`, *optional*, defaults to 30720):
43
+ Vocabulary size of the CPMBee model. Defines the number of different tokens that can be represented by the
44
+ `input` passed when calling [`CpmBeeModel`].
45
+ hidden_size (`int`, *optional*, defaults to 4096):
46
+ Dimension of the encoder layers.
47
+ num_attention_heads (`int`, *optional*, defaults to 32):
48
+ Number of attention heads in the Transformer encoder.
49
+ dim_head (`int`, *optional*, defaults to 128):
50
+ Dimension of attention heads for each attention layer in the Transformer encoder.
51
+ dim_ff (`int`, *optional*, defaults to 10240):
52
+ Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
53
+ num_hidden_layers (`int`, *optional*, defaults to 48):
54
+ Number of layers of the Transformer encoder.
55
+ dropout_p (`float`, *optional*, defaults to 0.1):
56
+ The dropout probability for all fully connected layers in the embeddings and encoder.
57
+ position_bias_num_buckets (`int`, *optional*, defaults to 512):
58
+ The number of position_bias buckets.
59
+ position_bias_num_segment_buckets (`int`, *optional*, defaults to 32):
60
+ The number of segment buckets.
61
+ position_bias_max_distance (`int`, *optional*, defaults to 2048):
62
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
63
+ just in case (e.g., 512 or 1024 or 2048).
64
+ eps (`float`, *optional*, defaults to 1e-6):
65
+ The epsilon used by the layer normalization layers.
66
+ init_std (`float`, *optional*, defaults to 1.0):
67
+ Initialize parameters with std = init_std.
68
+ use_cache (`bool`, *optional*, defaults to `True`):
69
+ Whether to use cache.
70
+ distance_scale (`float` or `int`, *optional*, defaults to 16):
71
+ Scale the rotary embedding.
72
+ mask_modules (`list` or `tuple`, *optional*, defaults to None):
73
+ Decides which feedforward block or attention block is pruned.
74
+ half (`bool`, *optional*, defaults to `False`):
75
+ Decides the model parameters are half-precision or not.
76
+
77
+ Example:
78
+
79
+ ```python
80
+ >>> from transformers import CpmBeeModel, CpmBeeConfig
81
+
82
+ >>> # Initializing a CPMBee cpm-bee-10b style configuration
83
+ >>> configuration = CpmBeeConfig()
84
+
85
+ >>> # Initializing a model from the cpm-bee-10b style configuration
86
+ >>> model = CpmBeeModel(configuration)
87
+
88
+ >>> # Accessing the model configuration
89
+ >>> configuration = model.config
90
+ ```"""
91
+ model_type = "viscpmchatbee"
92
+
93
+ def __init__(
94
+ self,
95
+ vocab_size: int = 30720,
96
+ hidden_size: int = 4096,
97
+ num_attention_heads: int = 64,
98
+ dim_head: int = 64,
99
+ dim_ff: int = 10240,
100
+ num_hidden_layers: int = 32,
101
+ dropout_p: float = 0.0,
102
+ position_bias_num_buckets: int = 256,
103
+ position_bias_num_segment_buckets: int = 32,
104
+ position_bias_max_distance: int = 2048,
105
+ eps: float = 1e-6,
106
+ init_std: float = 1.0,
107
+ use_cache: bool = True,
108
+ distance_scale: Union[int, float] = 16,
109
+ mask_modules: Optional[Union[List, Tuple]] = None,
110
+ half: bool = False,
111
+ vision_dim: int = 1024,
112
+ query_num: int = 64,
113
+ **kwargs,
114
+ ):
115
+ super().__init__(**kwargs)
116
+ self.position_bias_num_segment_buckets = position_bias_num_segment_buckets
117
+ self.hidden_size = hidden_size
118
+ self.num_attention_heads = num_attention_heads
119
+ self.dim_head = dim_head
120
+ self.dim_ff = dim_ff
121
+ self.num_hidden_layers = num_hidden_layers
122
+ self.position_bias_num_buckets = position_bias_num_buckets
123
+ self.position_bias_max_distance = position_bias_max_distance
124
+ self.dropout_p = dropout_p
125
+ self.eps = eps
126
+ self.use_cache = use_cache
127
+ self.vocab_size = vocab_size
128
+ self.init_std = init_std
129
+ self.distance_scale = distance_scale
130
+ self.half = half
131
+ self.mask_modules = mask_modules
132
+ self.vision_dim = vision_dim
133
+ self.query_num = query_num
feature_extraction_viscpmchatbee.py ADDED
@@ -0,0 +1,17 @@
1
+ import warnings
2
+
3
+ from transformers.utils import logging
4
+ from processing_viscpmchatbee import VisCpmChatBeeImageProcessor
5
+
6
+
7
+ logger = logging.get_logger(__name__)
8
+
9
+
10
+ class VisCpmChatBeeFeatureExtractor(VisCpmChatBeeImageProcessor):
11
+ def __init__(self, *args, **kwargs) -> None:
12
+ warnings.warn(
13
+ "The class VisCpmBeeFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please"
14
+ " use CLIPImageProcessor instead.",
15
+ FutureWarning,
16
+ )
17
+ super().__init__(*args, **kwargs)
generation_config.json ADDED
@@ -0,0 +1,12 @@
1
+ {
2
+ "num_beams": 3,
3
+ "num_beam_groups": 1,
4
+ "do_sample": false,
5
+ "is_constraint_gen_mode": false,
6
+ "is_contrastive_search_gen_mode": false,
7
+ "pad_token_id": 0,
8
+ "eos_token_id": 7,
9
+ "bos_token_id": 6,
10
+ "max_new_tokens": 100,
11
+ "vocab_size": 86583
12
+ }
modeling_cpmbee.py ADDED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "image_processor_type": "VisCpmChatBeeImageProcessor",
3
+ "is_train": false,
4
+ "randaug": false,
5
+ "input_size": 224,
6
+ "interpolation": "bicubic",
7
+ "auto_map": {
8
+ "AutoImageProcessor": "processing_viscpmchatbee.VisCpmChatBeeImageProcessor"
9
+ }
10
+ }
processing_viscpmchatbee.py ADDED
@@ -0,0 +1,428 @@
1
+ import cv2
2
+ import numpy as np
3
+ import torch
4
+ from timm.data.constants import IMAGENET_INCEPTION_MEAN, IMAGENET_INCEPTION_STD
5
+ from timm.data.transforms import RandomResizedCropAndInterpolation
6
+ from torchvision import transforms
7
+ import urllib
8
+ from tqdm import tqdm
9
+ from cpm_live.tokenizers import CPMBeeTokenizer
10
+ from torch.utils.data import default_collate
11
+ from typing import Any, Callable, Dict, List, Optional, Tuple, Union
12
+ from typing_extensions import TypedDict
13
+ from numpy.typing import NDArray
14
+ import importlib.machinery
15
+ import importlib.util
16
+ import types
17
+ import random
18
+ from transformers.image_utils import make_list_of_images
19
+ from transformers.image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
20
+ from transformers import TensorType
21
+ import json
22
+
23
+
24
+ # aug functions
25
+ def identity_func(img):
26
+ return img
27
+
28
+
29
+ def autocontrast_func(img, cutoff=0):
30
+ '''
31
+ same output as PIL.ImageOps.autocontrast
32
+ '''
33
+ n_bins = 256
34
+
35
+ def tune_channel(ch):
36
+ n = ch.size
37
+ cut = cutoff * n // 100
38
+ if cut == 0:
39
+ high, low = ch.max(), ch.min()
40
+ else:
41
+ hist = cv2.calcHist([ch], [0], None, [n_bins], [0, n_bins])
42
+ low = np.argwhere(np.cumsum(hist) > cut)
43
+ low = 0 if low.shape[0] == 0 else low[0]
44
+ high = np.argwhere(np.cumsum(hist[::-1]) > cut)
45
+ high = n_bins - 1 if high.shape[0] == 0 else n_bins - 1 - high[0]
46
+ if high <= low:
47
+ table = np.arange(n_bins)
48
+ else:
49
+ scale = (n_bins - 1) / (high - low)
50
+ table = np.arange(n_bins) * scale - low * scale
51
+ table[table < 0] = 0
52
+ table[table > n_bins - 1] = n_bins - 1
53
+ table = table.clip(0, 255).astype(np.uint8)
54
+ return table[ch]
55
+
56
+ channels = [tune_channel(ch) for ch in cv2.split(img)]
57
+ out = cv2.merge(channels)
58
+ return out
59
+
60
+
61
+ def equalize_func(img):
62
+ '''
63
+ same output as PIL.ImageOps.equalize
64
+ PIL's implementation is different from cv2.equalize
65
+ '''
66
+ n_bins = 256
67
+
68
+ def tune_channel(ch):
69
+ hist = cv2.calcHist([ch], [0], None, [n_bins], [0, n_bins])
70
+ non_zero_hist = hist[hist != 0].reshape(-1)
71
+ step = np.sum(non_zero_hist[:-1]) // (n_bins - 1)
72
+ if step == 0:
73
+ return ch
74
+ n = np.empty_like(hist)
75
+ n[0] = step // 2
76
+ n[1:] = hist[:-1]
77
+ table = (np.cumsum(n) // step).clip(0, 255).astype(np.uint8)
78
+ return table[ch]
79
+
80
+ channels = [tune_channel(ch) for ch in cv2.split(img)]
81
+ out = cv2.merge(channels)
82
+ return out
83
+
84
+
85
+ def rotate_func(img, degree, fill=(0, 0, 0)):
86
+ '''
87
+ like PIL, rotate by degree, not radians
88
+ '''
89
+ H, W = img.shape[0], img.shape[1]
90
+ center = W / 2, H / 2
91
+ M = cv2.getRotationMatrix2D(center, degree, 1)
92
+ out = cv2.warpAffine(img, M, (W, H), borderValue=fill)
93
+ return out
94
+
95
+
96
+ def solarize_func(img, thresh=128):
97
+ '''
98
+ same output as PIL.ImageOps.solarize
99
+ '''
100
+ table = np.array([el if el < thresh else 255 - el for el in range(256)])
101
+ table = table.clip(0, 255).astype(np.uint8)
102
+ out = table[img]
103
+ return out
104
+
105
+
106
+ def color_func(img, factor):
107
+ '''
108
+ same output as PIL.ImageEnhance.Color
109
+ '''
110
+ # implementation according to PIL definition, quite slow
111
+ # degenerate = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)[:, :, np.newaxis]
112
+ # out = blend(degenerate, img, factor)
113
+ # M = (
114
+ # np.eye(3) * factor
115
+ # + np.float32([0.114, 0.587, 0.299]).reshape(3, 1) * (1. - factor)
116
+ # )[np.newaxis, np.newaxis, :]
117
+ M = (
118
+ np.float32([
119
+ [0.886, -0.114, -0.114],
120
+ [-0.587, 0.413, -0.587],
121
+ [-0.299, -0.299, 0.701]]) * factor
122
+ + np.float32([[0.114], [0.587], [0.299]])
123
+ )
124
+ out = np.matmul(img, M).clip(0, 255).astype(np.uint8)
125
+ return out
126
+
127
+
128
+ def contrast_func(img, factor):
129
+ """
130
+ same output as PIL.ImageEnhance.Contrast
131
+ """
132
+ mean = np.sum(np.mean(img, axis=(0, 1)) * np.array([0.114, 0.587, 0.299]))
133
+ table = np.array([(
134
+ el - mean) * factor + mean
135
+ for el in range(256)
136
+ ]).clip(0, 255).astype(np.uint8)
137
+ out = table[img]
138
+ return out
139
+
140
+
141
+ def brightness_func(img, factor):
142
+ '''
143
+ same output as PIL.ImageEnhance.Brightness
144
+ '''
145
+ table = (np.arange(256, dtype=np.float32) * factor).clip(0, 255).astype(np.uint8)
146
+ out = table[img]
147
+ return out
148
+
149
+
150
+ def sharpness_func(img, factor):
151
+ '''
152
+ The differences between this result and PIL are all on the 4 boundaries; the center
153
+ areas are the same
154
+ '''
155
+ kernel = np.ones((3, 3), dtype=np.float32)
156
+ kernel[1][1] = 5
157
+ kernel /= 13
158
+ degenerate = cv2.filter2D(img, -1, kernel)
159
+ if factor == 0.0:
160
+ out = degenerate
161
+ elif factor == 1.0:
162
+ out = img
163
+ else:
164
+ out = img.astype(np.float32)
165
+ degenerate = degenerate.astype(np.float32)[1:-1, 1:-1, :]
166
+ out[1:-1, 1:-1, :] = degenerate + factor * (out[1:-1, 1:-1, :] - degenerate)
167
+ out = out.astype(np.uint8)
168
+ return out
169
+
170
+
171
+ def shear_x_func(img, factor, fill=(0, 0, 0)):
172
+ H, W = img.shape[0], img.shape[1]
173
+ M = np.float32([[1, factor, 0], [0, 1, 0]])
174
+ out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
175
+ return out
176
+
177
+
178
+ def translate_x_func(img, offset, fill=(0, 0, 0)):
179
+ '''
180
+ same output as PIL.Image.transform
181
+ '''
182
+ H, W = img.shape[0], img.shape[1]
183
+ M = np.float32([[1, 0, -offset], [0, 1, 0]])
184
+ out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
185
+ return out
186
+
187
+
188
+ def translate_y_func(img, offset, fill=(0, 0, 0)):
189
+ '''
190
+ same output as PIL.Image.transform
191
+ '''
192
+ H, W = img.shape[0], img.shape[1]
193
+ M = np.float32([[1, 0, 0], [0, 1, -offset]])
194
+ out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
195
+ return out
196
+
197
+
198
+ def posterize_func(img, bits):
199
+ '''
200
+ same output as PIL.ImageOps.posterize
201
+ '''
202
+ out = np.bitwise_and(img, np.uint8(255 << (8 - bits)))
203
+ return out
204
+
205
+
206
+ def shear_y_func(img, factor, fill=(0, 0, 0)):
207
+ H, W = img.shape[0], img.shape[1]
208
+ M = np.float32([[1, 0, 0], [factor, 1, 0]])
209
+ out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
210
+ return out
211
+
212
+
213
+ def cutout_func(img, pad_size, replace=(0, 0, 0)):
214
+ replace = np.array(replace, dtype=np.uint8)
215
+ H, W = img.shape[0], img.shape[1]
216
+ rh, rw = np.random.random(2)
217
+ pad_size = pad_size // 2
218
+ ch, cw = int(rh * H), int(rw * W)
219
+ x1, x2 = max(ch - pad_size, 0), min(ch + pad_size, H)
220
+ y1, y2 = max(cw - pad_size, 0), min(cw + pad_size, W)
221
+ out = img.copy()
222
+ out[x1:x2, y1:y2, :] = replace
223
+ return out
224
+
225
+
226
+ # level to args
227
+ def enhance_level_to_args(MAX_LEVEL):
228
+ def level_to_args(level):
229
+ return ((level / MAX_LEVEL) * 1.8 + 0.1,)
230
+ return level_to_args
231
+
232
+
233
+ def shear_level_to_args(MAX_LEVEL, replace_value):
234
+ def level_to_args(level):
235
+ level = (level / MAX_LEVEL) * 0.3
236
+ if np.random.random() > 0.5:
237
+ level = -level
238
+ return (level, replace_value)
239
+
240
+ return level_to_args
241
+
242
+
243
+ def translate_level_to_args(translate_const, MAX_LEVEL, replace_value):
244
+ def level_to_args(level):
245
+ level = (level / MAX_LEVEL) * float(translate_const)
246
+ if np.random.random() > 0.5:
247
+ level = -level
248
+ return (level, replace_value)
249
+
250
+ return level_to_args
251
+
252
+
253
+ def cutout_level_to_args(cutout_const, MAX_LEVEL, replace_value):
254
+ def level_to_args(level):
255
+ level = int((level / MAX_LEVEL) * cutout_const)
256
+ return (level, replace_value)
257
+
258
+ return level_to_args
259
+
260
+
261
+ def solarize_level_to_args(MAX_LEVEL):
262
+ def level_to_args(level):
263
+ level = int((level / MAX_LEVEL) * 256)
264
+ return (level, )
265
+ return level_to_args
266
+
267
+
268
+ def none_level_to_args(level):
269
+ return ()
270
+
271
+
272
+ def posterize_level_to_args(MAX_LEVEL):
273
+ def level_to_args(level):
274
+ level = int((level / MAX_LEVEL) * 4)
275
+ return (level, )
276
+ return level_to_args
277
+
278
+
279
+ def rotate_level_to_args(MAX_LEVEL, replace_value):
280
+ def level_to_args(level):
281
+ level = (level / MAX_LEVEL) * 30
282
+ if np.random.random() < 0.5:
283
+ level = -level
284
+ return (level, replace_value)
285
+
286
+ return level_to_args
287
+
288
+
289
+ func_dict = {
290
+ 'Identity': identity_func,
291
+ 'AutoContrast': autocontrast_func,
292
+ 'Equalize': equalize_func,
293
+ 'Rotate': rotate_func,
294
+ 'Solarize': solarize_func,
295
+ 'Color': color_func,
296
+ 'Contrast': contrast_func,
297
+ 'Brightness': brightness_func,
298
+ 'Sharpness': sharpness_func,
299
+ 'ShearX': shear_x_func,
300
+ 'TranslateX': translate_x_func,
301
+ 'TranslateY': translate_y_func,
302
+ 'Posterize': posterize_func,
303
+ 'ShearY': shear_y_func,
304
+ }
305
+
306
+ translate_const = 10
307
+ MAX_LEVEL = 10
308
+ replace_value = (128, 128, 128)
309
+ arg_dict = {
310
+ 'Identity': none_level_to_args,
311
+ 'AutoContrast': none_level_to_args,
312
+ 'Equalize': none_level_to_args,
313
+ 'Rotate': rotate_level_to_args(MAX_LEVEL, replace_value),
314
+ 'Solarize': solarize_level_to_args(MAX_LEVEL),
315
+ 'Color': enhance_level_to_args(MAX_LEVEL),
316
+ 'Contrast': enhance_level_to_args(MAX_LEVEL),
317
+ 'Brightness': enhance_level_to_args(MAX_LEVEL),
318
+ 'Sharpness': enhance_level_to_args(MAX_LEVEL),
319
+ 'ShearX': shear_level_to_args(MAX_LEVEL, replace_value),
320
+ 'TranslateX': translate_level_to_args(
321
+ translate_const, MAX_LEVEL, replace_value
322
+ ),
323
+ 'TranslateY': translate_level_to_args(
324
+ translate_const, MAX_LEVEL, replace_value
325
+ ),
326
+ 'Posterize': posterize_level_to_args(MAX_LEVEL),
327
+ 'ShearY': shear_level_to_args(MAX_LEVEL, replace_value),
328
+ }
329
+
330
+
331
+ class RandomAugment(object):
332
+
333
+ def __init__(self, N=2, M=10, isPIL=False, augs=[]):
334
+ self.N = N
335
+ self.M = M
336
+ self.isPIL = isPIL
337
+ if augs:
338
+ self.augs = augs
339
+ else:
340
+ self.augs = list(arg_dict.keys())
341
+
342
+ def get_random_ops(self):
343
+ sampled_ops = np.random.choice(self.augs, self.N)
344
+ return [(op, 0.5, self.M) for op in sampled_ops]
345
+
346
+ def __call__(self, img):
347
+ if self.isPIL:
348
+ img = np.array(img)
349
+ ops = self.get_random_ops()
350
+ for name, prob, level in ops:
351
+ if np.random.random() > prob:
352
+ continue
353
+ args = arg_dict[name](level)
354
+ img = func_dict[name](img, *args)
355
+ return img
356
+
357
+
358
+ def build_transform(is_train, randaug=True, input_size=224, interpolation='bicubic'):
359
+ if is_train:
360
+ t = [
361
+ RandomResizedCropAndInterpolation(
362
+ input_size, scale=(0.5, 1.0), interpolation=transforms.InterpolationMode.BICUBIC),
363
+ transforms.RandomHorizontalFlip(),
364
+ ]
365
+ if randaug:
366
+ t.append(
367
+ RandomAugment(
368
+ 2, 7, isPIL=True,
369
+ augs=[
370
+ 'Identity', 'AutoContrast', 'Equalize', 'Brightness', 'Sharpness',
371
+ 'ShearX', 'ShearY', 'TranslateX', 'TranslateY', 'Rotate',
372
+ ]))
373
+ t += [
374
+ transforms.ToTensor(),
375
+ transforms.Normalize(mean=IMAGENET_INCEPTION_MEAN, std=IMAGENET_INCEPTION_STD),
376
+ ]
377
+ t = transforms.Compose(t)
378
+ else:
379
+ t = transforms.Compose([
380
+ transforms.Resize((input_size, input_size),
381
+ interpolation=transforms.InterpolationMode.BICUBIC),
382
+ transforms.ToTensor(),
383
+ transforms.Normalize(mean=IMAGENET_INCEPTION_MEAN, std=IMAGENET_INCEPTION_STD)
384
+ ])
385
+
386
+ return t
387
+
388
+
389
+ class VisCpmChatBeeImageProcessor(BaseImageProcessor):
390
+ def __init__(self, is_train, randaug=True, input_size=224, interpolation='bicubic', **kwargs):
391
+ super().__init__(**kwargs)
392
+ self.is_train = is_train
393
+ self.randaug = randaug
394
+ self.input_size = input_size
395
+ self.interpolation = interpolation
396
+ self._transform = build_transform(is_train, randaug=randaug, input_size=input_size, interpolation=interpolation)
397
+
398
+ def preprocess(self, images, return_tensors: Optional[Union[str, TensorType]] = None, **kwargs) -> BatchFeature:
399
+ images = make_list_of_images(images)
400
+ images = [self._transform(image) for image in images]
401
+ images = torch.tensor([image.numpy() for image in images])
402
+
403
+ data = {"pixel_values": images}
404
+ return BatchFeature(data=data, tensor_type=return_tensors)
405
+
406
+ def to_json_string(self) -> str:
407
+ """
408
+ Serializes this instance to a JSON string.
409
+
410
+ Returns:
411
+ `str`: String containing all the attributes that make up this feature_extractor instance in JSON format.
412
+ """
413
+ dictionary = self.to_dict()
414
+
415
+ for key, value in dictionary.items():
416
+ if isinstance(value, np.ndarray):
417
+ dictionary[key] = value.tolist()
418
+
419
+ # make sure private name "_processor_class" is correctly
420
+ # saved as "processor_class"
421
+ _processor_class = dictionary.pop("_processor_class", None)
422
+ if _processor_class is not None:
423
+ dictionary["processor_class"] = _processor_class
424
+ _transform = dictionary.pop("_transform", None)
425
+ if _transform is not None:
426
+ dictionary["_transform"] = str(type(_transform))
427
+
428
+ return json.dumps(dictionary, indent=2, sort_keys=True) + "\n"
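
A small usage sketch for the image processor defined above: in evaluation mode it resizes to `input_size`, converts to a tensor and normalises with the Inception statistics, returning a `pixel_values` batch. `case.jpg` is the sample file name from the README and stands in for any local image.

```python
# Usage sketch for VisCpmChatBeeImageProcessor as defined above.
from PIL import Image
from processing_viscpmchatbee import VisCpmChatBeeImageProcessor

processor = VisCpmChatBeeImageProcessor(is_train=False)   # eval transform: resize 224 + normalize
batch = processor.preprocess(Image.open("case.jpg"), return_tensors="pt")
print(batch["pixel_values"].shape)                        # torch.Size([1, 3, 224, 224])
```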
tokenization_viscpmchatbee.py ADDED
@@ -0,0 +1,1007 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 The OpenBMB Team and The HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """Tokenization classes for CpmBee."""
16
+ import json
17
+ import os
18
+ from typing import Any, Dict, List, Optional, Tuple, Union
19
+
20
+ import numpy as np
21
+ from numpy.typing import NDArray
22
+ from typing_extensions import TypedDict
23
+
24
+ from transformers.tokenization_utils import PaddingStrategy, PreTrainedTokenizer, TensorType
25
+ from transformers.tokenization_utils_base import AddedToken, BatchEncoding, TextInput, TruncationStrategy
26
+ from transformers.utils import logging
27
+
28
+
29
+ logger = logging.get_logger(__name__)
30
+
31
+ VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
32
+
33
+ PRETRAINED_VOCAB_FILES_MAP = {
34
+ "vocab_file": {
35
+ "openbmb/viscpmchat-bee-10b": "https://huggingface.co/openbmb/VisCPM-Chat/blob/main/vocab.txt",
36
+ },
37
+ }
38
+
39
+ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
40
+ "openbmb/viscpmchat-bee-10b": 4096,
41
+ }
42
+
43
+
44
+ class _PrevExtTableStates(TypedDict):
45
+ ext_table: Dict[int, str]
46
+ token_id_table: Dict[str, Dict[int, int]]
47
+
48
+
49
+ CPMBeeInputType = Union[str, Dict[str, "CPMBeeInputType"]]
50
+
51
+
52
+ def rel_to_bucket(n_up: int, n_down: int, max_depth: int = 8):
53
+ ret = n_up * max_depth + n_down
54
+ if ret == 0:
55
+ return ret
56
+ else:
57
+ # bucket 1 is reserved for incontext samples
58
+ return ret + 1
59
+
60
+
61
+ class _DictTree(TypedDict):
62
+ value: str
63
+ children: List["_DictTree"]
64
+ depth: int
65
+ segment_id: int
66
+ need_predict: bool
67
+ is_image: bool
68
+
69
+
70
+ class VisCpmChatBeeTokenizer(PreTrainedTokenizer):
71
+ """
72
+ Construct a CPMBee tokenizer.
73
+
74
+ Args:
75
+ vocab_file (`str`):
76
+ Path to the vocabulary file.
77
+ bos_token (`str`, *optional*, defaults to `"<s>"`):
78
+ The beginning of sequence token.
79
+ eos_token (`str`, *optional*, defaults to `"</s>"`):
80
+ The end of sequence token.
81
+ line_token (`str`, *optional*, defaults to `"\n"`):
82
+ The line token.
83
+ space_token (`str`, *optional*, defaults to `" "`):
84
+ The space token.
85
+ unk_token (`str`, *optional*, defaults to `"<unk>"`):
86
+ The unknown token.
87
+ mask_token (`str`, *optional*, defaults to `"<mask>"`):
88
+ The mask token.
89
+ pad_token (`str`, *optional*, defaults to `"<pad>"`):
90
+ The token used for padding.
91
+ padding_side (`str`, *optional*, defaults to `"left"`):
92
+ The padding side. CPM-Bee will use left padding by default.
93
+ """
94
+
95
+ vocab_files_names = VOCAB_FILES_NAMES
96
+ pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
97
+ max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
98
+ model_input_names: List[str] = [
99
+ "input_ids",
100
+ "attention_mask",
101
+ "input_id_sub",
102
+ "position",
103
+ "context",
104
+ "sample_ids",
105
+ "num_segments",
106
+ "segment",
107
+ "segment_rel_offset",
108
+ "segment_rel",
109
+ ]
110
+ add_prefix_space = False
111
+
112
+ def __init__(
113
+ self,
114
+ vocab_file,
115
+ bos_token="<s>",
116
+ eos_token="</s>",
117
+ line_token="\n",
118
+ space_token=" ",
119
+ unk_token="<unk>",
120
+ mask_token="<mask>",
121
+ pad_token="<pad>",
122
+ padding_side="left",
123
+ **kwargs,
124
+ ):
125
+ super().__init__(
126
+ bos_token=bos_token,
127
+ eos_token=eos_token,
128
+ line_token=line_token,
129
+ space_token=space_token,
130
+ unk_token=unk_token,
131
+ mask_token=mask_token,
132
+ pad_token=pad_token,
133
+ padding_side=padding_side,
134
+ **kwargs,
135
+ )
136
+
137
+ self.encoder: Dict[str, int] = {}
138
+
139
+ with open(vocab_file, "r", encoding="utf-8") as reader:
140
+ for token in reader.readlines():
141
+ token = token.rstrip("\n")
142
+ if len(token) == 0:
143
+ continue
144
+ self.encoder[token] = len(self.encoder)
145
+
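+ # remap the vocab's escaped whitespace entries ("</_>" and "</n>") to the literal space and newline characters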
146
+ self.encoder[" "] = self.encoder["</_>"]
147
+ self.encoder["\n"] = self.encoder["</n>"]
148
+ del self.encoder["</_>"]
149
+ del self.encoder["</n>"]
150
+
151
+ self.decoder = {v: k for k, v in self.encoder.items()}
152
+
153
+ self._max_word_len = max([len(x) for x in self.encoder.keys()])
154
+ self.cpmbee_special_tokens = {k: v for k, v in self.encoder.items() if k.startswith("<") and k.endswith(">")}
155
+
156
+ self.ext_table: Dict[int, str] = {}
157
+ self.ext_table_rev: Dict[str, int] = {}
158
+
159
+ self.token_id_table: Dict[str, Dict[int, int]] = {}
160
+ self.ext_special_tokens = []
161
+
162
+ self.ext_args_for_model = [
163
+ "input_id_subs",
164
+ "input_pos",
165
+ "context",
166
+ "segment_ids",
167
+ "segment_rel_offset",
168
+ "segment_rel",
169
+ "sample_ids",
170
+ "num_segments",
171
+ "predict_segments",
172
+ "answer_placeholders",
173
+ "ext_table",
174
+ "token_id_table",
175
+ "image_bound"
176
+ ]
177
+
178
+ @property
179
+ def bod_token_id(self):
180
+ return self.encoder[self.bod_token]
181
+
182
+ @property
183
+ def eod_token_id(self):
184
+ return self.encoder[self.eod_token]
185
+
186
+ @property
187
+ def newline_id(self):
188
+ return self.encoder[self.line_token]
189
+
190
+ @property
191
+ def vocab_size(self) -> int:
192
+ return len(self.encoder)
193
+
194
+ def __len__(self):
195
+ """
196
+ Size of the full vocabulary with the added tokens.
197
+ """
198
+ return self.vocab_size + len(self.added_tokens_encoder)
199
+
200
+ def get_vocab(self):
201
+ return dict(self.encoder, **self.added_tokens_encoder)
202
+
203
+ def get_piece(self, text: str) -> str:
204
+ """
205
+ Match with maximum length.
206
+ """
207
+ len_text = len(text)
208
+ for i in range(len(text)):
209
+ sub = text[: len_text - i]
210
+ if (sub in self.encoder) or (sub in self.added_tokens_encoder):
211
+ return sub
212
+ return text[0]
213
+
214
+ def tokenize(self, text: TextInput, **kwargs) -> List[str]:
215
+ r"""
216
+ Override the `tokenize` to meet the needs of CPMBee:
217
+ 1. Mark the special token with `<` and `>`. The `<>` will be ignored.
218
+ 2. Split sentences by the marked special tokens.
219
+ 3. Record the marked special token by `ext_table` and `ext_table_rev`.
220
+ 4. Tokenize the sentence without special tokens.
221
+ """
222
+ for_cpmbee = kwargs.get("for_cpmbee", False)
223
+ all_special_tokens_extended = {
224
+ str(t): t for t in self.all_special_tokens_extended if isinstance(t, AddedToken)
225
+ }
226
+
227
+ sentence_split = [""]
228
+ is_special_token = False
229
+ for i, c in enumerate(text):
230
+ if is_special_token:
231
+ if c == "<":
232
+ tail = sentence_split.pop(-1)
233
+ sentence_split[-1] += tail
234
+ sentence_split.append(c)
235
+ elif c == ">":
236
+ # end of special token
237
+ sentence_split[-1] += c
238
+ if sentence_split[-1] == "<>":
239
+ continue
240
+ is_special_token = False
241
+ sentence_split.append("")
242
+ else:
243
+ sentence_split[-1] += c
244
+ else:
245
+ if c == "<":
246
+ is_special_token = True
247
+ sentence_split.append(c)
248
+ else:
249
+ sentence_split[-1] += c
250
+ if is_special_token:
251
+ tail = sentence_split.pop(-1)
252
+ sentence_split[-1] += tail
253
+
254
+ output_tokens = []
255
+ for i, part in enumerate(sentence_split):
256
+ if (i & 1) == 1:
257
+ # special token
258
+ output_tokens.append(part)
259
+ if for_cpmbee and (part not in self.encoder) and (part not in self.ext_table_rev):
260
+ self.ext_table_rev[part] = len(self.ext_table_rev) + self.vocab_size
261
+ self.ext_table[self.ext_table_rev[part]] = part
262
+ else:
263
+ output_tokens.extend(self._tokenize(part, for_cpmbee=for_cpmbee))
264
+
265
+ # drop spaces
266
+ for i, token in enumerate(output_tokens):
267
+ if token in self.added_tokens_encoder:
268
+ token = all_special_tokens_extended.get(token, None)
269
+ left = output_tokens[i - 1] if i > 0 else None
270
+ right = output_tokens[i + 1] if i < len(output_tokens) - 1 else None
271
+ if isinstance(token, AddedToken):
272
+ if token.rstrip and right:
273
+ # A bit counter-intuitive but we strip the left of the string
274
+ # since tok_extended.rstrip means the special token is eating all white spaces on its right
275
+ output_tokens[i + 1] = right.lstrip()
276
+ # Strip white spaces on the left
277
+ if token.lstrip and left:
278
+ output_tokens[i - 1] = left.rstrip() # Opposite here
279
+ else:
280
+ if right:
281
+ output_tokens[i + 1] = right.lstrip()
282
+ if left:
283
+ output_tokens[i - 1] = left.rstrip()
284
+
285
+ skipped_tokens = []
286
+ for token in output_tokens:
287
+ if not token:
288
+ continue
289
+ else:
290
+ skipped_tokens.append(token)
291
+
292
+ return skipped_tokens
293
+
294
+ def _tokenize(self, text, **kwargs):
295
+ """
296
+ Converts a string into a sequence of tokens (strings), using the tokenizer. Splits into words for a word-based
297
+ vocabulary.
298
+
299
+ Do NOT take care of added tokens. Record the unk tokens and special tokens in `ext_table` and `ext_table_rev`.
300
+ """
301
+ for_cpmbee = kwargs.get("for_cpmbee", False)
302
+ output_tokens = []
303
+
304
+ part_st = 0
305
+ last_unk = None
306
+ while part_st < len(text):
307
+ piece = self.get_piece(text[part_st:])
308
+ if (piece in self.encoder) or (piece in self.added_tokens_encoder):
309
+ if last_unk is None:
310
+ output_tokens.append(piece)
311
+ else:
312
+ if for_cpmbee and (last_unk not in self.ext_table_rev):
313
+ self.ext_table_rev[last_unk] = len(self.ext_table_rev) + self.vocab_size
314
+ self.ext_table[self.ext_table_rev[last_unk]] = last_unk
315
+ output_tokens.append(last_unk)
316
+ output_tokens.append(piece)
317
+ last_unk = None
318
+ else:
319
+ if last_unk is None:
320
+ last_unk = piece
321
+ else:
322
+ last_unk += piece
323
+ part_st += len(piece)
324
+ if last_unk is not None:
325
+ # part end with UNK
326
+ if for_cpmbee and (last_unk not in self.ext_table_rev):
327
+ self.ext_table_rev[last_unk] = len(self.ext_table_rev) + self.vocab_size
328
+ self.ext_table[self.ext_table_rev[last_unk]] = last_unk
329
+ output_tokens.append(last_unk)
330
+
331
+ return output_tokens
332
+
333
+ def check(self, token):
334
+ return token in self.encoder
335
+
336
+ def convert_tokens_to_string(self, tokens: List[str]) -> str:
337
+ return "".join(tokens)
338
+
339
+ def _convert_token_to_id(self, token: str):
340
+ """Converts a token (str) in an id using the vocab and ext_table."""
341
+ if token in self.encoder:
342
+ return self.encoder.get(token)
343
+ elif token in self.ext_table_rev:
344
+ return self.ext_table_rev[token]
345
+ elif token in self.added_tokens_encoder:
346
+ return self.added_tokens_encoder[token]
347
+ else:
348
+ return self.unk_token_id
349
+
350
+ def _convert_id_to_token(self, index):
351
+ """Converts an index (integer) in a token (str) using the vocab and ext_table."""
352
+ if index in self.ext_table:
353
+ return self.ext_table[index]
354
+ elif index in self.added_tokens_decoder:
355
+ return self.added_tokens_decoder[index]
356
+ else:
357
+ if index >= 0:
358
+ return self.decoder[index]
359
+
360
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
361
+ if os.path.isdir(save_directory):
362
+ vocab_file = os.path.join(
363
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
364
+ )
365
+ else:
366
+ vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory
367
+ index = 0
368
+ self.encoder["</n>"] = self.encoder["\n"]
369
+ del self.encoder["\n"]
370
+ self.encoder["</_>"] = self.encoder[" "]
371
+ del self.encoder[" "]
372
+ with open(vocab_file, "w", encoding="utf-8") as writer:
373
+ for token, token_index in sorted(self.encoder.items(), key=lambda x: x[1]):
374
+ if index != token_index:
375
+ logger.warning(
376
+ f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
377
+ " Please check that the vocabulary is not corrupted!"
378
+ )
379
+ index = token_index
380
+ writer.write(token + "\n")
381
+ index += 1
382
+ return (vocab_file,)
383
+
384
+ def __call__(self, text, *args, **kwargs):
385
+ r"""
386
+ CPMBee `call` method will use `_tokenize_cpmbee` when the input type is dict.
387
+ """
388
+ if isinstance(text, dict):
389
+ return self._batch_tokenize_cpmbee([text], *args, **kwargs)
390
+ elif isinstance(text, (list, tuple)):
391
+ if isinstance(text[0], dict):
392
+ return self._batch_tokenize_cpmbee(text, *args, **kwargs)
393
+ else:
394
+ return super().__call__(text, *args, **kwargs)
395
+ else:
396
+ return super().__call__(text, *args, **kwargs)
397
+
398
+ # tokenize
399
+ def _tokenize_cpmbee(self, data: TextInput, *args, **kwargs) -> List[str]:
400
+ """
401
+ A tokenize method to process dict data. Exclusive for CPMBee.
402
+ """
403
+ if isinstance(data, str):
404
+ data = json.loads(data)
405
+ if not isinstance(data, Dict):
406
+ raise TypeError(
407
+ "CpmBeeTokenizer input data should be dict or str in dict format, but got {}".format(type(data))
408
+ )
409
+
410
+ # 1. prepare answer placeholder
411
+ answer_placeholders = []
412
+
413
+ def _put_placeholder(data: Any, path: List[str] = []):
414
+ if isinstance(data, dict):
415
+ ret = {}
416
+ for k, v in data.items():
417
+ ret[k] = _put_placeholder(v, path + [k])
418
+ return ret
419
+ else:
420
+ answer_placeholders.append(path)
421
+ return "<ans_{}>".format(len(answer_placeholders))
422
+
423
+ data["<ans>"] = _put_placeholder(data["<ans>"])
424
+
425
+ (
426
+ input_ids,
427
+ input_id_subs,
428
+ context,
429
+ segment_ids,
430
+ segment_rel,
431
+ n_segments,
432
+ table_states,
433
+ image_bound
434
+ ) = self.convert_data_to_id(data, shuffle_answer=False, max_depth=8)
435
+
436
+ # <ans> mapping from sub to id
437
+ sub_ans_map: Dict[int, int] = {}
438
+ for fake_id, token_sub in table_states["token_id_table"]["<ans>"].items():
439
+ token = table_states["ext_table"][fake_id]
440
+ if token.startswith("<ans_") and token.endswith(">"):
441
+ ans_id = int(token[5:-1])
442
+ sub_ans_map[token_sub] = ans_id
443
+
444
+ tmp_input_ids = []
445
+ tmp_input_sub = []
446
+ tmp_input_seg = []
447
+
448
+ # get predict segments
449
+ predict_segments: List[Tuple[int, int]] = []
450
+ for i in range(input_ids.shape[0]):
451
+ if context[i] == 0:
452
+ if input_ids[i] == self.encoder["<ans>"]:
453
+ # is ans
454
+ # (segment_id, ans_id)
455
+ predict_segments.append((segment_ids[i], sub_ans_map[input_id_subs[i]]))
456
+ else:
457
+ tmp_input_ids.append(input_ids[i])
458
+ tmp_input_sub.append(input_id_subs[i])
459
+ tmp_input_seg.append(segment_ids[i])
460
+
461
+ if len(predict_segments) == 0:
462
+ raise ValueError("No answer to predict")
463
+
464
+ input_ids = np.array(tmp_input_ids, dtype=np.int32) # all context
465
+ input_id_subs = np.array(tmp_input_sub, dtype=np.int32) # [0, 0, 0, 0, 1, 0, 0, 2, 0, ...]
466
+ context = np.full_like(tmp_input_ids, 1, dtype=np.int8) # [1, 1, 1, ...]
467
+ segment_ids = np.array(tmp_input_seg, dtype=np.int32) # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, ...]
468
+ sample_ids = np.zeros(input_ids.shape, dtype=np.int32) # [0, 0, 0, 0, ...]
469
+ segment_rel_offset = np.zeros(input_ids.shape, dtype=np.int32) # [0, 0, 0, ...]
470
+ num_segments = np.full(input_ids.shape, n_segments, dtype=np.int32) # [n_seg, n_seg, n_seg, ...]
471
+ input_pos = np.arange(input_ids.shape[0], dtype=np.int32) # [0, 1, 2, 3, 4, ...]
472
+ image_bound = np.array(image_bound)
473
+
474
+ return (
475
+ self.prepare_for_model(
476
+ input_ids.tolist(),
477
+ input_id_subs=input_id_subs.tolist(),
478
+ input_pos=input_pos.tolist(),
479
+ context=context.tolist(),
480
+ segment_ids=segment_ids.tolist(),
481
+ segment_rel_offset=segment_rel_offset.tolist(),
482
+ segment_rel=segment_rel.tolist(),
483
+ sample_ids=sample_ids.tolist(),
484
+ num_segments=num_segments.tolist(),
485
+ image_bound=image_bound,
486
+ **kwargs,
487
+ ),
488
+ predict_segments,
489
+ answer_placeholders,
490
+ table_states["ext_table"],
491
+ table_states["token_id_table"],
492
+ )
493
+
494
+ def _batch_tokenize_cpmbee(self, data_lst, *args, **kwargs):
495
+ """
496
+ Batched version of `_tokenize_cpmbee`.
497
+ """
498
+ device = kwargs.get("device", "cpu")
499
+ return_tensors = kwargs.get("return_tensors", None)
500
+ batch_outputs = {}
501
+ segment_rel_pack = []
502
+ other_info = []
503
+
504
+ batch_ext_table_map: Dict[Tuple[int, int], int] = {}
505
+ batch_ext_table_ids: List[int] = []
506
+ batch_ext_table_sub: List[int] = []
507
+
508
+ for data in data_lst:
509
+ self.ext_table = {}
510
+ self.ext_table_rev = {}
511
+ self.token_id_table = {}
512
+ (outputs, predict_segments, answer_placeholders, ext_table, token_id_table) = self._tokenize_cpmbee(
513
+ data,
514
+ truncation=None,
515
+ padding=PaddingStrategy.DO_NOT_PAD.value,
516
+ max_length=None,
517
+ pad_to_multiple_of=None,
518
+ return_attention_mask=False,
519
+ return_tensors=None,
520
+ )
521
+ rev_ext_table = {}
522
+ for token, mp in token_id_table.items():
523
+ if token == "<ans>":
524
+ continue
525
+ token_id = self.encoder[token]
526
+ for fake_id, token_sub in mp.items():
527
+ if token_sub > 0:
528
+ if (token_id, token_sub) not in batch_ext_table_map:
529
+ batch_ext_table_map[(token_id, token_sub)] = len(batch_ext_table_ids) + self.vocab_size
530
+ batch_ext_table_ids.append(token_id)
531
+ batch_ext_table_sub.append(token_sub)
532
+ rev_ext_table[batch_ext_table_map[(token_id, token_sub)]] = ext_table[fake_id]
533
+ else:
534
+ rev_ext_table[token_id] = ext_table[fake_id]
535
+
536
+ segment_rel_pack.append(np.array(outputs.pop("segment_rel")))
537
+ other_info.append(
538
+ {
539
+ "predict_segments": predict_segments,
540
+ "answer_placeholders": answer_placeholders,
541
+ "ext_table": rev_ext_table,
542
+ }
543
+ )
544
+
545
+ for key, value in outputs.items():
546
+ if key not in batch_outputs:
547
+ batch_outputs[key] = []
548
+ batch_outputs[key].append(value)
549
+
550
+ max_length = max([len(item) for item in batch_outputs[self.model_input_names[0]]])
551
+ batch_size = len(batch_outputs[self.model_input_names[0]])
552
+ for i in range(batch_size):
553
+ inputs = {k: v[i] for k, v in batch_outputs.items()}
554
+
555
+ for k, v in inputs.items():
556
+ required_input = v
557
+
558
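+ # left-pad every field except image_bound to the batch max length, using pad_token_id as the fill value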
+ needs_to_be_padded = len(required_input) != max_length and k != 'image_bound'
559
+
560
+ if needs_to_be_padded:
561
+ difference = max_length - len(required_input)
562
+ batch_outputs[k][i] = [self.pad_token_id] * difference + required_input
563
+
564
+ max_num_rels = 0
565
+ for rel in segment_rel_pack:
566
+ max_num_rels = max(max_num_rels, rel.shape[0])
567
+ padded_rels = np.zeros((len(segment_rel_pack), max_num_rels), dtype=np.int32)
568
+ for i, rel in enumerate(segment_rel_pack):
569
+ padded_rels[i, : rel.shape[0]] = rel
570
+ batch_outputs["segment_rel"] = padded_rels
571
+ batch_outputs["batch_ext_table_ids"] = np.array(batch_ext_table_ids, dtype=np.int32)
572
+ batch_outputs["batch_ext_table_sub"] = np.array(batch_ext_table_sub, dtype=np.int32)
573
+ batch_outputs = BatchEncoding(batch_outputs, tensor_type=return_tensors)
574
+ if return_tensors == "pt":
575
+ batch_outputs = batch_outputs.to(device=device)
576
+ batch_outputs["other_info"] = other_info
577
+
578
+ return batch_outputs
579
+
580
+ def convert_data_to_id(
581
+ self,
582
+ data: Any,
583
+ prev_ext_states: Optional[_PrevExtTableStates] = None,
584
+ shuffle_answer: bool = True,
585
+ max_depth: int = 8,
586
+ ):
587
+ """
588
+ Parse a dict to data ids. Exclusive for CPMBee. It will
589
+ 1. parse the dict into segments and build segment_rel, which is used to compute the position bias.
590
+ 2. tokenize every segment.
591
+ """
592
+ root: _DictTree = {
593
+ "value": "<root>",
594
+ "children": [],
595
+ "depth": 0,
596
+ "segment_id": 0,
597
+ "need_predict": False,
598
+ "is_image": False
599
+ }
600
+
601
+ segments = [root]
602
+
603
+ def _build_dict_tree(data: CPMBeeInputType, depth: int, need_predict: bool, is_image: bool) -> List[_DictTree]:
604
+ if isinstance(data, dict):
605
+ ret_list: List[_DictTree] = []
606
+ curr_items = list(data.items())
607
+ if need_predict and shuffle_answer:
608
+ access_idx = np.arange(len(curr_items))
609
+ np.random.shuffle(access_idx)
610
+ curr_items = [curr_items[idx] for idx in access_idx]
611
+ for k, v in curr_items:
612
+ child_info: _DictTree = {
613
+ "value": k,
614
+ "children": [],
615
+ "depth": depth,
616
+ "segment_id": len(segments),
617
+ "need_predict": False, # only leaves are contexts
618
+ "is_image": False,
619
+ }
620
+ segments.append(child_info)
621
+ child_info["children"] = _build_dict_tree(
622
+ v, depth + 1,
623
+ need_predict=need_predict or (depth == 1 and k == "<ans>"),
624
+ is_image=is_image or (depth == 1 and k == "image")
625
+ ) # elements in <root>.<ans>
626
+
627
+ ret_list.append(child_info)
628
+ return ret_list
629
+ else:
630
+ assert isinstance(data, str), "Invalid data {}".format(data)
631
+ ret: _DictTree = {
632
+ "value": data,
633
+ "children": [],
634
+ "depth": depth,
635
+ "segment_id": len(segments),
636
+ "need_predict": need_predict,
637
+ "is_image": is_image,
638
+ }
639
+ segments.append(ret)
640
+ return [ret]
641
+
642
+ root["children"] = _build_dict_tree(data, 1, False, False)
643
+
644
+ num_segments = len(segments)
645
+ segment_rel = np.zeros((num_segments * num_segments,), dtype=np.int32)
646
+
647
+ def _build_segment_rel(node: _DictTree) -> List[Tuple[int, int]]:
648
+ ret: List[Tuple[int, int]] = [(node["segment_id"], node["depth"])]
649
+ for child in node["children"]:
650
+ sub = _build_segment_rel(child)
651
+ for seg_id_1, depth_1 in sub:
652
+ for seg_id_2, depth_2 in ret:
653
+ n_up = min(depth_1 - node["depth"], max_depth - 1)
654
+ n_down = min(depth_2 - node["depth"], max_depth - 1)
655
+ segment_rel[seg_id_1 * num_segments + seg_id_2] = rel_to_bucket(
656
+ n_up, n_down, max_depth=max_depth
657
+ )
658
+ segment_rel[seg_id_2 * num_segments + seg_id_1] = rel_to_bucket(
659
+ n_down, n_up, max_depth=max_depth
660
+ )
661
+ ret.extend(sub)
662
+ return ret
663
+
664
+ _build_segment_rel(root)
665
+
666
+ input_ids: List[int] = []
667
+ input_id_subs: List[int] = []
668
+ segment_bound: List[Tuple[int, int]] = []
669
+ image_bound: List[Tuple[int, int]] = []
670
+
671
+
672
+ if prev_ext_states is not None:
673
+ self.ext_table = prev_ext_states["ext_table"]
674
+ self.token_id_table = prev_ext_states["token_id_table"]
675
+
676
+ for seg in segments:
677
+ # tokenize
678
+ tokens = self.convert_tokens_to_ids(self.tokenize(seg["value"], for_cpmbee=True))
679
+
680
+ token_id_subs = []
681
+ reid_token_ids = []
682
+ for idx in tokens:
683
+ if idx in self.ext_table:
684
+ # unk or special token
685
+ token = self.ext_table[idx]
686
+ if token.startswith("<") and token.endswith(">"):
687
+ # special token
688
+ if "_" in token:
689
+ token_name = token[1:-1].split("_", maxsplit=1)[0]
690
+ else:
691
+ token_name = token[1:-1]
692
+ token_name = "<{}>".format(token_name)
693
+ else:
694
+ token_name = "<unk>"
695
+
696
+ if token_name not in self.token_id_table:
697
+ self.token_id_table[token_name] = {}
698
+ if idx not in self.token_id_table[token_name]:
699
+ self.token_id_table[token_name][idx] = len(self.token_id_table[token_name])
700
+ if token_name not in self.encoder:
701
+ raise ValueError("Invalid token {}".format(token))
702
+ reid_token_ids.append(self.encoder[token_name])
703
+ token_id_subs.append(self.token_id_table[token_name][idx])
704
+ else:
705
+ reid_token_ids.append(idx)
706
+ token_id_subs.append(0)
707
+ tokens = [self.bos_token_id] + reid_token_ids
708
+ token_id_subs = [0] + token_id_subs
709
+ # eos_id marks a segment that does not need prediction
710
+ if not seg["need_predict"]: # eos
711
+ tokens = tokens + [self.eos_token_id]
712
+ token_id_subs = token_id_subs + [0]
713
+ else:
714
+ # no eos
715
+ pass
716
+ begin = len(input_ids)
717
+ input_ids.extend(tokens)
718
+ input_id_subs.extend(token_id_subs)
719
+ end = len(input_ids)
720
+ segment_bound.append((begin, end))
721
+
722
+ ids = np.array(input_ids, dtype=np.int32)
723
+ id_subs = np.array(input_id_subs, dtype=np.int32)
724
+ segs = np.zeros((ids.shape[0],), dtype=np.int32)  # number each position by its segment according to segment_bound
725
+ context = np.zeros((ids.shape[0],), dtype=np.int8)
726
+ for i, (begin, end) in enumerate(segment_bound):
727
+ if not segments[i]["need_predict"]:
728
+ context[begin:end] = 1
729
+ if segments[i]["is_image"]:
730
+ image_bound.append((begin + 1, end - 1))
731
+ segs[begin:end] = i
732
+
733
+ curr_ext_table_states: _PrevExtTableStates = {
734
+ "ext_table": self.ext_table,
735
+ "token_id_table": self.token_id_table,
736
+ }
737
+ image_bound = np.array(image_bound, dtype=np.int32)
738
+ return ids, id_subs, context, segs, segment_rel, num_segments, curr_ext_table_states, image_bound
739
+
740
+ def prepare_for_model(
741
+ self,
742
+ ids: List[int],
743
+ pair_ids: Optional[List[int]] = None,
744
+ add_special_tokens: bool = True,
745
+ padding: Union[bool, str, PaddingStrategy] = False,
746
+ truncation: Union[bool, str, TruncationStrategy] = None,
747
+ max_length: Optional[int] = None,
748
+ stride: int = 0,
749
+ pad_to_multiple_of: Optional[int] = None,
750
+ return_tensors: Optional[Union[str, TensorType]] = None,
751
+ return_token_type_ids: Optional[bool] = None,
752
+ return_attention_mask: Optional[bool] = None,
753
+ return_overflowing_tokens: bool = False,
754
+ return_special_tokens_mask: bool = False,
755
+ return_length: bool = False,
756
+ verbose: bool = True,
757
+ prepend_batch_axis: bool = False,
758
+ **kwargs,
759
+ ) -> BatchEncoding:
760
+ """
761
+ Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model. It
762
+ adds special tokens, truncates sequences if overflowing while taking into account the special tokens and
763
+ manages a moving window (with user defined stride) for overflowing tokens. Please Note, for *pair_ids*
764
+ different than `None` and *truncation_strategy = longest_first* or `True`, it is not possible to return
765
+ overflowing tokens. Such a combination of arguments will raise an error.
766
+
767
+ Args:
768
+ ids (`List[int]`):
769
+ Tokenized input ids of the first sequence. Can be obtained from a string by chaining the `tokenize` and
770
+ `convert_tokens_to_ids` methods.
771
+ pair_ids (`List[int]`, *optional*):
772
+ Tokenized input ids of the second sequence. Can be obtained from a string by chaining the `tokenize`
773
+ and `convert_tokens_to_ids` methods.
774
+ """
775
+
776
+ # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
777
+ padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
778
+ padding=padding,
779
+ truncation=truncation,
780
+ max_length=max_length,
781
+ pad_to_multiple_of=pad_to_multiple_of,
782
+ verbose=verbose,
783
+ **kwargs,
784
+ )
785
+
786
+ pair = bool(pair_ids is not None)
787
+ len_ids = len(ids)
788
+ len_pair_ids = len(pair_ids) if pair else 0
789
+
790
+ if return_token_type_ids and not add_special_tokens:
791
+ raise ValueError(
792
+ "Asking to return token_type_ids while setting add_special_tokens to False "
793
+ "results in an undefined behavior. Please set add_special_tokens to True or "
794
+ "set return_token_type_ids to None."
795
+ )
796
+
797
+ if (
798
+ return_overflowing_tokens
799
+ and truncation_strategy == TruncationStrategy.LONGEST_FIRST
800
+ and pair_ids is not None
801
+ ):
802
+ raise ValueError(
803
+ "Not possible to return overflowing tokens for pair of sequences with the "
804
+ "`longest_first`. Please select another truncation strategy than `longest_first`, "
805
+ "for instance `only_second` or `only_first`."
806
+ )
807
+
808
+ # Load from model defaults
809
+ if return_token_type_ids is None:
810
+ return_token_type_ids = "token_type_ids" in self.model_input_names
811
+ if return_attention_mask is None:
812
+ return_attention_mask = "attention_mask" in self.model_input_names
813
+
814
+ encoded_inputs = {}
815
+
816
+ # Compute the total size of the returned encodings
817
+ total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)
818
+
819
+ # Truncation: Handle max sequence length
820
+ overflowing_tokens = []
821
+ if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE and max_length and total_len > max_length:
822
+ ids, pair_ids, overflowing_tokens = self.truncate_sequences(
823
+ ids,
824
+ pair_ids=pair_ids,
825
+ num_tokens_to_remove=total_len - max_length,
826
+ truncation_strategy=truncation_strategy,
827
+ stride=stride,
828
+ )
829
+
830
+ if return_overflowing_tokens:
831
+ encoded_inputs["overflowing_tokens"] = overflowing_tokens
832
+ encoded_inputs["num_truncated_tokens"] = total_len - max_length
833
+
834
+ # Add special tokens
835
+ if add_special_tokens:
836
+ sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
837
+ token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
838
+ else:
839
+ sequence = ids + pair_ids if pair else ids
840
+ token_type_ids = [0] * len(ids) + ([0] * len(pair_ids) if pair else [])
841
+
842
+ # Build output dictionary
843
+ encoded_inputs["input_ids"] = sequence
844
+ if return_token_type_ids:
845
+ encoded_inputs["token_type_ids"] = token_type_ids
846
+ if return_special_tokens_mask:
847
+ if add_special_tokens:
848
+ encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
849
+ else:
850
+ encoded_inputs["special_tokens_mask"] = [0] * len(sequence)
851
+
852
+ # Check lengths
853
+ self._eventual_warn_about_too_long_sequence(encoded_inputs["input_ids"], max_length, verbose)
854
+
855
+ # Padding
856
+ if padding_strategy != PaddingStrategy.DO_NOT_PAD or return_attention_mask:
857
+ encoded_inputs = self.pad(
858
+ encoded_inputs,
859
+ max_length=max_length,
860
+ padding=padding_strategy.value,
861
+ pad_to_multiple_of=pad_to_multiple_of,
862
+ return_attention_mask=return_attention_mask,
863
+ )
864
+
865
+ if return_length:
866
+ encoded_inputs["length"] = len(encoded_inputs["input_ids"])
867
+
868
+ # for CPMBee, encode all the model arguments
869
+ for arg in self.ext_args_for_model:
870
+ v = kwargs.get(arg, None)
871
+ if v is not None:
872
+ encoded_inputs[arg] = v
873
+
874
+ batch_outputs = BatchEncoding(
875
+ encoded_inputs, tensor_type=return_tensors, prepend_batch_axis=prepend_batch_axis
876
+ )
877
+
878
+ return batch_outputs
879
+
880
+ def prepare_for_finetune(
881
+ self,
882
+ data_list: List[Dict],
883
+ max_length: int = 2048
884
+ ):
885
+ _inputs: List[NDArray[np.int32]] = []
886
+ _inputs_sub: List[NDArray[np.int32]] = []
887
+ _context: List[NDArray[np.int8]] = []
888
+ _sample_ids: List[NDArray[np.int32]] = []
889
+ _segments: List[NDArray[np.int32]] = []
890
+ _num_segments: List[NDArray[np.int32]] = []
891
+ _segment_rel_offset: List[NDArray[np.int32]] = []
892
+ _segment_rel: List[NDArray[np.int32]] = []
893
+ _spans: List[List[int]] = []
894
+ _raw_data: List[List[Any]] = []
895
+
896
+ raw_data = {}
897
+ for data in data_list:
898
+ (
899
+ input_ids,
900
+ input_id_subs,
901
+ context,
902
+ segment_ids,
903
+ segment_rel,
904
+ n_segments,
905
+ _,
+ _
906
+ ) = self.convert_data_to_id(data)
907
+
908
+ input_ids = input_ids[: max_length]
909
+ context = context[: max_length]
910
+ segment_ids = segment_ids[: max_length]
911
+ raw_data["input"] = data
912
+ raw_data["samples"] = []
913
+
914
+ sample_ids = np.zeros(input_ids.shape, dtype=np.int32)
915
+ segment_rel_offset = np.zeros(input_ids.shape, dtype=np.int32)
916
+ num_segments = np.full(input_ids.shape, n_segments, dtype=np.int32)
917
+
918
+ _inputs.append(input_ids)
919
+ _inputs_sub.append(input_id_subs)
920
+ _context.append(context)
921
+ _sample_ids.append(sample_ids)
922
+ _segments.append(segment_ids)
923
+ _num_segments.append(num_segments)
924
+ _segment_rel_offset.append(segment_rel_offset)
925
+ _segment_rel.append(segment_rel)
926
+ _spans.append([input_ids.shape[0]])
927
+ _raw_data.append([raw_data])
928
+
929
+ batch_size = len(_inputs)
930
+ inputs = np.zeros((batch_size, max_length), dtype=np.int32)
931
+ inputs_sub = np.zeros((batch_size, max_length), dtype=np.int32)
932
+ context = np.zeros((batch_size, max_length), dtype=np.int8)
933
+ sample_ids = np.zeros((batch_size, max_length), dtype=np.int32)
934
+ segments = np.zeros((batch_size, max_length), dtype=np.int32)
935
+ num_segments = np.zeros((batch_size, max_length), dtype=np.int32)
936
+ segment_rel_offset = np.zeros((batch_size, max_length), dtype=np.int32)
937
+ tgt = np.full((batch_size, max_length), -100, dtype=np.int32)
938
+
939
+ max_rel = 0
940
+ for i in range(batch_size):
941
+ max_rel = max(max_rel, _segment_rel[i].shape[0])
942
+ segment_rel = np.zeros((batch_size, max_rel), dtype=np.int32)
943
+ spans = np.zeros((batch_size, max_length), dtype=np.int32)
944
+ length = np.zeros((batch_size,), dtype=np.int32)
945
+
946
+ batch_ext_table_map: Dict[Tuple[int, int], int] = {}
947
+ batch_ext_table_ids: List[int] = []
948
+ batch_ext_table_sub: List[int] = []
949
+ raw_data_list: List[Any] = []
950
+
951
+ for i in range(batch_size):
952
+ instance_length = _inputs[i].shape[0]
953
+ rel_size = _segment_rel[i].shape[0]
954
+ inputs[i, :instance_length] = _inputs[i]
955
+ inputs_sub[i, :instance_length] = _inputs_sub[i]
956
+ context[i, :instance_length] = _context[i]
957
+ sample_ids[i, :instance_length] = _sample_ids[i]
958
+ segments[i, :instance_length] = _segments[i]
959
+ num_segments[i, :instance_length] = _num_segments[i]
960
+ segment_rel_offset[i, :instance_length] = _segment_rel_offset[i]
961
+ segment_rel[i, :rel_size] = _segment_rel[i]
962
+
963
+ span_begin = 0
964
+ for span_id, span_end in enumerate(_spans[i]):
965
+ spans[i, span_begin:span_end] = span_id
966
+ span_begin = span_end
967
+ length[i] = instance_length
968
+ raw_data_list.extend(_raw_data[i])
969
+
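+ # build next-token labels: position j-1 is trained to predict token j wherever context == 0; other positions stay -100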
970
+ for j in range(instance_length):
971
+ idx, idx_sub = _inputs[i][j], _inputs_sub[i][j]
972
+ tgt_idx = idx
973
+ if idx_sub > 0:
974
+ # need to be in ext table
975
+ if (idx, idx_sub) not in batch_ext_table_map:
976
+ batch_ext_table_map[(idx, idx_sub)] = len(batch_ext_table_map)
977
+ batch_ext_table_ids.append(idx)
978
+ batch_ext_table_sub.append(idx_sub)
979
+ tgt_idx = batch_ext_table_map[(idx, idx_sub)] + self.vocab_size
980
+ if j > 1 and context[i, j - 1] == 0:
981
+ if idx != self.bos_token_id:
982
+ tgt[i, j - 1] = tgt_idx
983
+ else:
984
+ tgt[i, j - 1] = self.eos_token_id
985
+ if context[i, instance_length - 1] == 0:
986
+ tgt[i, instance_length - 1] = self.eos_token_id
987
+
988
+ if len(batch_ext_table_map) == 0:
989
+ # placeholder
990
+ batch_ext_table_ids.append(0)
991
+ batch_ext_table_sub.append(1)
992
+
993
+ return BatchEncoding({
994
+ "input_ids": inputs,
995
+ "input_id_sub": inputs_sub,
996
+ "length": length,
997
+ "context": context > 0,
998
+ "sample_ids": sample_ids,
999
+ "num_segments": num_segments,
1000
+ "segment": segments,
1001
+ "segment_rel_offset": segment_rel_offset,
1002
+ "segment_rel": segment_rel,
1003
+ "span": spans,
1004
+ "labels": tgt,
1005
+ "ext_table_ids": np.array(batch_ext_table_ids, dtype=np.int32),
1006
+ "ext_table_sub": np.array(batch_ext_table_sub, dtype=np.int32)
1007
+ }, tensor_type="pt")
tokenizer_config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "name_or_path": "openbmb/viscpmchat-bee-10b",
3
+ "tokenizer_class": "VisCpmChatBeeTokenizer",
4
+ "auto_map": {
5
+ "AutoTokenizer": [
6
+ "tokenization_viscpmchatbee.VisCpmChatBeeTokenizer",
7
+ null
8
+ ]
9
+ }
10
+ }
utils.py ADDED
@@ -0,0 +1,730 @@
1
+ import cv2
2
+ import numpy as np
3
+ import torch
4
+ from timm.data.constants import IMAGENET_INCEPTION_MEAN, IMAGENET_INCEPTION_STD
5
+ from timm.data.transforms import RandomResizedCropAndInterpolation
6
+ from torchvision import transforms
7
+ import urllib
8
+ from tqdm import tqdm
9
+ from cpm_live.tokenizers import CPMBeeTokenizer
10
+ from torch.utils.data import default_collate
11
+ from typing import Any, Callable, Dict, List, Optional, Tuple, Union
12
+ from typing_extensions import TypedDict
13
+ from numpy.typing import NDArray
14
+ import importlib.machinery
15
+ import importlib.util
16
+ import types
17
+ import random
18
+
19
+
20
+ CPMBeeInputType = Union[str, Dict[str, "CPMBeeInputType"]]
21
+
22
+
23
+ def pad(orig_items, key, max_length=None, padding_value=0, padding_side="left"):
24
+ items = []
25
+ if isinstance(orig_items[0][key], list):
26
+ assert isinstance(orig_items[0][key][0], torch.Tensor)
27
+ for it in orig_items:
28
+ for tr in it[key]:
29
+ items.append({key: tr})
30
+ else:
31
+ assert isinstance(orig_items[0][key], torch.Tensor)
32
+ items = orig_items
33
+
34
+ batch_size = len(items)
35
+ shape = items[0][key].shape
36
+ dim = len(shape)
37
+ assert dim <= 3
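+ # 1-D tensors are concatenated as-is; 2-D/3-D tensors are padded along the sequence dimension up to max_length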
38
+ if max_length is None:
39
+ max_length = 0
40
+ max_length = max(max_length, max(item[key].shape[-1] for item in items))
41
+ min_length = min(item[key].shape[-1] for item in items)
42
+ dtype = items[0][key].dtype
43
+
44
+ if dim == 1:
45
+ return torch.cat([item[key] for item in items], dim=0)
46
+ elif dim == 2:
47
+ if max_length == min_length:
48
+ return torch.cat([item[key] for item in items], dim=0)
49
+ tensor = torch.zeros((batch_size, max_length), dtype=dtype) + padding_value
50
+ else:
51
+ tensor = torch.zeros((batch_size, max_length, shape[-1]), dtype=dtype) + padding_value
52
+
53
+ for i, item in enumerate(items):
54
+ if dim == 2:
55
+ if padding_side == "left":
56
+ tensor[i, -len(item[key][0]):] = item[key][0].clone()
57
+ else:
58
+ tensor[i, : len(item[key][0])] = item[key][0].clone()
59
+ elif dim == 3:
60
+ if padding_side == "left":
61
+ tensor[i, -len(item[key][0]):, :] = item[key][0].clone()
62
+ else:
63
+ tensor[i, : len(item[key][0]), :] = item[key][0].clone()
64
+
65
+ return tensor
66
+
67
+
68
+ class CPMBeeCollater:
69
+ """
70
+ Collate function for CPMBee inputs; corresponds to the _MixedDatasetBatchPacker in cpm-live.
71
+ The native torch DataLoader is currently not well suited for adapting to in-context learning.
72
+ The original implementation also had a best_fit packing step to maximize the ratio of effective tokens; that is not supported here yet.
73
+ todo: @wangchongyi rewrite the Dataloader or BatchPacker
74
+ """
75
+
76
+ def __init__(self, tokenizer: CPMBeeTokenizer, max_len):
77
+ self.tokenizer = tokenizer
78
+ self._max_length = max_len
79
+ self.pad_keys = ['input_ids', 'input_id_subs', 'context', 'segment_ids', 'segment_rel_offset',
80
+ 'segment_rel', 'sample_ids', 'num_segments']
81
+
82
+ def __call__(self, batch):
83
+ batch_size = len(batch)
84
+
85
+ tgt = np.full((batch_size, self._max_length), -100, dtype=np.int32)
86
+ # ็›ฎๅ‰ๆฒกๆœ‰ best_fit, span ไธบๅ…จ 0
87
+ span = np.zeros((batch_size, self._max_length), dtype=np.int32)
88
+ length = np.zeros((batch_size,), dtype=np.int32)
89
+
90
+ batch_ext_table_map: Dict[Tuple[int, int], int] = {}
91
+ batch_ext_table_ids: List[int] = []
92
+ batch_ext_table_sub: List[int] = []
93
+ raw_data_list: List[Any] = []
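+ # (token_id, sub_id) pairs with sub_id > 0 are remapped to ids >= vocab_size through a per-batch ext table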
94
+
95
+ for i in range(batch_size):
96
+ instance_length = batch[i]['input_ids'][0].shape[0]
97
+ length[i] = instance_length
98
+ raw_data_list.extend(batch[i]['raw_data'])
99
+
100
+ for j in range(instance_length):
101
+ idx, idx_sub = batch[i]['input_ids'][0, j], batch[i]['input_id_subs'][0, j]
102
+ tgt_idx = idx
103
+ if idx_sub > 0:
104
+ # need to be in ext table
105
+ if (idx, idx_sub) not in batch_ext_table_map:
106
+ batch_ext_table_map[(idx, idx_sub)] = len(batch_ext_table_map)
107
+ batch_ext_table_ids.append(idx)
108
+ batch_ext_table_sub.append(idx_sub)
109
+ tgt_idx = batch_ext_table_map[(idx, idx_sub)] + self.tokenizer.vocab_size
110
+ if j > 1 and batch[i]['context'][0, j - 1] == 0:
111
+ if idx != self.tokenizer.bos_id:
112
+ tgt[i, j - 1] = tgt_idx
113
+ else:
114
+ tgt[i, j - 1] = self.tokenizer.eos_id
115
+ if batch[i]['context'][0, instance_length - 1] == 0:
116
+ tgt[i, instance_length - 1] = self.tokenizer.eos_id
117
+
118
+ if len(batch_ext_table_map) == 0:
119
+ # placeholder
120
+ batch_ext_table_ids.append(0)
121
+ batch_ext_table_sub.append(1)
122
+
123
+ # image
124
+ if 'pixel_values' in batch[0]:
125
+ data = {'pixel_values': default_collate([i['pixel_values'] for i in batch])}
126
+ else:
127
+ data = {}
128
+
129
+ # image_bound
130
+ if 'image_bound' in batch[0]:
131
+ data['image_bound'] = default_collate([i['image_bound'] for i in batch])
132
+
133
+ # bee inp
134
+ for key in self.pad_keys:
135
+ data[key] = pad(batch, key, max_length=self._max_length, padding_value=0, padding_side='right')
136
+
137
+ data['context'] = data['context'] > 0
138
+ data['length'] = torch.from_numpy(length)
139
+ data['span'] = torch.from_numpy(span)
140
+ data['target'] = torch.from_numpy(tgt)
141
+ data['ext_table_ids'] = torch.from_numpy(np.array(batch_ext_table_ids))
142
+ data['ext_table_sub'] = torch.from_numpy(np.array(batch_ext_table_sub))
143
+ data['raw_data'] = raw_data_list
144
+
145
+ return data
146
+
147
+
148
+ class _DictTree(TypedDict):
149
+ value: str
150
+ children: List["_DictTree"]
151
+ depth: int
152
+ segment_id: int
153
+ need_predict: bool
154
+ is_image: bool
155
+
156
+
157
+ class _PrevExtTableStates(TypedDict):
158
+ ext_table: Dict[int, str]
159
+ token_id_table: Dict[str, Dict[int, int]]
160
+
161
+
162
+ class _TransformFuncDict(TypedDict):
163
+ loader: importlib.machinery.SourceFileLoader
164
+ module: types.ModuleType
165
+ last_m: float
166
+
167
+
168
+ _TransformFunction = Callable[[CPMBeeInputType, int, random.Random], CPMBeeInputType]
169
+
170
+
171
+ class CPMBeeBatch(TypedDict):
172
+ inputs: NDArray[np.int32]
173
+ inputs_sub: NDArray[np.int32]
174
+ length: NDArray[np.int32]
175
+ context: NDArray[np.bool_]
176
+ sample_ids: NDArray[np.int32]
177
+ num_segments: NDArray[np.int32]
178
+ segment_ids: NDArray[np.int32]
179
+ segment_rel_offset: NDArray[np.int32]
180
+ segment_rel: NDArray[np.int32]
181
+ spans: NDArray[np.int32]
182
+ target: NDArray[np.int32]
183
+ ext_ids: NDArray[np.int32]
184
+ ext_sub: NDArray[np.int32]
185
+ task_ids: NDArray[np.int32]
186
+ task_names: List[str]
187
+ raw_data: List[Any]
188
+
189
+
190
+ def rel_to_bucket(n_up: int, n_down: int, max_depth: int = 8):
191
+ ret = n_up * max_depth + n_down
192
+ if ret == 0:
193
+ return ret
194
+ else:
195
+ # bucket 1 is reserved for incontext samples
196
+ return ret + 1
197
+
198
+
199
+ def convert_data_to_id(
200
+ tokenizer: CPMBeeTokenizer,
201
+ data: Any,
202
+ prev_ext_states: Optional[_PrevExtTableStates] = None,
203
+ shuffle_answer: bool = True,
204
+ max_depth: int = 8
205
+ ):
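+ # flattens the nested input dict into token ids plus segment / relation metadata; mirrors VisCpmChatBeeTokenizer.convert_data_to_id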
206
+ root: _DictTree = {
207
+ "value": "<root>",
208
+ "children": [],
209
+ "depth": 0,
210
+ "segment_id": 0,
211
+ "need_predict": False,
212
+ "is_image": False
213
+ }
214
+
215
+ segments = [root]
216
+
217
+ def _build_dict_tree(data: CPMBeeInputType, depth: int, need_predict: bool, is_image: bool) -> List[_DictTree]:
218
+ if isinstance(data, dict):
219
+ ret_list: List[_DictTree] = []
220
+ curr_items = list(data.items())
221
+ if need_predict and shuffle_answer:
222
+ access_idx = np.arange(len(curr_items))
223
+ np.random.shuffle(access_idx)
224
+ curr_items = [curr_items[idx] for idx in access_idx]
225
+ for k, v in curr_items:
226
+ child_info: _DictTree = {
227
+ "value": k,
228
+ "children": [],
229
+ "depth": depth,
230
+ "segment_id": len(segments),
231
+ "need_predict": False, # only leaves are contexts
232
+ "is_image": False,
233
+ }
234
+ segments.append(child_info)
235
+ child_info["children"] = _build_dict_tree(
236
+ v, depth + 1,
237
+ need_predict=need_predict or (depth == 1 and k == "<ans>"),
238
+ is_image=is_image or (depth == 1 and k == "image")
239
+ ) # elements in <root>.<ans>
240
+
241
+ ret_list.append(child_info)
242
+ return ret_list
243
+ else:
244
+ assert isinstance(data, str), "Invalid data {}".format(data)
245
+ ret: _DictTree = {
246
+ "value": data,
247
+ "children": [],
248
+ "depth": depth,
249
+ "segment_id": len(segments),
250
+ "need_predict": need_predict,
251
+ "is_image": is_image,
252
+ }
253
+ segments.append(ret)
254
+ return [ret]
255
+
256
+ root["children"] = _build_dict_tree(data, 1, False, False)
257
+
258
+ num_segments = len(segments)
259
+ segment_rel = np.zeros((num_segments * num_segments,), dtype=np.int32)
260
+
261
+ def _build_segment_rel(node: _DictTree) -> List[Tuple[int, int]]:
262
+ ret: List[Tuple[int, int]] = [(node["segment_id"], node["depth"])]
263
+ for child in node["children"]:
264
+ sub = _build_segment_rel(child)
265
+ for seg_id_1, depth_1 in sub:
266
+ for seg_id_2, depth_2 in ret:
267
+ n_up = min(depth_1 - node["depth"], max_depth - 1)
268
+ n_down = min(depth_2 - node["depth"], max_depth - 1)
269
+ segment_rel[seg_id_1 * num_segments + seg_id_2] = rel_to_bucket(
270
+ n_up, n_down, max_depth=max_depth
271
+ )
272
+ segment_rel[seg_id_2 * num_segments + seg_id_1] = rel_to_bucket(
273
+ n_down, n_up, max_depth=max_depth
274
+ )
275
+ ret.extend(sub)
276
+ return ret
277
+
278
+ _build_segment_rel(root)
279
+
280
+ input_ids: List[int] = []
281
+ input_id_subs: List[int] = []
282
+ segment_bound: List[Tuple[int, int]] = []
283
+ image_bound: List[Tuple[int, int]] = []
284
+
285
+ ext_table: Dict[int, str] = {}
286
+ token_id_table: Dict[str, Dict[int, int]] = {}
287
+
288
+ if prev_ext_states is not None:
289
+ ext_table = prev_ext_states["ext_table"]
290
+ token_id_table = prev_ext_states["token_id_table"]
291
+
292
+ for seg in segments:
293
+ tokens, ext_table = tokenizer.encode(seg["value"], ext_table)
294
+
295
+ token_id_subs = []
296
+ reid_token_ids = []
297
+ for idx in tokens:
298
+ if idx in ext_table:
299
+ # unk or special token
300
+ token = ext_table[idx]
301
+ if token.startswith("<") and token.endswith(">"):
302
+ # special token
303
+ if "_" in token:
304
+ token_name = token[1:-1].split("_", maxsplit=1)[0]
305
+ else:
306
+ token_name = token[1:-1]
307
+ token_name = "<{}>".format(token_name)
308
+ else:
309
+ token_name = "<unk>"
310
+
311
+ if token_name not in token_id_table:
312
+ token_id_table[token_name] = {}
313
+ if idx not in token_id_table[token_name]:
314
+ token_id_table[token_name][idx] = len(token_id_table[token_name])
315
+ if token_name not in tokenizer.encoder:
316
+ raise ValueError("Invalid token {}".format(token))
317
+ reid_token_ids.append(tokenizer.encoder[token_name])
318
+ token_id_subs.append(token_id_table[token_name][idx])
319
+ else:
320
+ reid_token_ids.append(idx)
321
+ token_id_subs.append(0)
322
+ tokens = [tokenizer.bos_id] + reid_token_ids
323
+ token_id_subs = [0] + token_id_subs
324
+ if not seg["need_predict"]:
325
+ tokens = tokens + [tokenizer.eos_id]
326
+ token_id_subs = token_id_subs + [0]
327
+ else:
328
+ # no eos
329
+ pass
330
+ begin = len(input_ids)
331
+ input_ids.extend(tokens)
332
+ input_id_subs.extend(token_id_subs)
333
+ end = len(input_ids)
334
+ segment_bound.append((begin, end))
335
+
336
+ ids = np.array(input_ids, dtype=np.int32)
337
+ id_subs = np.array(input_id_subs, dtype=np.int32)
338
+ segs = np.zeros((ids.shape[0],), dtype=np.int32)
339
+ context = np.zeros((ids.shape[0],), dtype=np.int8)
340
+ for i, (begin, end) in enumerate(segment_bound):
341
+ if not segments[i]["need_predict"]:
342
+ context[begin:end] = 1
343
+ if segments[i]["is_image"]:
344
+ image_bound.append((begin+1, end-1))
345
+ segs[begin:end] = i
346
+
347
+ curr_ext_table_states: _PrevExtTableStates = {
348
+ "ext_table": ext_table,
349
+ "token_id_table": token_id_table,
350
+ }
351
+ image_bound = np.array(image_bound, dtype=np.int32)
352
+ return ids, id_subs, context, segs, segment_rel, num_segments, curr_ext_table_states, image_bound
353
+
354
+
355
+ # aug functions
356
+ def identity_func(img):
357
+ return img
358
+
359
+
360
+ def autocontrast_func(img, cutoff=0):
361
+ '''
362
+ same output as PIL.ImageOps.autocontrast
363
+ '''
364
+ n_bins = 256
365
+
366
+ def tune_channel(ch):
367
+ n = ch.size
368
+ cut = cutoff * n // 100
369
+ if cut == 0:
370
+ high, low = ch.max(), ch.min()
371
+ else:
372
+ hist = cv2.calcHist([ch], [0], None, [n_bins], [0, n_bins])
373
+ low = np.argwhere(np.cumsum(hist) > cut)
374
+ low = 0 if low.shape[0] == 0 else low[0]
375
+ high = np.argwhere(np.cumsum(hist[::-1]) > cut)
376
+ high = n_bins - 1 if high.shape[0] == 0 else n_bins - 1 - high[0]
377
+ if high <= low:
378
+ table = np.arange(n_bins)
379
+ else:
380
+ scale = (n_bins - 1) / (high - low)
381
+ table = np.arange(n_bins) * scale - low * scale
382
+ table[table < 0] = 0
383
+ table[table > n_bins - 1] = n_bins - 1
384
+ table = table.clip(0, 255).astype(np.uint8)
385
+ return table[ch]
386
+
387
+ channels = [tune_channel(ch) for ch in cv2.split(img)]
388
+ out = cv2.merge(channels)
389
+ return out
390
+
391
+
392
+ def equalize_func(img):
393
+ '''
394
+ same output as PIL.ImageOps.equalize
395
+ PIL's implementation is different from cv2.equalize
396
+ '''
397
+ n_bins = 256
398
+
399
+ def tune_channel(ch):
400
+ hist = cv2.calcHist([ch], [0], None, [n_bins], [0, n_bins])
401
+ non_zero_hist = hist[hist != 0].reshape(-1)
402
+ step = np.sum(non_zero_hist[:-1]) // (n_bins - 1)
403
+ if step == 0:
404
+ return ch
405
+ n = np.empty_like(hist)
406
+ n[0] = step // 2
407
+ n[1:] = hist[:-1]
408
+ table = (np.cumsum(n) // step).clip(0, 255).astype(np.uint8)
409
+ return table[ch]
410
+
411
+ channels = [tune_channel(ch) for ch in cv2.split(img)]
412
+ out = cv2.merge(channels)
413
+ return out
414
+
415
+
416
+ def rotate_func(img, degree, fill=(0, 0, 0)):
417
+ '''
418
+ like PIL, rotate by degree, not radians
419
+ '''
420
+ H, W = img.shape[0], img.shape[1]
421
+ center = W / 2, H / 2
422
+ M = cv2.getRotationMatrix2D(center, degree, 1)
423
+ out = cv2.warpAffine(img, M, (W, H), borderValue=fill)
424
+ return out
425
+
426
+
427
+ def solarize_func(img, thresh=128):
428
+ '''
429
+ same output as PIL.ImageOps.solarize
430
+ '''
431
+ table = np.array([el if el < thresh else 255 - el for el in range(256)])
432
+ table = table.clip(0, 255).astype(np.uint8)
433
+ out = table[img]
434
+ return out
435
+
436
+
437
+ def color_func(img, factor):
438
+ '''
439
+ same output as PIL.ImageEnhance.Color
440
+ '''
441
+ # implementation according to PIL definition, quite slow
442
+ # degenerate = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)[:, :, np.newaxis]
443
+ # out = blend(degenerate, img, factor)
444
+ # M = (
445
+ # np.eye(3) * factor
446
+ # + np.float32([0.114, 0.587, 0.299]).reshape(3, 1) * (1. - factor)
447
+ # )[np.newaxis, np.newaxis, :]
448
+ M = (
449
+ np.float32([
450
+ [0.886, -0.114, -0.114],
451
+ [-0.587, 0.413, -0.587],
452
+ [-0.299, -0.299, 0.701]]) * factor
453
+ + np.float32([[0.114], [0.587], [0.299]])
454
+ )
455
+ out = np.matmul(img, M).clip(0, 255).astype(np.uint8)
456
+ return out
457
+
458
+
459
+ def contrast_func(img, factor):
460
+ """
461
+ same output as PIL.ImageEnhance.Contrast
462
+ """
463
+ mean = np.sum(np.mean(img, axis=(0, 1)) * np.array([0.114, 0.587, 0.299]))
464
+ table = np.array([(
465
+ el - mean) * factor + mean
466
+ for el in range(256)
467
+ ]).clip(0, 255).astype(np.uint8)
468
+ out = table[img]
469
+ return out
470
+
471
+
472
+ def brightness_func(img, factor):
473
+ '''
474
+ same output as PIL.ImageEnhance.Brightness
475
+ '''
476
+ table = (np.arange(256, dtype=np.float32) * factor).clip(0, 255).astype(np.uint8)
477
+ out = table[img]
478
+ return out
479
+
480
+
481
+ def sharpness_func(img, factor):
482
+ '''
483
+ The differences between this result and PIL are only on the 4 boundaries; the center
484
+ areas are the same
485
+ '''
486
+ kernel = np.ones((3, 3), dtype=np.float32)
487
+ kernel[1][1] = 5
488
+ kernel /= 13
+     degenerate = cv2.filter2D(img, -1, kernel)
+     if factor == 0.0:
+         out = degenerate
+     elif factor == 1.0:
+         out = img
+     else:
+         out = img.astype(np.float32)
+         degenerate = degenerate.astype(np.float32)[1:-1, 1:-1, :]
+         out[1:-1, 1:-1, :] = degenerate + factor * (out[1:-1, 1:-1, :] - degenerate)
+         out = out.astype(np.uint8)
+     return out
+
+
+ def shear_x_func(img, factor, fill=(0, 0, 0)):
+     H, W = img.shape[0], img.shape[1]
+     M = np.float32([[1, factor, 0], [0, 1, 0]])
+     out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
+     return out
+
+
+ def translate_x_func(img, offset, fill=(0, 0, 0)):
+     '''
+     same output as PIL.Image.transform
+     '''
+     H, W = img.shape[0], img.shape[1]
+     M = np.float32([[1, 0, -offset], [0, 1, 0]])
+     out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
+     return out
+
+
+ def translate_y_func(img, offset, fill=(0, 0, 0)):
+     '''
+     same output as PIL.Image.transform
+     '''
+     H, W = img.shape[0], img.shape[1]
+     M = np.float32([[1, 0, 0], [0, 1, -offset]])
+     out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
+     return out
+
+
+ def posterize_func(img, bits):
+     '''
+     same output as PIL.ImageOps.posterize
+     '''
+     out = np.bitwise_and(img, np.uint8(255 << (8 - bits)))
+     return out
+
+
+ def shear_y_func(img, factor, fill=(0, 0, 0)):
+     H, W = img.shape[0], img.shape[1]
+     M = np.float32([[1, 0, 0], [factor, 1, 0]])
+     out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
+     return out
+
+
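+ # Cutout: masks a square patch of side ~pad_size, centred at a uniformly random
+ # position and clipped at the image border, with the `replace` colour.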
+ def cutout_func(img, pad_size, replace=(0, 0, 0)):
+     replace = np.array(replace, dtype=np.uint8)
+     H, W = img.shape[0], img.shape[1]
+     rh, rw = np.random.random(2)
+     pad_size = pad_size // 2
+     ch, cw = int(rh * H), int(rw * W)
+     x1, x2 = max(ch - pad_size, 0), min(ch + pad_size, H)
+     y1, y2 = max(cw - pad_size, 0), min(cw + pad_size, W)
+     out = img.copy()
+     out[x1:x2, y1:y2, :] = replace
+     return out
+
+
+ # level to args
+ def enhance_level_to_args(MAX_LEVEL):
+     def level_to_args(level):
+         return ((level / MAX_LEVEL) * 1.8 + 0.1,)
+     return level_to_args
+
+
+ def shear_level_to_args(MAX_LEVEL, replace_value):
+     def level_to_args(level):
+         level = (level / MAX_LEVEL) * 0.3
+         if np.random.random() > 0.5:
+             level = -level
+         return (level, replace_value)
+
+     return level_to_args
+
+
+ def translate_level_to_args(translate_const, MAX_LEVEL, replace_value):
+     def level_to_args(level):
+         level = (level / MAX_LEVEL) * float(translate_const)
+         if np.random.random() > 0.5:
+             level = -level
+         return (level, replace_value)
+
+     return level_to_args
+
+
+ def cutout_level_to_args(cutout_const, MAX_LEVEL, replace_value):
+     def level_to_args(level):
+         level = int((level / MAX_LEVEL) * cutout_const)
+         return (level, replace_value)
+
+     return level_to_args
+
+
+ def solarize_level_to_args(MAX_LEVEL):
+     def level_to_args(level):
+         level = int((level / MAX_LEVEL) * 256)
+         return (level, )
+     return level_to_args
+
+
+ def none_level_to_args(level):
+     return ()
+
+
+ def posterize_level_to_args(MAX_LEVEL):
+     def level_to_args(level):
+         level = int((level / MAX_LEVEL) * 4)
+         return (level, )
+     return level_to_args
+
+
+ def rotate_level_to_args(MAX_LEVEL, replace_value):
+     def level_to_args(level):
+         level = (level / MAX_LEVEL) * 30
+         if np.random.random() < 0.5:
+             level = -level
+         return (level, replace_value)
+
+     return level_to_args
+
+
+ func_dict = {
+     'Identity': identity_func,
+     'AutoContrast': autocontrast_func,
+     'Equalize': equalize_func,
+     'Rotate': rotate_func,
+     'Solarize': solarize_func,
+     'Color': color_func,
+     'Contrast': contrast_func,
+     'Brightness': brightness_func,
+     'Sharpness': sharpness_func,
+     'ShearX': shear_x_func,
+     'TranslateX': translate_x_func,
+     'TranslateY': translate_y_func,
+     'Posterize': posterize_func,
+     'ShearY': shear_y_func,
+ }
+
+ translate_const = 10
+ MAX_LEVEL = 10
+ replace_value = (128, 128, 128)
+ arg_dict = {
+     'Identity': none_level_to_args,
+     'AutoContrast': none_level_to_args,
+     'Equalize': none_level_to_args,
+     'Rotate': rotate_level_to_args(MAX_LEVEL, replace_value),
+     'Solarize': solarize_level_to_args(MAX_LEVEL),
+     'Color': enhance_level_to_args(MAX_LEVEL),
+     'Contrast': enhance_level_to_args(MAX_LEVEL),
+     'Brightness': enhance_level_to_args(MAX_LEVEL),
+     'Sharpness': enhance_level_to_args(MAX_LEVEL),
+     'ShearX': shear_level_to_args(MAX_LEVEL, replace_value),
+     'TranslateX': translate_level_to_args(
+         translate_const, MAX_LEVEL, replace_value
+     ),
+     'TranslateY': translate_level_to_args(
+         translate_const, MAX_LEVEL, replace_value
+     ),
+     'Posterize': posterize_level_to_args(MAX_LEVEL),
+     'ShearY': shear_level_to_args(MAX_LEVEL, replace_value),
+ }
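+ # Each arg_dict entry maps an integer level in [0, MAX_LEVEL] to the positional
+ # arguments expected by the matching func_dict entry, e.g. 'Rotate' at level 10
+ # gives +/-30 degrees and the enhance ops give a factor in [0.1, 1.9].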
+
+
+ class RandomAugment(object):
+
+     def __init__(self, N=2, M=10, isPIL=False, augs=[]):
+         self.N = N
+         self.M = M
+         self.isPIL = isPIL
+         if augs:
+             self.augs = augs
+         else:
+             self.augs = list(arg_dict.keys())
+
+     def get_random_ops(self):
+         sampled_ops = np.random.choice(self.augs, self.N)
+         return [(op, 0.5, self.M) for op in sampled_ops]
+
+     def __call__(self, img):
+         if self.isPIL:
+             img = np.array(img)
+         ops = self.get_random_ops()
+         for name, prob, level in ops:
+             if np.random.random() > prob:
+                 continue
+             args = arg_dict[name](level)
+             img = func_dict[name](img, *args)
+         return img
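+
+     # Usage sketch (illustrative): each call samples N ops and applies each one
+     # with probability 0.5 at magnitude M, e.g.
+     #   aug = RandomAugment(2, 7, isPIL=True, augs=['Identity', 'Brightness', 'Rotate'])
+     #   img_out = aug(pil_img)  # np.uint8 array, same H x W x 3 shape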
+
+
+ def build_transform(is_train, randaug=True, input_size=224, interpolation='bicubic'):
+     if is_train:
+         t = [
+             RandomResizedCropAndInterpolation(
+                 input_size, scale=(0.5, 1.0), interpolation=transforms.InterpolationMode.BICUBIC),
+             transforms.RandomHorizontalFlip(),
+         ]
+         if randaug:
+             t.append(
+                 RandomAugment(
+                     2, 7, isPIL=True,
+                     augs=[
+                         'Identity', 'AutoContrast', 'Equalize', 'Brightness', 'Sharpness',
+                         'ShearX', 'ShearY', 'TranslateX', 'TranslateY', 'Rotate',
+                     ]))
+         t += [
+             transforms.ToTensor(),
+             transforms.Normalize(mean=IMAGENET_INCEPTION_MEAN, std=IMAGENET_INCEPTION_STD),
+         ]
+         t = transforms.Compose(t)
+     else:
+         t = transforms.Compose([
+             transforms.Resize((input_size, input_size),
+                               interpolation=transforms.InterpolationMode.BICUBIC),
+             transforms.ToTensor(),
+             transforms.Normalize(mean=IMAGENET_INCEPTION_MEAN, std=IMAGENET_INCEPTION_STD)
+         ])
+
+     return t
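+
+ # Usage sketch (illustrative):
+ #   transform = build_transform(is_train=True, randaug=True, input_size=224)
+ #   pixel_values = transform(pil_img)  # float tensor of shape (3, 224, 224)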
+
+
+ def _urlretrieve(url: str, filename: str, chunk_size: int = 1024) -> None:
+     with open(filename, "wb") as fh:
+         with urllib.request.urlopen(
+             urllib.request.Request(url, headers={"User-Agent": "vissl"})
+         ) as response:
+             with tqdm(total=response.length) as pbar:
+                 for chunk in iter(lambda: response.read(chunk_size), b""):
+                     if not chunk:
+                         break
+                     pbar.update(chunk_size)
+                     fh.write(chunk)
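+
+ # Usage sketch (illustrative): stream a file to disk with a tqdm progress bar, e.g.
+ #   _urlretrieve("https://example.org/archive.zip", "archive.zip")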
vocab.txt ADDED
The diff for this file is too large to render. See raw diff