weifeng committed on
Commit
f41b749
1 Parent(s): b485c27
Files changed (2)
  1. README.md +7 -7
  2. pytorch_model.bin +1 -1
README.md CHANGED
@@ -28,13 +28,13 @@ The first open source Chinese CLIP, pre-training on 123M image-text pairs, the t
 
  | 需求 Demand | 任务 Task | 系列 Series | 模型 Model | 参数 Parameter | 额外 Extra |
  | :----: | :----: | :----: | :----: | :----: | :----: |
- | 特殊 Special | 多模态 Multimodal | 太乙 Taiyi | CLIP (Roberta) | 102M | Chinese |
+ | 特殊 Special | 多模态 Multimodal | 太乙 Taiyi | CLIP (RoBERTa) | 102M | Chinese |
 
  ## 模型信息 Model Information
 
- 我们遵循CLIP的实验设置,以获得强大的视觉-语言表征。在训练中文版的CLIP时,我们使用[chinese-roberta-wwm](https://huggingface.co/hfl/chinese-roberta-wwm-ext)作为语言的编码器,并将[open_clip](https://github.com/mlfoundations/open_clip)中的**ViT-H-14**应用于视觉的编码器。为了快速且稳定地进行预训练,我们冻结了视觉编码器并且只微调语言编码器。此外,我们将[Noah-Wukong](https://wukong-dataset.github.io/wukong-dataset/)数据集(100M)和[Zero](https://zero.so.com/)数据集(23M)用作预训练的数据集。在悟空数据集和zero数据集上预训练24轮。据我们所知,我们的Taiyi-CLIP是目前Huggingface社区中首个的开源中文CLIP。
+ 我们遵循CLIP的实验设置,以获得强大的视觉-语言表征。在训练中文版的CLIP时,我们使用[chinese-roberta-wwm](https://huggingface.co/hfl/chinese-roberta-wwm-ext)作为语言的编码器,并将[open_clip](https://github.com/mlfoundations/open_clip)中的**ViT-L-14**应用于视觉的编码器。为了快速且稳定地进行预训练,我们冻结了视觉编码器并且只微调语言编码器。此外,我们将[Noah-Wukong](https://wukong-dataset.github.io/wukong-dataset/)数据集(100M)和[Zero](https://zero.so.com/)数据集(23M)用作预训练的数据集。在悟空数据集和zero数据集上预训练24轮。据我们所知,我们的Taiyi-CLIP是目前Huggingface社区中首个的开源中文CLIP。
 
- We follow the experimental setup of CLIP to obtain powerful visual-language intelligence. To obtain the CLIP for Chinese, we employ [chinese-roberta-wwm](https://huggingface.co/hfl/chinese-roberta-wwm-ext) for the language encoder, and apply the **ViT-H-14** in [open_clip](https://github.com/mlfoundations/open_clip) for the vision encoder. We freeze the vision encoder and tune the language encoder to speed up and stabilize the pre-training process. Moreover, we apply [Noah-Wukong](https://wukong-dataset.github.io/wukong-dataset/) dataset (100M) and [Zero](https://zero.so.com/) dataset (23M) as the pre-training datasets. The model was first trained 24 epochs on wukong and zero. To the best of our knowledge, our TaiyiCLIP is currently the only open-sourced Chinese CLIP in the huggingface community.
+ We follow the experimental setup of CLIP to obtain powerful visual-language intelligence. To obtain the CLIP for Chinese, we employ [chinese-roberta-wwm](https://huggingface.co/hfl/chinese-roberta-wwm-ext) for the language encoder, and apply the **ViT-L-14** in [open_clip](https://github.com/mlfoundations/open_clip) for the vision encoder. We freeze the vision encoder and tune the language encoder to speed up and stabilize the pre-training process. Moreover, we apply [Noah-Wukong](https://wukong-dataset.github.io/wukong-dataset/) dataset (100M) and [Zero](https://zero.so.com/) dataset (23M) as the pre-training datasets. The model was first trained 24 epochs on wukong and zero. To the best of our knowledge, our TaiyiCLIP is currently the only open-sourced Chinese CLIP in the huggingface community.
 
  ### 下游效果 Performance
 
@@ -42,15 +42,15 @@ We follow the experimental setup of CLIP to obtain powerful visual-language inte
 
  | model | dataset | Top1 | Top5 |
  | ---- | ---- | ---- | ---- |
- | Taiyi-CLIP-RoBERTa-102M-ViT-L-Chinese | ImageNet1k-CN | 54.35% | 80.64% |
+ | Taiyi-CLIP-RoBERTa-102M-ViT-L-Chinese | ImageNet1k-CN | 55.04% | 81.75% |
 
  **Zero-Shot Text-to-Image Retrieval**
 
  | model | dataset | Top1 | Top5 | Top10 |
  | ---- | ---- | ---- | ---- | ---- |
- | Taiyi-CLIP-RoBERTa-102M-ViT-L-Chinese | Flickr30k-CNA-test | 60.82% | 85.00% | 91.04% |
- | Taiyi-CLIP-RoBERTa-102M-ViT-L-Chinese | COCO-CN-test | 60.02% | 83.95% | 93.26% |
- | Taiyi-CLIP-RoBERTa-102M-ViT-L-Chinese | wukong50k | 66.85% | 92.81% | 96.69% |
+ | Taiyi-CLIP-RoBERTa-102M-ViT-L-Chinese | Flickr30k-CNA-test | 58.32% | 82.96% | 89.40% |
+ | Taiyi-CLIP-RoBERTa-102M-ViT-L-Chinese | COCO-CN-test | 55.27% | 81.10% | 90.78% |
+ | Taiyi-CLIP-RoBERTa-102M-ViT-L-Chinese | wukong50k | 64.95% | 91.77% | 96.28% |
 
  ## 使用 Usage
 
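The model card above pairs a Chinese RoBERTa text encoder with open_clip's **ViT-L-14** vision encoder and evaluates it with zero-shot classification and text-to-image retrieval. The official usage snippet lives in the card's 使用 Usage section, which is unchanged and not shown in this diff; the following is only a minimal sketch of what zero-shot image-text matching with such a pairing looks like via `transformers`. The repository IDs, the [CLS] pooling, and the assumption that the text hidden size already matches CLIP's image embedding dimension are placeholders and may differ from the released checkpoint.

```python
# Minimal sketch (not the card's official usage code): zero-shot image-text
# matching with a Chinese RoBERTa text encoder and a frozen CLIP ViT-L/14
# vision encoder. Repo IDs, pooling, and the missing text projection head
# are assumptions about the released checkpoint.
import torch
from PIL import Image
from transformers import BertModel, BertTokenizer, CLIPModel, CLIPProcessor

TEXT_REPO = "IDEA-CCNL/Taiyi-CLIP-Roberta-102M-ViT-L-Chinese"  # hypothetical ID
VISION_REPO = "openai/clip-vit-large-patch14"                   # ViT-L/14 weights

tokenizer = BertTokenizer.from_pretrained(TEXT_REPO)
text_encoder = BertModel.from_pretrained(TEXT_REPO).eval()
clip_model = CLIPModel.from_pretrained(VISION_REPO).eval()
processor = CLIPProcessor.from_pretrained(VISION_REPO)

labels = ["一只猫", "一只狗", "一辆汽车"]     # candidate Chinese captions
image = Image.open("example.jpg")             # placeholder image path

with torch.no_grad():
    # Text side: take the [CLS] hidden state as the sentence embedding
    # (assumed pooling; the actual checkpoint may project differently).
    tokens = tokenizer(labels, return_tensors="pt", padding=True)
    text_feats = text_encoder(**tokens).last_hidden_state[:, 0, :]
    # Vision side: CLIP's projected image features from the frozen encoder.
    pixels = processor(images=image, return_tensors="pt")
    image_feats = clip_model.get_image_features(**pixels)

# L2-normalize both sides and rank captions by cosine similarity.
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
probs = (image_feats @ text_feats.T).softmax(dim=-1)
print(dict(zip(labels, probs.squeeze(0).tolist())))
```

The scoring step is standard CLIP inference: normalize both embeddings and compare by cosine similarity; the softmax only turns the scores into a distribution over the candidate captions for readability.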
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ee35cc92a37beaf183455091f3bb817788180bf781dc8562c9ecbca39ad9b426
+ oid sha256:6bf3132c35b05a0c9d4c16d4b0693855f24a854c7379c985ea5049449731e540
  size 409140017
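The Top1/Top5/Top10 figures in the retrieval tables are recall-at-K style metrics: for each caption, check whether the matching image appears among the K nearest images by cosine similarity. The exact evaluation protocol is not part of this commit, so the helper below is only an illustrative sketch over precomputed, already-normalized embeddings.

```python
# Hedged sketch: Top-K text-to-image retrieval accuracy from precomputed,
# L2-normalized embeddings where row i of both tensors is a matched pair.
# Illustrative only; the datasets' official protocols may differ.
import torch

def topk_retrieval(text_feats: torch.Tensor,
                   image_feats: torch.Tensor,
                   ks=(1, 5, 10)) -> dict:
    sims = text_feats @ image_feats.T              # (N, N) cosine scores
    target = torch.arange(sims.size(0))            # ground-truth image per caption
    ranks = sims.argsort(dim=-1, descending=True)  # images sorted by score
    hits = ranks == target[:, None]                # True where the match appears
    return {f"Top{k}": hits[:, :k].any(dim=-1).float().mean().item() for k in ks}

# Call-shape example with random unit vectors.
t = torch.nn.functional.normalize(torch.randn(100, 768), dim=-1)
v = torch.nn.functional.normalize(torch.randn(100, 768), dim=-1)
print(topk_retrieval(t, v))
```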