clip-huge-zh-75k-steps-bs4096

Brief Introduction

This model was trained so that Chinese text can guide Stable Diffusion 2 generation. The image encoder of open_clip's CLIP-VIT-H is frozen and only the text encoder is trained, which aligns the Chinese text representations with the original English semantic space. All training samples come from the Chinese subset of LAION-5B.

Note: Because the dataset size, batch size, and number of steps are all far smaller than those used for the original CLIP-H, this model is far from converged and falls well short of the performance expected of a huge-size model. It is intended only as an intermediate result for text guidance in Stable Diffusion 2. You are very welcome to train it further to strengthen its CLIP performance.

Stable Diffusion 2 Guiding Example

赛博朋克风格的城市街道 (a cyberpunk-style city street)

一只可爱的柴犬 (a cute Shiba Inu)

Training Details

Text Encoder

The text encoder has the same structure as the one used by Stable Diffusion 2: open_clip's CLIP-VIT-H. To keep the Chinese encodings as close as possible to those of the original English encoder in the semantic space, the text encoder was trained as follows:

Brute-force replace the vocab and tokenizer of the original English clip_huge text encoder with those of Chinese RoBERTa, so that Chinese token sequences directly index the original embedding layer.
Copy all weights from the original English text encoder of CLIP-VIT-H.
Freeze all parameters of the image encoder, plus the encoder layers and output projection of the text encoder, and train only the word embeddings. This aligns the Chinese word embeddings with the original English embedding space while keeping the final projected latent space as unchanged as possible.
After a number of steps, unfreeze the entire text encoder so that the whole text model fits the semantic space of the clip_huge image encoder.
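The two-stage freezing schedule in the steps above can be sketched as a small helper that toggles `requires_grad` by parameter name. This is only an illustration, not the authors' training code: the helper `set_trainable` is hypothetical, the name prefixes assume the parameter naming of Hugging Face `CLIPModel`, and keeping `logit_scale` frozen in both stages is an assumption the source does not confirm.

```python
from types import SimpleNamespace

# Name prefixes assume Hugging Face `CLIPModel` parameter naming (an assumption).
STAGE_PREFIXES = {
    # Stage 1: only the word (token) embeddings are trained.
    1: ("text_model.embeddings.token_embedding.",),
    # Stage 2: the whole text tower, including its output projection.
    2: ("text_model.", "text_projection."),
}


def set_trainable(named_parameters, stage):
    """Set requires_grad for the given stage and return the trainable names."""
    prefixes = STAGE_PREFIXES[stage]
    trainable = []
    for name, param in named_parameters:
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in parameters so the sketch runs without downloading the model.
names = [
    "text_model.embeddings.token_embedding.weight",
    "text_model.embeddings.position_embedding.weight",
    "text_model.encoder.layers.0.self_attn.q_proj.weight",
    "text_model.final_layer_norm.weight",
    "text_projection.weight",
    "vision_model.embeddings.patch_embedding.weight",
    "visual_projection.weight",
    "logit_scale",
]
params = [(n, SimpleNamespace(requires_grad=True)) for n in names]

stage1 = set_trainable(params, stage=1)
stage2 = set_trainable(params, stage=2)
print("stage 1:", stage1)
print("stage 2:", stage2)
```

With a real model, the same helper could be called as `set_trainable(model.named_parameters(), stage=1)` before building the optimizer, and again with `stage=2` once the embedding-only phase is over. The source does not state how many steps the first stage lasts.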

Note: The Chinese text encoder is optimized with the CLIP loss. The training set is the Chinese subset of LAION-5B (around 85M text-image pairs remained usable after invalid URLs and similar failures). The model was trained for 75k steps at a batch size of 4096, i.e. about 307M samples seen or roughly 3.6 passes over the data, so it has not fully converged.
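For readers unfamiliar with the CLIP loss mentioned in the note, here is a minimal NumPy sketch of the usual symmetric contrastive objective: a cross-entropy over the image-to-text similarity matrix, averaged with the same loss in the text-to-image direction. It is an illustration under stated assumptions rather than the training code used for this model; the function names are hypothetical and the default `logit_scale` of 1/0.07 is CLIP's customary initial temperature, not a value reported here.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_loss(image_features, text_features, logit_scale=1 / 0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    img = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = logit_scale * img @ txt.T  # (N, N) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
aligned = clip_loss(feats, feats)         # every pair matches perfectly
shuffled = clip_loss(feats, feats[::-1])  # every pair is mismatched
print(aligned, shuffled)
```

Correctly paired features give a lower loss than mismatched ones, which is the signal that pulls the Chinese text features toward the frozen image features during training.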

Usage

Zero-Shot Classification

import torch
import numpy as np
import requests
from PIL import Image
from transformers import CLIPModel, CLIPFeatureExtractor, AutoTokenizer

model_id = "lyua1225/clip-huge-zh-75k-steps-bs4096"
model = CLIPModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = CLIPFeatureExtractor.from_pretrained(model_id)

# online example from OFA-Sys
url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
texts = ["杰尼龟", "妙蛙种子", "皮卡丘", "小火龙"]

# compute image features (return_tensors="pt" yields a batched torch tensor directly)
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(pixel_values=inputs.pixel_values)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  

# compute text features
inputs = tokenizer(text=texts, padding="max_length", max_length=77, return_tensors="pt")
input_ids, attention_mask = inputs.input_ids, inputs.attention_mask
input_dict = dict(input_ids=input_ids, attention_mask=attention_mask)
text_features = model.get_text_features(**input_dict)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute probs for each class
logit_scale = model.logit_scale.exp()
logits_per_image = logit_scale * image_features @ text_features.t()
logits_per_text = logits_per_image.t()
probs = logits_per_image.softmax(dim=-1).detach().numpy()
print(np.around(probs, 3))

Guiding Stable Diffusion V2.1

This Chinese model can be used to guide Stable Diffusion 2 generation (FP16 inference is recommended on Turing-architecture GPUs, or on V100 and later GPUs).

import torch
from diffusers import StableDiffusionPipeline
from transformers import AutoTokenizer, CLIPTextModel

clip_id = "lyua1225/clip-huge-zh-75k-steps-bs4096"
sd2_id = "stabilityai/stable-diffusion-2-1"

text_encoder = CLIPTextModel.from_pretrained(clip_id).half()
tokenizer = AutoTokenizer.from_pretrained(clip_id, trust_remote_code=True)
pipe = StableDiffusionPipeline.from_pretrained(sd2_id, torch_dtype=torch.float16, revision="fp16",
                                               tokenizer=tokenizer, text_encoder=text_encoder)
pipe.to("cuda")

image = pipe("赛博朋克风格的城市街道", num_inference_steps=20).images[0]
image.save("cyberpunk.jpeg")
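The FP16 recommendation in this section can be turned into a small check. The sketch below is an assumption-laden illustration: the helper `pick_dtype_name` is hypothetical, and it interprets "Turing or V100 and later" as CUDA compute capability 7.0 or higher (V100 is 7.0, Turing is 7.5).

```python
def pick_dtype_name(compute_capability):
    """Return "float16" for compute capability >= 7.0, else "float32"."""
    return "float16" if tuple(compute_capability) >= (7, 0) else "float32"


print(pick_dtype_name((7, 5)))  # Turing, e.g. T4
print(pick_dtype_name((7, 0)))  # Volta, e.g. V100
print(pick_dtype_name((6, 1)))  # Pascal, e.g. GTX 1080
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current GPU, and `getattr(torch, pick_dtype_name(...))` gives the matching dtype to pass as `torch_dtype`. On older GPUs, loading the text encoder without `.half()` keeps everything in FP32.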