# clip-huge-zh-75k-steps-bs4096

## Brief Introduction

The purpose of training this model is to use Chinese text to guide Stable Diffusion 2 generation. Freezing the vision part of CLIP-VIT-H and training only the text encoder aligns the Chinese latent space with the original English latent space. All training samples come from the Chinese subset of LAION-5B.

Note: Because of the limited dataset size, batch size, and number of training steps, this model is still far from the expected performance and convergence. It should be regarded only as an intermediate result for the Stable Diffusion 2 text encoder. You are very welcome to continue training from this model to enhance its CLIP performance.

## Training Details

### Text Encoder

The text encoder has the same structure as that of open_clip/CLIP-VIT-H, which is used by Stable Diffusion 2. Our goal is to map the Chinese latent space onto the original English one. The training procedure is as follows:

1. Perform a brute-force, in-place vocab substitution: replace the vocab and tokenizer of the original English text encoder with those of Chinese RoBERTa, so that Chinese token sequences directly pick up embedding vectors from the original embedding layer.
2. Copy all weights from the original CLIP-VIT-H text encoder.
3. Freeze the entire vision model, the text encoder layers, and the text projection layer; only the text embedding layer is trained. The goal is to align the Chinese word embeddings with the original English ones while keeping the semantic space as unchanged as possible, so that the final projected latent space does not drift far away (see the sketch after this list).
4. After a number of steps, unfreeze the entire text encoder so the whole text model can fit the semantic space of the CLIP-VIT-H image encoder.
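
A minimal sketch of this freeze/unfreeze schedule, assuming the Hugging Face transformers CLIP implementation; the LAION ViT-H checkpoint and the hfl/chinese-roberta-wwm-ext tokenizer below are illustrative stand-ins, not confirmed training details:

```python
from transformers import AutoTokenizer, CLIPModel

# Steps 1-2: start from the original CLIP-ViT-H weights and swap in a
# Chinese tokenizer (both checkpoint names are assumptions). The Chinese
# vocab is smaller than CLIP's, so Chinese token IDs simply index rows
# of the original embedding table.
model = CLIPModel.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
zh_tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")

# Step 3: freeze everything, then unfreeze only the token embedding layer
# of the text encoder.
for param in model.parameters():
    param.requires_grad = False
model.text_model.embeddings.token_embedding.requires_grad_(True)

# Step 4, after enough training steps: unfreeze the whole text encoder
# (whether the text projection is also unfrozen here is an assumption).
model.text_model.requires_grad_(True)
model.text_projection.requires_grad_(True)
```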

Note: We optimize the Chinese text encoder with the CLIP contrastive loss, as sketched below. The Chinese subset of LAION-5B (around 85M text-image pairs) is our training set. This model was trained for 75k steps with a batch size of 4096, so it is still far from convergence.
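
For reference, this is the standard symmetric formulation of the CLIP loss (a generic sketch, not the exact training code):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    """Symmetric contrastive loss over L2-normalized feature batches."""
    # Similarity logits between every image and every text in the batch.
    logits = logit_scale * image_features @ text_features.t()
    # Matching image-text pairs lie on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image cross-entropies.
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
```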

## Usage

### Zero-Shot Classification

```python
import torch
import numpy as np
import requests
from PIL import Image
from transformers import CLIPModel, CLIPFeatureExtractor, AutoTokenizer

model_id = "lyua1225/clip-huge-zh-75k-steps-bs4096"
model = CLIPModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = CLIPFeatureExtractor.from_pretrained(model_id)

# online example from OFA-Sys
url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
texts = ["杰尼龟", "妙蛙种子", "皮卡丘", "小火龙"]

# compute image features
pixel_values = torch.from_numpy(processor(image).pixel_values[0]).unsqueeze(0)
image_features = model.get_image_features(pixel_values=pixel_values)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
text_inputs = tokenizer(text=texts, padding="max_length", max_length=77, return_tensors="pt")
text_features = model.get_text_features(**text_inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute probabilities for each class
logit_scale = model.logit_scale.exp()
logits_per_image = logit_scale * image_features @ text_features.t()
logits_per_text = logits_per_image.t()
probs = logits_per_image.softmax(dim=-1).detach().numpy()
print(np.around(probs, 3))
```


### Guiding Stable Diffusion V2.1

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import AutoTokenizer, CLIPTextModel

clip_id = "lyua1225/clip-huge-zh-75k-steps-bs4096"
sd2_id = "stabilityai/stable-diffusion-2-1"

# Replace the pipeline's English text encoder and tokenizer with the Chinese ones.
text_encoder = CLIPTextModel.from_pretrained(clip_id).half()
tokenizer = AutoTokenizer.from_pretrained(clip_id, trust_remote_code=True)
pipe = StableDiffusionPipeline.from_pretrained(
    sd2_id,
    torch_dtype=torch.float16,
    revision="fp16",
    tokenizer=tokenizer,
    text_encoder=text_encoder,
)
pipe.to("cuda")

image = pipe("赛博朋克风格的城市街道", num_inference_steps=20).images[0]  # "cyberpunk-style city street"
image.save("cyberpunk.jpeg")
```