Does it support Chinese and English mixed input?

by Baicai003

such as "a big apple, 香蕉,橘子" (banana, tangerine)

OFA-Sys org

Hi, first we would suggest trying your samples on our Space https://huggingface.co/spaces/OFA-Sys/chinese-clip-zero-shot-image-classification instead of the demo on the model card. The demo on the model card is automatically generated by the transformers package and currently only supports a fixed English prompt template. On the Space you can try your own custom templates (we also provide a default Chinese prompt template), which may give better performance.

Coming back to your question: since Chinese-CLIP is initialized from the OpenAI CLIP weights, and our pretraining dataset may contain samples with other languages mixed in, the model can handle some cases where the input languages are mixed (note that on this demo the prompt template is in English but it still works), including your example "a big apple, 香蕉,橘子" (we just tried it 😁 on the Space, you can have a try). However, we cannot guarantee that this always works. Using the huge-size model may give better performance in this setting.
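For reference, here is a minimal sketch of running your mixed Chinese/English labels through Chinese-CLIP with the transformers library; the image URL is only a placeholder you would replace with your own picture, while the model and processor classes are the documented ones:

import requests
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch16")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch16")

url = "https://example.com/fruit.jpg"  # placeholder image URL, replace with your own
image = Image.open(requests.get(url, stream=True).raw)

# mixed Chinese/English candidate labels, as in the question
texts = ["a big apple", "香蕉", "橘子"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# similarity of the image to each label, normalized to probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)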

I see that the "vocab_size" is 21128, while in OpenAI CLIP it is 49408.
When I try to use ChineseCLIPTextModel from transformers to get text embeddings for mixed Chinese and English input (also containing some other characters such as "()" and ":"), I get an error saying the index is out of range.

In fact, the text encoder and tokenizer of Chinese-CLIP have nothing to do with OpenAI CLIP. The vocab size of 21128 is adopted from Chinese BERT (that is, Chinese RoBERTa-wwm-ext, https://huggingface.co/hfl/chinese-roberta-wwm-ext). May I ask how you compute the text embeddings in your code? Are you following the code snippet we provide in the documentation (https://huggingface.co/docs/transformers/model_doc/chinese_clip#usage) or in the model card? In our tests it supports mixed-language text input; otherwise our demo would not work with such input.
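As a minimal sketch, assuming the "index out of range" comes from pairing the text model with the wrong tokenizer: the Chinese-CLIP text encoder must be used with its own BERT-style, 21128-token tokenizer (loaded from the same checkpoint), not the OpenAI CLIP tokenizer, whose 49408-token ids exceed the embedding table.

import torch
from transformers import AutoTokenizer, ChineseCLIPTextModel

name = "OFA-Sys/chinese-clip-vit-large-patch16"
tokenizer = AutoTokenizer.from_pretrained(name)   # BERT-style tokenizer shipped with the checkpoint (vocab_size 21128)
model = ChineseCLIPTextModel.from_pretrained(name)

text = "a big apple, 香蕉,橘子 ():"  # mixed Chinese/English input with extra punctuation
inputs = tokenizer(text, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

token_features = outputs.last_hidden_state  # per-token features
pooled = outputs.pooler_output              # pooled sentence feature
# for the projected CLIP text features, use ChineseCLIPModel.get_text_features instead
print(token_features.shape, pooled.shape)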

I just tried the example code from the diffusers GitHub README.md and switched the text encoder to ChineseCLIPTextModel.

import torch
from diffusers import StableDiffusionPipeline
from transformers import ChineseCLIPTextModel

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe.text_encoder = ChineseCLIPTextModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch16")
pipe = pipe.to("cuda")

prompt = "masterpiece, best quality, a photo of 漂亮女孩,蓝色眼睛,金色头发"
image = pipe(prompt).images[0]  

I do not know how to make Chinese-CLIP work with Stable Diffusion.

Hi, I think you should change the tokenizer as well, since the tokenizers of Chinese-CLIP and OpenAI CLIP are not the same. You will then also need to finetune the text encoder, since the embedding space of Chinese-CLIP diverged from OpenAI CLIP during Chinese-CLIP pretraining.
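Roughly, a sketch of swapping both pieces could look like the following (the fp16 cast is just to match the pipeline dtype). Even if the tensor shapes happen to line up, the text embedding space no longer matches what the Stable Diffusion UNet was trained on, so finetuning is still needed before you can expect reasonable images:

import torch
from diffusers import StableDiffusionPipeline
from transformers import AutoTokenizer, ChineseCLIPTextModel

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# swap BOTH the tokenizer and the text encoder to the Chinese-CLIP ones
pipe.tokenizer = AutoTokenizer.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch16")
pipe.text_encoder = ChineseCLIPTextModel.from_pretrained(
    "OFA-Sys/chinese-clip-vit-large-patch16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "masterpiece, best quality, a photo of 漂亮女孩,蓝色眼睛,金色头发"
image = pipe(prompt).images[0]  # expect poor quality until the pipeline is finetuned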
