Is there a HF equivalent tokenizer?

#2
by versae - opened

The documentation says the text model is XLM-RoBERTa, but when I compare the token IDs from the HF tokenizer with those from the OpenCLIP tokenizer, I get very different results.

import numpy as np
import open_clip
from transformers import AutoTokenizer

# HF XLM-RoBERTa tokenizer vs. the tokenizer OpenCLIP returns for the plain ViT-B-32 config
roberta = AutoTokenizer.from_pretrained("xlm-roberta-base")
tokenizer = open_clip.get_tokenizer('ViT-B-32')

np.array(roberta.encode("A dog", padding="max_length", max_length=77))
# array([    0,    62, 10269,     2,     1,     1,     1,     1,     1,
#            1,     1,     1,     1,     1,     1,     1,     1,     1,
#            1,     1,     1,     1,     1,     1,     1,     1,     1,
#            1,     1,     1,     1,     1,     1,     1,     1,     1,
#            1,     1,     1,     1,     1,     1,     1,     1,     1,
#            1,     1,     1,     1,     1,     1,     1,     1,     1,
#            1,     1,     1,     1,     1,     1,     1,     1,     1,
#            1,     1,     1,     1,     1,     1,     1,     1,     1,
#            1,     1,     1,     1,     1])

tokenizer("A dog")[0]
# tensor([49406,   320,  1929, 49407,     0,     0,     0,     0,     0,     0,
#             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
#             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
#             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
#             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
#             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
#             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
#             0,     0,     0,     0,     0,     0,     0])
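The IDs above don't even come from the same vocabulary. A quick way to see that is to check what get_tokenizer('ViT-B-32') actually returns (a minimal sketch; the exact class depends on the open_clip version):

import open_clip

default_tok = open_clip.get_tokenizer('ViT-B-32')
print(type(default_tok))
# e.g. <class 'open_clip.tokenizer.SimpleTokenizer'> -- CLIP's own BPE tokenizer,
# not an XLM-RoBERTa tokenizer, which explains the completely different id space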

Oops, never mind. The right model name to use in OpenCLIP is xlm-roberta-base-ViT-B-32 (the plain ViT-B-32 config uses CLIP's BPE tokenizer, whereas the xlm-roberta-base variant wraps the HF XLM-RoBERTa tokenizer):

In [79]: tokenizer = open_clip.get_tokenizer('xlm-roberta-base-ViT-B-32')

In [80]: tokenizer("A dog")[0]
Out[80]: 
tensor([    0,    62, 10269,     2,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1])

In [81]: np.array(roberta.encode("A dog", padding="max_length", max_length=77))
Out[81]: 
array([    0,    62, 10269,     2,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1])
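For completeness, a quick sanity check that the two tokenizers agree beyond a single string (a minimal sketch; the example texts are arbitrary and this assumes a recent open_clip/transformers):

import numpy as np
import open_clip
from transformers import AutoTokenizer

hf_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
oc_tok = open_clip.get_tokenizer("xlm-roberta-base-ViT-B-32")

for text in ["A dog", "Ein Hund läuft im Park", "un perro pequeño"]:
    hf_ids = np.array(hf_tok.encode(text, padding="max_length", max_length=77))
    oc_ids = oc_tok(text)[0].numpy()
    # Both should produce the same SentencePiece ids, padded to 77 tokens.
    assert np.array_equal(hf_ids, oc_ids), text

print("token ids match")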
versae changed discussion status to closed
