Is there a HF equivalent tokenizer?
#2, opened by versae
The documentation says the text model is XLM-RoBERTa, but when I compare the token IDs from the HF tokenizer and the OpenCLIP tokenizer, I get very different results:
```python
import numpy as np
import open_clip
from transformers import AutoTokenizer

roberta = AutoTokenizer.from_pretrained("xlm-roberta-base")
tokenizer = open_clip.get_tokenizer('ViT-B-32')

np.array(roberta.encode("A dog", padding="max_length", max_length=77))
# array([    0,    62, 10269,     2,     1, ...,     1])  (length 77, padded with 1)

tokenizer("A dog")[0]
# tensor([49406,   320,  1929, 49407,     0, ...,     0])  (length 77, padded with 0)
```
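The mismatch above is expected: open_clip's default `ViT-B-32` tokenizer is the original CLIP BPE tokenizer, while `xlm-roberta-base` uses a SentencePiece vocabulary with a completely different special-token layout. A minimal sketch contrasting the special-token IDs (all values copied from the outputs above):

```python
# Special-token ids, taken from the two outputs above.
clip_bpe = {"sot": 49406, "eot": 49407, "pad": 0}  # open_clip 'ViT-B-32' (CLIP BPE)
xlm_roberta = {"bos": 0, "eos": 2, "pad": 1}       # HF 'xlm-roberta-base' (SentencePiece)

# The two vocabularies share no special-token layout, so raw id
# sequences from one tokenizer are meaningless to the other model.
print(clip_bpe["sot"] == xlm_roberta["bos"])  # False
print(clip_bpe["pad"] == xlm_roberta["pad"])  # False
```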
Oops, never mind. The right model to use in OpenCLIP was `xlm-roberta-base-ViT-B-32`:
```python
In [79]: tokenizer = open_clip.get_tokenizer('xlm-roberta-base-ViT-B-32')

In [80]: tokenizer("A dog")[0]
Out[80]: tensor([    0,    62, 10269,     2,     1, ...,     1])  # length 77, padded with 1

In [81]: np.array(roberta.encode("A dog", padding="max_length", max_length=77))
Out[81]: array([    0,    62, 10269,     2,     1, ...,     1])  # length 77, padded with 1
```
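For completeness, the agreement between the two outputs can be checked programmatically. A minimal sketch (the `check_match` helper is hypothetical; the token IDs are copied from the outputs above):

```python
import numpy as np

# Ids for "A dog" as reported by both tokenizers above:
# <s>=0, "A"=62, "dog"=10269, </s>=2, then <pad>=1 up to context length 77.
expected = np.array([0, 62, 10269, 2] + [1] * 73)

def check_match(hf_ids, clip_ids):
    """Return True when both tokenizers produce identical 77-token sequences."""
    hf_ids = np.asarray(hf_ids)
    clip_ids = np.asarray(clip_ids)
    return hf_ids.shape == (77,) and np.array_equal(hf_ids, clip_ids)

print(check_match(expected, expected))  # True
```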
versae changed discussion status to closed