Discrepancies between DONUT / BART Tokenizer and missing characters

#8
by DieseKartoffel - opened

Hi there,

As per our understanding of DONUT, it uses a pre-trained BART as the decoder, specifically the asian-bart-ecjk model mentioned on page 5 of the paper. However, we noticed that the vocabulary in Donut's tokenizer.json does not match the list of possible output tokens from the original asian-bart-ecjk tokenizer.json. Can someone explain why that is? Where does the vocabulary of the Donut decoder come from if the decoder was pre-trained?

In addition, we found that there is no valid token id for the character "1". In practice that means whenever a result from a Donut model contains a "1" that cannot be tokenized into a larger token (i.e. something like "15"), it decodes to token id 3, "<unk>", which we observe regularly and are currently fixing manually in post-processing.

Here is a code sample to show what I am talking about:

from transformers import AutoTokenizer

donut_tokenizer = AutoTokenizer.from_pretrained("naver-clova-ix/donut-base")

text = "ABC1QRF7"  # the standalone "1" has no token of its own in the Donut vocab
encoded = donut_tokenizer.encode(text)
decoded = donut_tokenizer.decode(encoded, skip_special_tokens=False) 
print(f"Decoded String: {decoded}") 

This will result in: <s>ABC<unk>QRF7</s>
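
For completeness, here is a quick sanity check (reusing the donut_tokenizer from above) showing that the standalone character falls straight through to the unknown-token id:

print(donut_tokenizer.unk_token, donut_tokenizer.unk_token_id)  # <unk> 3
# "1" has no entry of its own, so it resolves to the unk id
print(donut_tokenizer.convert_tokens_to_ids("1") == donut_tokenizer.unk_token_id)  # True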

Here is sample code to show that the vocabulary differs from the BART model referenced by the authors:

from transformers import AutoTokenizer

donut_tokenizer = AutoTokenizer.from_pretrained("naver-clova-ix/donut-base")
donut_vocab = donut_tokenizer.get_vocab()
bart_tokenizer = AutoTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")
bart_vocab = bart_tokenizer.get_vocab()

print(f"Donut Vocabulary size: {len(donut_vocab)}") # 57525
print(f"Bart Vocabulary size: {len(bart_vocab)}") # 57547
print("Tokens in Donut, not in Bart:", len(set(donut_vocab) - set(bart_vocab))) # 18601
print("Tokens in Bart, not in Donut:", len(set(bart_vocab) - set(donut_vocab))) # 18623

assert "1" in bart_vocab # works fine
assert "1" in donut_vocab # AssertionError

We would love to better understand where the Donut vocabulary comes from, and to hear whether anyone else ran into this problem or found a better fix.
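
A possible alternative to the post-processing hack (only a sketch on our side, assuming the Hugging Face DonutProcessor / VisionEncoderDecoderModel integration; the new embedding row starts out untrained, so this only makes sense if the model is fine-tuned afterwards) would be to register the missing character as an extra token and resize the decoder embeddings:

from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Register "1" as a regular token and grow the decoder's embedding matrix to match.
num_added = processor.tokenizer.add_tokens(["1"])
if num_added > 0:
    model.decoder.resize_token_embeddings(len(processor.tokenizer))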

Thanks!

I found an interesting case with the CORD example.
When I load the same tokenizer, hyunwoongko/asian-bart-ecjk, with XLMRobertaTokenizerFast, it does not produce an <unk> token for "1".
On the other hand, it does produce <unk> when the slow XLMRobertaTokenizer is used.

Did I miss something here?

from transformers import XLMRobertaTokenizer, XLMRobertaTokenizerFast

new_tokenizer2 = XLMRobertaTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")
new_tokenizer2.add_special_tokens(
    {"additional_special_tokens": [
        "<s_cord-v2>", "</s>", "<s_menu>", "</s_menu>", "<s_nm>", "</s_nm>", "<s_unitprice>", "</s_unitprice>", "<s_cnt>", "</s_cnt>", 
        "<s_price>", "</s_price>", "<s_total>", "</s_total>", "<s_total_price>", "</s_total_price>", "<s_cashprice>", "</s_cashprice>",
        "<s_changeprice>", "</s_changeprice>", "<s_menuqty_cnt>", "</s_menuqty_cnt>"]},
     replace_additional_special_tokens=False)

new_tokenizer2_fast = XLMRobertaTokenizerFast.from_pretrained("hyunwoongko/asian-bart-ecjk")
new_tokenizer2_fast.add_special_tokens(
    {"additional_special_tokens": [
        "<s_cord-v2>", "</s>", "<s_menu>", "</s_menu>", "<s_nm>", "</s_nm>", "<s_unitprice>", "</s_unitprice>", "<s_cnt>", "</s_cnt>", 
        "<s_price>", "</s_price>", "<s_total>", "</s_total>", "<s_total_price>", "</s_total_price>", "<s_cashprice>", "</s_cashprice>",
        "<s_changeprice>", "</s_changeprice>", "<s_menuqty_cnt>", "</s_menuqty_cnt>"]},
     replace_additional_special_tokens=False)

# CORD example
s = "<s_cord-v2><s_menu><s_nm>2005-CHEESE JOHN</s_nm><s_unitprice>9.500,00</s_unitprice><s_cnt>x1</s_cnt><s_price>9.500,00</s_price></s_menu><s_total><s_total_price>9.500,00</s_total_price><s_cashprice>20.000,00</s_cashprice><s_changeprice>10.500</s_changeprice><s_menuqty_cnt>1</s_menuqty_cnt></s_total></s>"

input_ids2 = new_tokenizer2(s, add_special_tokens=False)["input_ids"]
restored_tokens2 = "".join(new_tokenizer2.convert_ids_to_tokens(input_ids2))
print(restored_tokens2)
if new_tokenizer2.unk_token_id in input_ids2:
    print("unk found!")
else:
    print("ok!")
print()

input_ids2_fast = new_tokenizer2_fast(s, add_special_tokens=False)["input_ids"]
restored_tokens2_fast = "".join(new_tokenizer2_fast.convert_ids_to_tokens(input_ids2_fast))
print(restored_tokens2_fast)
if new_tokenizer2_fast.unk_token_id in input_ids2_fast:
    print("unk found!")
else:
    print("ok!")
print()
<s_cord-v2><s_menu><s_nm>▁2005-CHEESE▁JOHN</s_nm><s_unitprice>▁9.500,00</s_unitprice><s_cnt>▁x<unk></s_cnt><s_price>▁9.500,00</s_price></s_menu><s_total><s_total_price>▁9.500,00</s_total_price><s_cashprice>▁20.000,00</s_cashprice><s_changeprice>▁10.500</s_changeprice><s_menuqty_cnt>▁1</s_menuqty_cnt></s_total></s>
unk found!

<s_cord-v2><s_menu><s_nm>▁2005-CHEESE▁JOHN</s_nm><s_unitprice>▁9.500,00</s_unitprice><s_cnt>▁x1</s_cnt><s_price>▁9.500,00</s_price></s_menu><s_total><s_total_price>▁9.500,00</s_total_price><s_cashprice>▁20.000,00</s_cashprice><s_changeprice>▁10.500</s_changeprice><s_menuqty_cnt>▁1</s_menuqty_cnt></s_total></s>
ok!
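
A smaller probe that isolates the difference (my understanding is that the slow XLMRobertaTokenizer tokenizes through the sentencepiece model file, while the fast one is backed by the tokenizers library with a converted vocabulary, so the two can apparently disagree on rare pieces):

from transformers import XLMRobertaTokenizer, XLMRobertaTokenizerFast

slow = XLMRobertaTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")
fast = XLMRobertaTokenizerFast.from_pretrained("hyunwoongko/asian-bart-ecjk")

# Compare how each backend segments the problematic substring and whether
# the standalone "1" resolves to a real id or falls back to <unk>.
print(slow.tokenize("x1"), fast.tokenize("x1"))
print(slow.convert_tokens_to_ids("1"), slow.unk_token_id)
print(fast.convert_tokens_to_ids("1"), fast.unk_token_id)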
