
vocab_file

  • ice_text.model
    • binary file (a SentencePiece model)
    • num_image_tokens = 20000; text vocabulary size = 150528 - 20000 = 130528 (see the sketch after this list)
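A minimal sketch, assuming `sentencepiece` is installed and ice_text.model has been downloaded locally, that loads the binary model and checks the vocabulary arithmetic above. Note that the ids in the examples below all start above 20000, consistent with text ids being offset past the reserved image-token range.

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load("ice_text.model")            # the binary file described above

    num_image_tokens = 20000             # ids [0, 20000) are reserved for image tokens
    total_vocab_size = 150528            # full vocabulary size from the notes
    print(total_vocab_size - num_image_tokens)  # 130528 text tokens
    # The model's own piece count may differ slightly if some special tokens
    # (e.g. the <|blank_n|> tokens below) live outside the SentencePiece model:
    print(sp.GetPieceSize())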
    tokens: ['▁good', '▁morning']                ; id: [20315, 21774]               ; text: good morning
    tokens: ['▁good', '<|blank_2|>', 'morning']  ; id: [20315, 150009, 60813]       ; text: good  morning
    tokens: ['▁', 'goog', '▁morning', 'abc']     ; id: [20005, 46456, 21774, 27415] ; text: goog morningabc
    tokens: ['▁', '你是谁']                       ; id: [20005, 128293]              ; text: 你是谁
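A hedged sketch of how triples like the lines above could be produced. It goes through the raw SentencePiece model rather than TextTokenizer; the +20000 id offset is inferred from the example ids, which all land past the reserved image-token range, and the '<|blank_2|>' token for a double space appears to come from extra pre-processing outside the raw model, so it is not reproduced here.

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load("ice_text.model")

    for text in ["good morning", "goog morningabc", "你是谁"]:
        pieces = sp.EncodeAsPieces(text)                 # e.g. ['▁good', '▁morning']
        ids = [i + 20000 for i in sp.EncodeAsIds(text)]  # shift past image tokens
        print("tokens:", pieces, "; id:", ids, "; text:", sp.DecodePieces(pieces))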

What is ▁? Is it a space? In SentencePiece, ▁ (U+2581) marks a leading space at the start of a token; take care to distinguish it from the underscore _.
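A quick check that the two characters really are different code points:

    # ▁ is U+2581 (LOWER ONE EIGHTH BLOCK), SentencePiece's space marker;
    # _ is the ASCII underscore U+005F.
    print(hex(ord("▁")))  # 0x2581
    print(hex(ord("_")))  # 0x5f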

    tokenizer = TextTokenizer(self.vocab_file)  # wraps the SentencePiece model in ice_text.model