## vocab_file

- ice_text.model
  - Binary file
  - num_image_tokens = 20000
  - Text vocabulary size = 150528 - 20000 = 130528
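
The vocabulary arithmetic above can be sketched in a few lines. The constant names and the `is_text_id` helper are hypothetical, and the ID layout (image tokens first, text tokens after) is an assumption inferred from the sample IDs in this note, all of which are at or above 20000.

```python
# Sizes taken from the notes above.
TOTAL_VOCAB_SIZE = 150528
NUM_IMAGE_TOKENS = 20000

# Text vocabulary = everything that is not an image token.
TEXT_VOCAB_SIZE = TOTAL_VOCAB_SIZE - NUM_IMAGE_TOKENS


def is_text_id(token_id: int) -> bool:
    """Hypothetical helper: assumes image tokens occupy IDs [0, 20000)."""
    return NUM_IMAGE_TOKENS <= token_id < TOTAL_VOCAB_SIZE


if __name__ == "__main__":
    print(TEXT_VOCAB_SIZE)
    print(is_text_id(20315))  # ID of '▁good' in the samples
```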


```
tokens:  ['▁good', '▁morning'] ;	            id:  [20315, 21774] ;	            text:  good morning
tokens:  ['▁good', '<|blank_2|>', 'morning'] ;	id:  [20315, 150009, 60813] ;	    text:  good  morning
tokens:  ['▁', 'goog', '▁morning', 'abc'] ;     id:  [20005, 46456, 21774, 27415] ;	text:  goog morningabc
tokens:  ['▁', '你是谁'] ;	                    id:  [20005, 128293] ;	            text:  你是谁
```

`▁` (U+2581) is the SentencePiece marker for a space. Do not confuse it with the ASCII underscore `_` (U+005F).
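
A minimal sketch of how the token lists above map back to text, written in plain Python rather than with the real tokenizer. The `detokenize` function is hypothetical, and treating `<|blank_n|>` as `n` spaces is an assumption based on the second sample, where `<|blank_2|>` corresponds to two spaces.

```python
import re
import unicodedata

SPACE_MARKER = "\u2581"  # '▁', not the ASCII underscore '_' (U+005F)


def detokenize(tokens):
    """Hypothetical helper: join pieces and restore spaces."""
    text = "".join(tokens)
    # Assumption: <|blank_n|> stands for n consecutive spaces.
    text = re.sub(r"<\|blank_(\d+)\|>", lambda m: " " * int(m.group(1)), text)
    # Each marker becomes a space; drop the one at the start of the text.
    return text.replace(SPACE_MARKER, " ").lstrip(" ")


if __name__ == "__main__":
    print(unicodedata.name(SPACE_MARKER))
    print(unicodedata.name("_"))
    print(detokenize(["▁good", "▁morning"]))
    print(detokenize(["▁good", "<|blank_2|>", "morning"]))
```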


## 

```
    tokenizer = TextTokenizer(self.vocab_file)
```