File size: 757 Bytes
751936e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47




词典大小  250680   来自 https://huggingface.co/bigscience/bloom#preprocessing 
"vocab_size": 250880


## OOV

有些空格没编码进去,详见`test_oov.py`

## 中文词典

一个中文几个id?


## 

```
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": " ?[^(\\s|[.,!?…。,、।۔،])]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
	"use_regex": false
      }
    ]
  },
  "post_processor": {
    "type": "ByteLevel",
    "add_prefix_space": true,
    "trim_offsets": false,
    "use_regex": false

  },
```