Vocabulary size: 250680, from https://huggingface.co/bigscience/bloom#preprocessing; compare "vocab_size": 250880

OOV

Some spaces are not encoded; see test_oov.py for details.

Chinese vocabulary

How many ids does one Chinese character take?
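
A rough way to bound the answer, assuming the tokenizer is byte-level BPE (the config in this file uses a ByteLevel pre-tokenizer): with no merges applied, each UTF-8 byte becomes one id, so the byte count of a character is an upper bound on its id count. The real count depends on the learned merges and can only be checked against the actual tokenizer. The helper name `utf8_byte_count` is made up for this sketch.

```python
def utf8_byte_count(ch):
    """Number of UTF-8 bytes in ch.

    Assumption: for a byte-level tokenizer this is the worst-case
    number of ids for ch, reached when no merge covers its bytes.
    """
    return len(ch.encode("utf-8"))


# Common Chinese characters are 3 bytes each in UTF-8.
for ch in "中文词典":
    print(ch, utf8_byte_count(ch))
```

So a common Chinese character should take between 1 and 3 ids under this assumption.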

  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": " ?[^(\\s|[.,!?…。,、।۔،])]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },
  "post_processor": {
    "type": "ByteLevel",
    "add_prefix_space": true,
    "trim_offsets": false,
    "use_regex": false
  },
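
To get a feel for what the Split pre-tokenizer above does, here is a minimal sketch in plain Python, not the tokenizers library itself. Two assumptions: the regex engine used by the library treats the inner `[...]` as a nested character class, so the pattern means "an optional space followed by a run of characters that are not whitespace, parentheses, `|`, or the listed punctuation"; and "Isolated" keeps each match as its own piece while the unmatched text in between is kept as pieces too. Python's `re` does not support nested classes, so the class is flattened by hand. `isolated_split` is a hypothetical helper name.

```python
import re

# Hand-flattened version of the pattern in the config (an assumption,
# see the paragraph above): optional leading space, then one or more
# characters outside the excluded set.
PATTERN = re.compile(r" ?[^\s(|).,!?…。,、।۔،]+")


def isolated_split(text):
    """Split text into regex matches and the gaps between them, in order."""
    pieces = []
    cursor = 0
    for m in PATTERN.finditer(text):
        if m.start() > cursor:
            # Unmatched text (punctuation, extra whitespace) stays as a piece.
            pieces.append(text[cursor:m.start()])
        pieces.append(m.group())
        cursor = m.end()
    if cursor < len(text):
        pieces.append(text[cursor:])
    return pieces


print(isolated_split("Hello, world!"))  # ['Hello', ',', ' world', '!']
print(isolated_split("你好。世界"))       # ['你好', '。', '世界']
```

Chinese text without spaces stays in one piece until a listed punctuation mark, so how it maps to ids is decided by the byte-level step and the merges that follow.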