# Background

GPT-2 uses byte-level BPE; BERT uses char-level BPE.
- BPE on a Unicode (character) sequence
- BPE on a UTF-8 byte sequence

From https://huggingface.co/gpt2/tree/main
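To make the two options concrete, here is a minimal illustration in plain Python (no tokenizer needed) of what each variant of BPE actually sees as its input symbols:

```python
s = "naïve"

# BPE on a Unicode sequence: 5 symbols, one per character.
print(list(s))                  # ['n', 'a', 'ï', 'v', 'e']

# BPE on a UTF-8 byte sequence: 'ï' encodes to two bytes, so 6 symbols.
print(list(s.encode("utf-8")))  # [110, 97, 195, 175, 118, 101]
```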
### Problems with plain BPE
- Running BPE directly on raw text merges punctuation into words, so `dog.`, `dog!`, etc. end up as single tokens.
byte-level BPE:
- BPE folds the leading space into the following word, e.g. bpe.decode(bpes[1:2]) = ' world'. For an NER task, does that mean the space gets labeled as part of the entity?
- BPE treats 'world' and ' world' as two completely different tokens. That seems undesirable (see the sketch after this list).
- Casing: the vocabulary is case-sensitive, so 'world' and 'World' are likewise distinct tokens.
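A minimal sketch of the last two points, using GPT2Tokenizer from the `transformers` package (assumed installed; exact token ids depend on the vocab):

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

# The leading space is folded into the token text itself:
print(tok.tokenize("world"))   # ['world']
print(tok.tokenize(" world"))  # ['Ġworld'] -- a completely different token/id

# Casing produces yet another distinct token:
print(tok.tokenize(" World"))  # ['ĠWorld']
```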
### How is this solved?
### GPT-2's approach
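The released openai/gpt-2 code pre-splits text with a regular expression before BPE runs, so merges can never cross letter/number/punctuation boundaries; this is what rules out tokens like 'dog.'. A sketch using the pattern from encoder.py (it needs the third-party `regex` package, since `\p{...}` classes are not in the stdlib `re`):

```python
import regex as re

# Pre-tokenization pattern from openai/gpt-2 encoder.py:
pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# Punctuation is split off before BPE ever sees the text,
# so 'dog.' cannot be merged into one token:
print(re.findall(pat, "dog. dog!"))  # ['dog', '.', ' dog', '!']
```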
# Downloads
### Official
### Hugging Face (= official)
- [vocab.json](https://huggingface.co/gpt2-large/resolve/main/vocab.json): 50257 key-value pairs. https://huggingface.co/gpt2/resolve/main/vocab.json
- [merges.txt](https://huggingface.co/gpt2-large/resolve/main/merges.txt): 50001 lines. https://huggingface.co/gpt2/resolve/main/merges.txt
- Does merges.txt contain every possible combination? https://github.com/huggingface/transformers/issues/4777 (see the sketch after this list)
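A sketch of what those 50001 lines hold, assuming a local copy of merges.txt: the first line is a version header and the remaining 50000 are merge rules in priority order, i.e. only the merges learned during training, not every possible pair:

```python
with open("merges.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines[0])     # '#version: 0.2' -- a header, not a merge rule
merges = [tuple(line.split()) for line in lines[1:]]
print(len(merges))  # 50000 learned merge rules
print(merges[0])    # ('Ġ', 't') -- the highest-priority merge

# Lower rank = applied first when encoding:
bpe_ranks = dict(zip(merges, range(len(merges))))
```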
### fairseq (= official)
- vocab.bpe: 50001 lines (the same merge list as merges.txt above)
- encoder.json: 50257 key-value pairs (the same mapping as vocab.json)
- dict.txt: 50260 lines, purely numeric; generated by fairseq-preprocess. https://github.com/pytorch/fairseq/issues/1186
- https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
- https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
- https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
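A quick sketch to fetch the fairseq copies and confirm the counts quoted above:

```python
import json
import urllib.request

base = "https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/"

# encoder.json is a token -> id mapping, like Hugging Face's vocab.json:
encoder = json.loads(urllib.request.urlopen(base + "encoder.json").read())
print(len(encoder))  # 50257

# vocab.bpe and dict.txt are plain text; just count their lines:
for name in ("vocab.bpe", "dict.txt"):
    text = urllib.request.urlopen(base + name).read().decode("utf-8")
    print(name, len(text.splitlines()))  # 50001 and 50260
```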
# Open questions
### What is Ġ?
It's a feature of byte-level BPE (an encoded space character).
Ġ represents a space; some versions show Ä in place of Ġ.
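Where the character comes from, as a one-line sketch: GPT-2's byte-to-unicode table remaps unprintable bytes to code points from 256 upward, and the space byte 0x20 lands on U+0120:

```python
# The space byte (0x20) is remapped to a printable stand-in:
print(chr(ord(" ") + 256))  # 'Ġ' (U+0120)
```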
For example:
```
What's up with the tokenizer?
# After BPE
['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
# After mapping through vocab.json
[2061, 338, 510, 351, 262, 11241, 7509, 30]
# After mapping through dict.txt (fairseq-specific)
[ different numbers ]
```
Question: 'up' gets a Ġ, so why doesn't 'What'? Because Ġ encodes the preceding space, and there is no space before the first word (see the sketch below).
- https://github.com/pytorch/fairseq/issues/1716
- https://github.com/huggingface/transformers/issues/1083
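A sketch confirming this with GPT2Tokenizer from `transformers`: add a leading space and 'What' gets its Ġ as well:

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

print(tok.tokenize("What's up with the tokenizer?"))
# ['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']

print(tok.tokenize(" What's up"))
# ['ĠWhat', "'s", 'Ġup'] -- with a leading space, 'What' gets Ġ too
```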