Alexandru Gherghescu committed
Commit: 8b5602c
Parent: fe8246f

Fix tokenizer


Instead of having a trained tokenizer from scratch, replace it with the
actual tokenizer used by the original model.

Note that while the vocabulary and merges come from the original GPT-1
model, the pre- and post-processing might differ slightly, since the two
implementations use different tokenization methods (spaCy vs.
HuggingFace's tokenizers).
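
To make the possible mismatch concrete, here is a minimal sketch (not part of the commit) that compares how the original spaCy-based GPT-1 tokenizer and the replacement fast tokenizer split the same text. The Hub id "openai-community/openai-gpt" and the local "tokenizer.json" path are assumptions for illustration.

from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Original GPT-1 tokenizer; with use_fast=False it relies on spaCy + ftfy when
# installed, otherwise it falls back to a BERT-style basic tokenizer.
reference = AutoTokenizer.from_pretrained("openai-community/openai-gpt", use_fast=False)

# The fast tokenizer shipped by this commit (vocabulary and merges from GPT-1).
replacement = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

text = "Tokenizers don't always agree on pre-processing!"
print(reference.tokenize(text))
print(replacement.tokenize(text))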

special_tokens_map.json CHANGED
@@ -1,3 +1,3 @@
 {
-  "eos_token": "<|endoftext|>"
+  "unk_token": "<unk>"
 }
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1,7 +1,7 @@
 {
   "added_tokens_decoder": {
     "0": {
-      "content": "<|endoftext|>",
+      "content": "<unk>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
@@ -10,7 +10,7 @@
     }
   },
   "clean_up_tokenization_spaces": true,
-  "eos_token": "<|endoftext|>",
   "model_max_length": 1000000000000000019884624838656,
-  "tokenizer_class": "PreTrainedTokenizerFast"
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "<unk>"
 }
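
After this change, loading the repository with HuggingFace's transformers should report <unk> as the unknown token and no EOS token, matching the updated config. A minimal sketch, assuming the files from this commit sit in the current directory:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(".")  # directory with the updated tokenizer files
print(tok.unk_token)  # expected: <unk>
print(tok.eos_token)  # expected: None, since eos_token was removed from the config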