versae committed on
Commit
04b8476
1 Parent(s): 384dd2d

Add HF tokenizer converted from SentencePiece

convert.sh ADDED
@@ -0,0 +1,18 @@
+ # Convert a SentencePiece BPE model trained with byte_fallback (for autoregressive models) to a Hugging Face tokenizer
+ # ./spm_train --vocab_size 32000 --character_coverage 1.0 --hard_vocab_limit --model_type bpe --pad_id 3 --shuffle_input_sentence true --model_prefix ./sentencepiece.model --byte_fallback=true --input text.txt --input_sentence_size=100000 --num_threads 8
+ wget -O sentencepiece_extractor.py https://raw.githubusercontent.com/huggingface/tokenizers/master/bindings/python/scripts/sentencepiece_extractor.py
+ python sentencepiece_extractor.py --provider sentencepiece --model sentencepiece.model --merges-output-path ./merges.txt --vocab-output-path ./vocab.json
+ 
+ python <<EOF
+ from transformers import AutoTokenizer
+ from tokenizers import SentencePieceBPETokenizer
+ # Rebuild a BPE tokenizer from the extracted vocab/merges and save it as tokenizer.json
+ tokenizer = SentencePieceBPETokenizer.from_file("./vocab.json", "./merges.txt")
+ tokenizer.model.byte_fallback = True
+ tokenizer.model.fuse_unk = True
+ tokenizer.save("./tokenizer.json")
+ htok = AutoTokenizer.from_pretrained("./")
+ htok.padding_side = "right"
+ htok.save_pretrained("./")
+ EOF
+ 
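As a quick sanity check of what convert.sh produces (a sketch, not part of the commit): the snippet below loads the generated tokenizer.json with the tokenizers library and encodes a sample string; characters missing from the BPE vocabulary should surface as <0x..> byte tokens thanks to byte_fallback. The file path and test string are assumptions for illustration.

from tokenizers import Tokenizer

# Load the tokenizer.json written by convert.sh (assumed to exist in the working directory)
tok = Tokenizer.from_file("./tokenizer.json")
# Arbitrary sample containing characters unlikely to be whole tokens in a 32k BPE vocab
enc = tok.encode("byte fallback check: é 中")
print(enc.tokens)  # out-of-vocabulary characters should appear as <0x..> byte tokens
print(enc.ids)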
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
sentencepiece.vocab → sentencepiece.vocab.bak RENAMED
File without changes
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "unk_token": "<unk>"
+ }
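A minimal sketch (not part of the commit) of how the special tokens declared above are exposed once the tokenizer is loaded through transformers; it assumes the files from this commit are in the current directory.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./")
# The strings come straight from special_tokens_map.json
print(tok.bos_token, tok.eos_token, tok.unk_token)  # expected: <s> </s> <unk>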
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "model_max_length": 1000000000000000019884624838656,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }
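The config above implies that encodings get a leading <s> (add_bos_token: true) but no trailing </s> (add_eos_token: false), and that ids 0/1/2 map to <unk>/<s>/</s>. A hedged sketch of checking this, assuming the committed files sit in the current directory and using an arbitrary sample sentence:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./")
ids = tok("hello world")["input_ids"]
print(ids[0] == tok.bos_token_id)   # expected True: BOS is prepended
print(ids[-1] == tok.eos_token_id)  # expected False: no EOS appended
print(tok.convert_ids_to_tokens([0, 1, 2]))  # expected ['<unk>', '<s>', '</s>'] per added_tokens_decoder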
vocab.json ADDED
The diff for this file is too large to render. See raw diff