danielhanchen/open_llama_3b_600bt_preview


danielhanchen committed on May 28, 2023

Commit 18767ae • 1 Parent(s): bc30c71

Update README.md

Files changed (1)

README.md +5 -3


@@ -8,13 +8,15 @@ Original model from https://huggingface.co/openlm-research/open_llama_3b_600bt_p
 This repo includes:
 1) Ported `LlamaTokenizer` to `LlamaTokenizerFast` via a few lines of code.
-   Loading via `AutoTokenizer` takes 3 to 4 minutes. Now, a few seconds!
 ```
 from transformers import LlamaTokenizerFast
 from tokenizers import AddedToken
 tokenizer = LlamaTokenizerFast.from_pretrained(
     "openlm-research/open_llama_3b_600bt_preview",
-    add_bos_token = True, add_eos_token = True,
     bos_token = AddedToken("<s>",   single_word = True),
     eos_token = AddedToken("</s>",  single_word = True),
     unk_token = AddedToken("<unk>", single_word = True),
@@ -22,5 +24,5 @@ tokenizer = LlamaTokenizerFast.from_pretrained(
 )
 tokenizer.push_to_hub("open_llama_3b_600bt_preview")
 ```
-2) `AutoTokenizer` does not recognize the BOS, EOS and UNK tokens. All tokenizations weirdly prepend 0 and append 0 to the end, when actually, you're supposed to prepend 1 and append 2.
 3) Manually added BOS `<s>`, EOS `</s>`, UNK `<unk>` tokens, with PAD (padding) being also the `<unk>` token.

 This repo includes:
 1) Ported `LlamaTokenizer` to `LlamaTokenizerFast` via a few lines of code.
+   Loading via `AutoTokenizer` takes 4 to 5 minutes. Now, a few seconds!
+   Essentially the porting is done via the below code:
 ```
 from transformers import LlamaTokenizerFast
 from tokenizers import AddedToken
 tokenizer = LlamaTokenizerFast.from_pretrained(
     "openlm-research/open_llama_3b_600bt_preview",
+    add_bos_token = True,
+    add_eos_token = False, # Original LLaMA is False -> add </s> during processing.
     bos_token = AddedToken("<s>",   single_word = True),
     eos_token = AddedToken("</s>",  single_word = True),
     unk_token = AddedToken("<unk>", single_word = True),
 )
 tokenizer.push_to_hub("open_llama_3b_600bt_preview")
 ```
+2) `AutoTokenizer` does not recognize the BOS, EOS and UNK tokens. Weirdly `<unk>` ie the 0 token was added instead of the `<s>` or `</s>` token.
 3) Manually added BOS `<s>`, EOS `</s>`, UNK `<unk>` tokens, with PAD (padding) being also the `<unk>` token.
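
The effect of the flag change in this commit can be illustrated without downloading the tokenizer. The sketch below is not the `transformers` implementation: `wrap_ids` and `broken_wrap_ids` are hypothetical helpers, and the sample ids are placeholders. Only the special-token ids (`<unk>` = 0, `<s>` = 1, `</s>` = 2) are taken from the README text in the diff.

```python
# Special-token ids as described in the README diff: <unk> = 0, <s> = 1, </s> = 2.
UNK_ID, BOS_ID, EOS_ID = 0, 1, 2


def wrap_ids(ids, add_bos_token=True, add_eos_token=False):
    """Hypothetical helper: mimic how the two flags frame a token-id sequence."""
    out = list(ids)
    if add_bos_token:
        out = [BOS_ID] + out
    if add_eos_token:
        out = out + [EOS_ID]
    return out


def broken_wrap_ids(ids):
    """Faulty behaviour reported in the README: <unk> (0) added on both ends."""
    return [UNK_ID] + list(ids) + [UNK_ID]


sample = [306, 4658, 278]  # placeholder ids, not real tokenizer output

# New setting in this commit: BOS only.
new_default = wrap_ids(sample)
# Previous setting: BOS and EOS both added by the tokenizer.
old_setting = wrap_ids(sample, add_eos_token=True)
# With the new setting, </s> is appended by the caller during processing.
with_manual_eos = new_default + [EOS_ID]

print(new_default, old_setting, with_manual_eos, broken_wrap_ids(sample))
```

With `add_eos_token = False`, the caller decides where `</s>` goes, which matters when several documents are packed into one training sequence or when a prompt should stay open for generation.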