danielhanchen committed on
Commit 18767ae
1 Parent(s): bc30c71

Update README.md

Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -8,13 +8,15 @@ Original model from https://huggingface.co/openlm-research/open_llama_3b_600bt_p
 
 This repo includes:
 1) Ported `LlamaTokenizer` to `LlamaTokenizerFast` via a few lines of code.
-Loading via `AutoTokenizer` takes 3 to 4 minutes. Now, a few seconds!
+Loading via `AutoTokenizer` takes 4 to 5 minutes. Now, a few seconds!
+Essentially the porting is done via the below code:
 ```
 from transformers import LlamaTokenizerFast
 from tokenizers import AddedToken
 tokenizer = LlamaTokenizerFast.from_pretrained(
 "openlm-research/open_llama_3b_600bt_preview",
-add_bos_token = True, add_eos_token = True,
+add_bos_token = True,
+add_eos_token = False, # Original LLaMA is False -> add </s> during processing.
 bos_token = AddedToken("<s>", single_word = True),
 eos_token = AddedToken("</s>", single_word = True),
 unk_token = AddedToken("<unk>", single_word = True),
@@ -22,5 +24,5 @@ tokenizer = LlamaTokenizerFast.from_pretrained(
 )
 tokenizer.push_to_hub("open_llama_3b_600bt_preview")
 ```
-2) `AutoTokenizer` does not recognize the BOS, EOS and UNK tokens. All tokenizations weirdly prepend 0 and append 0 to the end, when actually, you're supposed to prepend 1 and append 2.
+2) `AutoTokenizer` does not recognize the BOS, EOS and UNK tokens. Weirdly `<unk>`, i.e. token 0, was added instead of the `<s>` or `</s>` token.
 3) Manually added BOS `<s>`, EOS `</s>`, UNK `<unk>` tokens, with PAD (padding) being also the `<unk>` token.
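
As a quick way to see what the corrected special tokens imply, here is a minimal sanity-check sketch. It assumes the ported tokenizer loads as `LlamaTokenizerFast` and keeps the original SentencePiece IDs (`<unk>` = 0, `<s>` = 1, `</s>` = 2); the Hub repo id below is a placeholder for this repo, not a confirmed path.

```python
# Sanity-check sketch (not part of the commit): verify that the ported fast
# tokenizer exposes the expected special-token IDs and prepends BOS correctly.
from transformers import LlamaTokenizerFast

# NOTE: placeholder repo id; substitute the actual Hub path of this repo.
tokenizer = LlamaTokenizerFast.from_pretrained("danielhanchen/open_llama_3b_600bt_preview")

# Expected IDs from the original SentencePiece vocab: <unk>=0, <s>=1, </s>=2.
print(tokenizer.unk_token_id, tokenizer.bos_token_id, tokenizer.eos_token_id)

# PAD is mapped to <unk>, so pad_token_id should also be 0.
print(tokenizer.pad_token_id)

# With add_bos_token=True and add_eos_token=False, the encoding should start
# with 1 (<s>) and not end with 2; </s> is meant to be appended during processing.
print(tokenizer("Hello world").input_ids)
```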