danielhanchen committed on
Commit
99334dc
1 Parent(s): c7f05cb

Update README.md

Files changed (1)
  1. README.md +8 -6
README.md CHANGED
@@ -12,12 +12,14 @@ This repo includes:
  ```
  from transformers import LlamaTokenizerFast
  from tokenizers import AddedToken
- tokenizer = LlamaTokenizerFast.from_pretrained("openlm-research/open_llama_3b_600bt_preview",
- add_bos_token = True, add_eos_token = True,
- bos_token = AddedToken("<s>", single_word = True),
- eos_token = AddedToken("</s>", single_word = True),
- unk_token = AddedToken("<unk>", single_word = True),
- pad_token = AddedToken("<unk>", single_word = True))
+ tokenizer = LlamaTokenizerFast.from_pretrained(
+ "openlm-research/open_llama_3b_600bt_preview",
+ add_bos_token = True, add_eos_token = True,
+ bos_token = AddedToken("<s>", single_word = True),
+ eos_token = AddedToken("</s>", single_word = True),
+ unk_token = AddedToken("<unk>", single_word = True),
+ pad_token = AddedToken("<unk>", single_word = True)
+ )
  ```
  2) `AutoTokenizer` does not recognize the BOS, EOS and UNK tokens. All tokenizations weirdly prepend 0 and append 0 to the end, when actually, you're supposed to prepend 1 and append 2.
  3) Manually added BOS `<s>`, EOS `</s>`, UNK `<unk>` tokens, with PAD (padding) being also the `<unk>` token.
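
The mismatch described in item 2 can be checked directly on the token IDs. The sketch below assumes, as the README text states, that BOS should have ID 1 and EOS should have ID 2. `check_special_tokens` is a hypothetical helper written for illustration, not part of `transformers`, and the commented usage needs network access to download the tokenizer.

```python
from typing import Sequence

# IDs taken from the README text: BOS should be 1 and EOS should be 2.
BOS_ID = 1
EOS_ID = 2


def check_special_tokens(input_ids: Sequence[int],
                         bos_id: int = BOS_ID,
                         eos_id: int = EOS_ID) -> bool:
    """Return True if the sequence starts with BOS and ends with EOS."""
    if len(input_ids) < 2:
        # Too short to hold both a BOS and an EOS token.
        return False
    return input_ids[0] == bos_id and input_ids[-1] == eos_id


# Usage with the tokenizer built in the README snippet (requires a download):
#   ids = tokenizer("Hello world")["input_ids"]
#   assert check_special_tokens(ids)
```

A sequence that begins and ends with 0, which is the behaviour the README reports for `AutoTokenizer`, would fail this check.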