danielhanchen committed
Commit 99334dc (1 parent: c7f05cb)
Update README.md

README.md
@@ -12,12 +12,14 @@ This repo includes:
 ```
 from transformers import LlamaTokenizerFast
 from tokenizers import AddedToken
-tokenizer = LlamaTokenizerFast.from_pretrained(
-
-
-
-
-
+tokenizer = LlamaTokenizerFast.from_pretrained(
+    "openlm-research/open_llama_3b_600bt_preview",
+    add_bos_token = True, add_eos_token = True,
+    bos_token = AddedToken("<s>", single_word = True),
+    eos_token = AddedToken("</s>", single_word = True),
+    unk_token = AddedToken("<unk>", single_word = True),
+    pad_token = AddedToken("<unk>", single_word = True)
+)
 ```
 2) `AutoTokenizer` does not recognize the BOS, EOS and UNK tokens. All tokenizations weirdly prepend 0 and append 0 to the end, when actually, you're supposed to prepend 1 and append 2.
 3) Manually added BOS `<s>`, EOS `</s>`, UNK `<unk>` tokens, with PAD (padding) being also the `<unk>` token.
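
Item 2 of the README says each tokenized sequence should start with id 1 and end with id 2, not 0 and 0. Here is a minimal sketch of that check in plain Python. The helper name `has_expected_special_ids` and the sample id lists are hypothetical and not part of the repo. The ids 1 and 2 are taken from the README text.

```python
from typing import Sequence

# Ids taken from the README: sequences should be wrapped as 1 ... 2.
EXPECTED_BOS_ID = 1
EXPECTED_EOS_ID = 2


def has_expected_special_ids(
    input_ids: Sequence[int],
    bos_id: int = EXPECTED_BOS_ID,
    eos_id: int = EXPECTED_EOS_ID,
) -> bool:
    """Return True if the sequence starts with bos_id and ends with eos_id.

    Hypothetical helper for illustration; it only inspects the first and
    last ids and says nothing about the tokens in between.
    """
    if len(input_ids) < 2:
        return False
    return input_ids[0] == bos_id and input_ids[-1] == eos_id


# Made-up id lists standing in for real tokenizer output.
good_ids = [1, 100, 200, 2]  # wrapped as the README says it should be
bad_ids = [0, 100, 200, 0]   # the 0 ... 0 wrapping the README reports

print(has_expected_special_ids(good_ids))
print(has_expected_special_ids(bad_ids))
```

With the tokenizer built in the diff above, this check could presumably be run as `has_expected_special_ids(tokenizer("some text")["input_ids"])`. That requires downloading the model files and has not been verified here.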