dpfried committed on
Commit
01f4604
1 Parent(s): 9a7562b

Update README.md

Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -39,14 +39,16 @@ See [https://github.com/dpfried/incoder](https://github.com/dpfried/incoder) for
  `model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")`
 
  ### Tokenizer
- `tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")`.
+ `tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")`
 
- Note: the incoder-1B and incoder-6B tokenizers are identical, so 'facebook/incoder-6B' could also be used.
+ (Note: the incoder-1B and incoder-6B tokenizers are identical, so 'facebook/incoder-6B' could also be used.)
 
- When calling `tokenizer.decode`, it's important to pass `clean_up_tokenization_spaces=False` to avoid removing spaces after punctuation:
+ When calling `tokenizer.decode`, it's important to pass `clean_up_tokenization_spaces=False` to avoid removing spaces after punctuation. For example:
 
  `tokenizer.decode(tokenizer.encode("from ."), clean_up_tokenization_spaces=False)`
 
+ (Note: encoding prepends the `<|endoftext|>` token, as this marks the start of a document to our model. This token can be removed from the decoded output by passing `skip_special_tokens=True` to `tokenizer.decode`.)
+
  ## License
 
  CC-BY-NC 4.0
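
The two decoding notes in this change can be combined into one small helper. The following is only a sketch, not part of the README: `decode_verbatim` and `demo` are hypothetical names introduced here, and `demo` assumes the `transformers` package is installed and that the `facebook/incoder-1B` tokenizer can be downloaded.

```python
def decode_verbatim(tokenizer, token_ids, keep_special_tokens=False):
    """Decode token ids without the whitespace cleanup step.

    Hypothetical helper: it only forwards the two keyword arguments
    discussed in the README to `tokenizer.decode`.
    """
    return tokenizer.decode(
        token_ids,
        # Keep the text exactly as tokenized instead of letting the
        # cleanup pass rewrite whitespace around punctuation.
        clean_up_tokenization_spaces=False,
        # Drop special tokens such as <|endoftext|> unless asked to keep them.
        skip_special_tokens=not keep_special_tokens,
    )


def demo():
    """Round-trip the README's example string (needs network access)."""
    # Imported lazily so the helper above does not depend on transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
    token_ids = tokenizer.encode("from .")

    # With special tokens kept, the README says <|endoftext|> is prepended.
    print(repr(decode_verbatim(tokenizer, token_ids, keep_special_tokens=True)))
    # With special tokens skipped, only the original text should remain.
    print(repr(decode_verbatim(tokenizer, token_ids)))
```

Calling `demo()` prints both decodings, so the effect of `skip_special_tokens` can be compared directly on the `"from ."` example.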