pszemraj committed
Commit affb7f0
1 Parent(s): 380ff73

Update README.md

Files changed (1)
  1. README.md +16 -3
README.md CHANGED
@@ -27,8 +27,8 @@ print(f"Tokens:\n\t{output.input_ids}")
 
 ## Notes
 
-- the default tokenizer (on branch `main`) has a vocab size of 32100
-
+1. the default tokenizer (on branch `main`) has a vocab size of 32100.
+   - use a model vocab size of 32128, since GPUs handle dimensions that are multiples of 64 more efficiently
 
 
 <details>
@@ -75,4 +75,17 @@ print(f"Tokens:\n\t{output.input_ids}")
 >>> print(f"Offsets: {offsets}")
 Offsets: [(0, 4), (4, 12), (12, 17), (17, 24), (24, 35), (35, 42), (42, 47), (47, 51), (51, 55), (55, 58), (58, 69), (69, 73), (73, 79), (79, 84), (84, 90), (90, 97), (97, 101), (101, 106), (106, 113), (113, 121), (121, 125), (125, 130), (130, 136), (136, 145), (145, 150), (150, 154), (154, 161), (161, 165), (165, 176), (176, 179), (179, 184), (184, 189), (189, 199), (199, 206), (206, 214)]
 ```
-
+
+If you compare this to the output of [the llama tokenizer](https://huggingface.co/fxmarty/tiny-llama-fast-tokenizer) (below), you can quickly see which is better suited to beekeeping-related language modeling.
+
+```
+>>> print(f"Test string: {test_string}")
+Test string: When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination.
+>>> # Tokens
+>>> tokens = tokenizer.convert_ids_to_tokens(output['input_ids'])
+>>> print(f"Tokens: {tokens}")
+Tokens: ['<s>', '▁When', '▁dealing', '▁with', '▁Var', 'ro', 'a', '▁destruct', 'or', '▁mit', 'es', ',', '▁it', "'", 's', '▁cru', 'cial', '▁to', '▁admin', 'ister', '▁the', '▁right', '▁ac', 'ar', 'ic', 'ides', '▁during', '▁the', '▁late', '▁aut', 'umn', '▁months', ',', '▁but', '▁only', '▁after', '▁ens', 'uring', '▁that', '▁the', '▁worker', '▁be', 'e', '▁population', '▁is', '▁free', '▁from', '▁p', 'estic', 'ide', '▁cont', 'am', 'ination', '.']
+>>> offsets = output['offset_mapping']
+>>> print(f"Offsets: {offsets}")
+Offsets: [(0, 0), (0, 4), (4, 12), (12, 17), (17, 21), (21, 23), (23, 24), (24, 33), (33, 35), (35, 39), (39, 41), (41, 42), (42, 45), (45, 46), (46, 47), (47, 51), (51, 55), (55, 58), (58, 64), (64, 69), (69, 73), (73, 79), (79, 82), (82, 84), (84, 86), (86, 90), (90, 97), (97, 101), (101, 106), (106, 110), (110, 113), (113, 120), (120, 121), (121, 125), (125, 130), (130, 136), (136, 140), (140, 145), (145, 150), (150, 154), (154, 161), (161, 164), (164, 165), (165, 176), (176, 179), (179, 184), (184, 189), (189, 191), (191, 196), (196, 199), (199, 204), (204, 206), (206, 213), (213, 214)]
+```
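
As context for the vocab-size note added above, here is a minimal sketch of pairing the tokenizer with a model whose embedding matrix is padded to 32128. The tokenizer repo id below is a placeholder (the actual id isn't shown in this diff), and `google/t5-v1_1-small` is only an example of a T5-style model, not something named by the commit.

```
# Minimal sketch, assuming the `transformers` library is installed.
# "pszemraj/<this-tokenizer-repo>" is a hypothetical placeholder id.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszemraj/<this-tokenizer-repo>")  # placeholder id
model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-v1_1-small")        # example T5-style model

print(len(tokenizer))           # tokenizer vocab: 32100 on branch `main`
print(model.config.vocab_size)  # model-side vocab size

# Pad the embedding matrix to the GPU-friendly size from the note
# (32128 is a multiple of 64) rather than to len(tokenizer):
model.resize_token_embeddings(32128)
```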
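And a small sketch of how the comparison above could be reproduced end to end: only `fxmarty/tiny-llama-fast-tokenizer` and the test string are taken from the diff, the first repo id is again a placeholder, and `return_offsets_mapping=True` assumes a fast (Rust-backed) tokenizer.

```
# Sketch: tokenize the same test string with both tokenizers and compare
# token counts and offset mappings. The first repo id is a placeholder.
from transformers import AutoTokenizer

test_string = (
    "When dealing with Varroa destructor mites, it's crucial to administer "
    "the right acaricides during the late autumn months, but only after "
    "ensuring that the worker bee population is free from pesticide contamination."
)

for repo_id in ("pszemraj/<this-tokenizer-repo>", "fxmarty/tiny-llama-fast-tokenizer"):
    tok = AutoTokenizer.from_pretrained(repo_id)
    output = tok(test_string, return_offsets_mapping=True)  # needs a fast tokenizer
    tokens = tok.convert_ids_to_tokens(output["input_ids"])
    print(f"{repo_id}: {len(tokens)} tokens")
    print(f"Offsets: {output['offset_mapping']}")
```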