
Too many junk vocab words in the vocab.json.

#28
by MartialTerran - opened

The vocab used in this LLM is full of useless junk words, including many Chinese characters. https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/raw/main/tokenizer.json
Using a proper vocabulary and tokenizer would greatly improve the performance and energy efficiency of any resulting model.

Totally right. You say something to the model and it analyzes the sentence word by word.

I don't think those are "junk words". Those are word pieces which are used to make up actual words. If you look at GPT-2's tokenizer, for example, you'll see much of the same.

I visually inspected part of the 50,000+ token vocab being used here and elsewhere, and I saw useless junk BPE fragments that will obviously not be relevant to any kind of "TinyStories" dataset. By junk I mean "word pieces" of things that exist somewhere on the internet but are not parts of any real words commonly used in English sentences. And there are only about 40,000 unique words of any kind (including misspellings) used in the TinyStoriesV2 dataset, so at least 10,000 tokens in the 50,000-token BPE vocab are unnecessary (junk) for TinyStoriesV2 purposes. The Microsoft team also did this for no good reason. Andrej Karpathy has trained models on the TinyStories dataset, and he saw the value of producing a dataset-specific vocab_size=2048 vocab file. He built a 2048-token BPE vocab based on TinyStories:
These stages are designed to be run in order.

To tokenize data with the Llama 2 tokenizer:
python tinystories.py download
python tinystories.py pretokenize

To tokenize data with a custom tokenizer we train ourselves with sentencepiece, e.g.:
python tinystories.py download
python tinystories.py train_vocab --vocab_size=2048
python tinystories.py pretokenize --vocab_size=2048

https://github.com/karpathy/llama2.c/blob/master/tinystories.py
https://github.com/karpathy/llama2.c
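
For anyone who wants to try the same idea without going through the llama2.c scripts, here is a minimal sketch using the sentencepiece Python package directly. The corpus path and model prefix are placeholders I made up, not files shipped with the repo:

# Train a small, dataset-specific BPE vocab with sentencepiece.
# tinystories_train.txt is a hypothetical plain-text dump of the dataset.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="tinystories_train.txt",   # one story (or line of text) per line
    model_prefix="tok2048",          # writes tok2048.model and tok2048.vocab
    vocab_size=2048,                 # dataset-specific size, as in Karpathy's example
    model_type="bpe",                # byte-pair-encoding merges
    character_coverage=1.0,          # keep every character seen in the English corpus
)

# Load the trained model and see how it splits a sample sentence.
sp = spm.SentencePieceProcessor(model_file="tok2048.model")
print(sp.encode("Once upon a time there was a little girl.", out_type=str))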

Thus, he was able to build a TinyStories model using only 2,048 tokens rather than the 50,000+ token BPE vocabulary used in GPT-2.

See also the 27-token model discussed in the influential "The Illustrated GPT-2" post at https://jalammar.github.io/illustrated-gpt2/

Okay, but this model is not designed for TinyStories; it is designed to be completely general purpose for both natural language and code, with the capability to interpret other languages as well (you may have noticed non-Latin characters in the vocab, in addition to words clearly from other languages).

I think you may have been confused by "Tiny" in the model name. This model has ZERO relation to TinyStories or any other "Tiny" datasets.

Yes, I was just now confused by the "Tiny" misnomer (as in TinyStories). But the BPE vocab employed in TinyLlama-1.1B-Chat-v1.0 is still full of unnecessary junk tokens that bloat the model and unnecessarily consume electrical power (at the input and output tokenization stages). For example, based on another quick visual inspection, there is no logical need to have three or more dedicated tokens for "/", "//", "///", and "////", much less for "\,", "▁\"\\": 6634, "▁`\\": 6714, "▁[`", and ":(":

  "\\\"": 5931,

...
"\,": 5940,

  "▁[`": 5913,
  "Char": 5914,
  "▁Cap": 5915,
  "boldsymbol": 5916,
  "▁Roman": 5917,
  "▁Max": 5918,
  ":(": 5919,

  ">(": 5961,

  "▁['": 6024,

  ");\r": 6075,

  "▁$$\\": 6118,

"▁ž": 6145,

  ":\"": 6160,

  "////": 6165,

  "}:": 6177,

  "{{": 6224,

"(&": 6243,

  "}^{\\": 6292,
  "you": 6293,
  "/.": 6294,

  "▁..": 6317,

  "▁<?": 6319,

  "}_{\\": 6353,

  "{\"": 6377,

  "▁${": 6435,

  "}]": 6525,

  "#,": 6552,

  "▁š": 6569,
  "▁OS": 6570,
  "▁{}": 6571,

  "▁%>": 6580,

  "^*": 6622,

  "▁\"\\": 6634,

  ",-": 6653,

  "///": 6658,
  "▁feb": 6659,
  "▁-->": 6660,

  "?”": 6677,

  "▁†": 6700,

  "▁`\\": 6714,

  "▁>=": 6736,

  ">>": 6778,

  "▁[\"": 6796,

The BPE vocabs used in most LLMs are full of such useless garbage undeserving of a dedicated token.
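
If anyone wants to inspect these entries programmatically instead of eyeballing the raw tokenizer.json, a rough sketch like this should do it (assuming the Hugging Face transformers package; the id range is just the run quoted above):

# Load the TinyLlama tokenizer and print a slice of its vocab by token id.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
vocab = tok.get_vocab()                   # dict of token string -> token id
by_id = {i: t for t, i in vocab.items()}  # invert to token id -> token string

print("vocab size:", len(vocab))
for i in range(5913, 5920):               # the "▁[`" ... ":(" run quoted above
    print(i, repr(by_id[i]))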

Given the emphasis on "coding" in this small model, if there are special sequences necessary for syntax (such as indents of 4 spaces, 8 spaces, etc., as in Python, or in HTML pages), then those special sequences would be worthy of a dedicated token and special handling in the vocab and tokenizer of the model. See the "tokenizer" video by Andrej Karpathy: https://www.youtube.com/watch?v=zduSFxRajkE
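
As a quick, illustrative check of how the current tokenizer actually handles indented code (again assuming the transformers package; I am not claiming any particular output here):

# Count how many tokens a small indented Python snippet consumes, to see
# whether runs of spaces are merged into single tokens or emitted one by one.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
snippet = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
ids = tok.encode(snippet, add_special_tokens=False)
print(len(ids), "tokens")
print(tok.convert_ids_to_tokens(ids))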

boldsymbol is likely referring to the LaTeX \boldsymbol command.

▁--> is referring to the end of an HTML comment (where ▁ means prefixed by a space).

▁${ is probably the JS template string specifier, again prefixed by a space.

▁[\", }_{\\, }^{\\, and ▁$$\\ are likely more LaTeX and code fragments.

Anything prefixed by ▁ means the token starts with a space rather than existing as a word piece, which explains the existence of ▁feb, ▁Max, ▁Roman, and ▁Cap. This is the opposite of BERT-style tokenizers, where everything is prefixed with a space by default and the prefix ## is used to denote word pieces.

Char is probably a datatype which isn't prefixed by a space because it may usually be prefixed by a tab character in the source code it's from.

?” is probably due to the existence of question marks in book dialogue, hence the use of curly quotes rather than straight quotes.

>> is probably the bitshift operator in many languages (including prevalent use for streams in C++).

▁>= is math notation.

I could go on...

You're probably seeing a pattern here: common symbol sequences end up getting folded into single tokens to reduce the context size needed to represent common strings of characters in the source training data. And considering TinyLlama was trained on a 7-to-3 ratio of language to code, it makes sense that a lot of the 'odd' tokens in the vocab are common strings used in programming.
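
To see that folding in action, a small sketch like this (the strings are just examples I picked, assuming the transformers package) prints how some code-flavoured snippets break into tokens:

# Tokenize a few code-like strings and show which symbol runs get merged.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
for s in ["<!-- note -->", "std::cin >> x;", "$$\\alpha^{2}$$", "`${name}`"]:
    print(s, "->", tok.tokenize(s))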

I appreciate the effort and insights you have put into your response. I am glad to learn that many of these weird BPE tokens may serve a useful purpose somewhere in coding space or HTML, where they would help reduce the token count going into a GPT model.

I knew that the underbar ▁ means the token begins with a space.

I did not know that in BERT-style tokenizers everything is prefixed with a space by default, and that the prefix ## is used to denote word pieces.
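
For comparison, here is a tiny side-by-side sketch of the two conventions (assuming the transformers package and the standard bert-base-uncased checkpoint):

# SentencePiece/Llama style: word-initial pieces carry a ▁ prefix.
# BERT WordPiece style: continuation pieces carry a ## prefix instead.
from transformers import AutoTokenizer

llama_tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "tokenization matters"
print(llama_tok.tokenize(text))
print(bert_tok.tokenize(text))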

I know of a person on Hugging Face who built an experimental tinystoriesv2_train.txt model that uses explicit "space" tokens and uses the Shift key as a pre-token indicating that the next word's first letter is capitalized. This method costs more input tokens, but the model basically works as expected. https://huggingface.co/Corianas/Microllama_Char_100k_step
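
Just to illustrate that scheme (this is only a toy sketch of the idea as I understand it, not the actual Microllama_Char_100k_step code):

# Toy encoder: explicit <space> tokens between words and a <shift> pre-token
# marking that the next word's first letter was capitalized.
def encode_with_shift(text: str) -> list[str]:
    tokens = []
    for i, word in enumerate(text.split(" ")):
        if i > 0:
            tokens.append("<space>")
        if word[:1].isupper():
            tokens.append("<shift>")
            word = word[0].lower() + word[1:]
        tokens.extend(list(word))            # character-level pieces
    return tokens

print(encode_with_shift("Once upon a Time"))
# -> ['<shift>', 'o', 'n', 'c', 'e', '<space>', 'u', 'p', ...]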

As for your review of "▁[`": 5913, through ":(": 5919,
I only copy-pasted that whole series because of the tokens at its beginning and end; I was not really interested in, or objecting to, the in-between tokens. But I appreciate your consideration and explanation of them.
