hrishbhdalal 
posted an update May 14
I just saw that OpenAI is using an updated tokenizer, and it greatly increases the model's speed and maybe even its performance: if we increase the vocabulary size, the model can predict a single token that would be equivalent to two or three tokens under current tokenizers with 50-60k, or even 100k, entries. I was thinking of scaling this to a one-million-token vocabulary and then pre-training Llama 3 8B with LoRA. I know the model might go to shit, but imo we could greatly increase token generation speed.

And as one of Meta's papers showed, predicting multiple tokens at the same time can actually improve a model's performance, so I can imagine that increasing the vocabulary this way is a form of multi-token generation. Yann LeCun also says that we don't think in tokens but more in representations or abstractions of the situation or problem to be solved.

Can scaling to a one-million or even ten-million vocabulary lead to better and more robust models? Please give me your thoughts on what could go wrong, what could go right, etc.
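For anyone who wants a quick sanity check on the "bigger vocab means fewer tokens" part, here is a minimal sketch using tiktoken to compare the ~100k-entry `cl100k_base` encoding with the ~200k-entry `o200k_base` one (the GPT-4o tokenizer); the sample sentence is just an assumption for illustration:

```python
# Rough check of the "bigger vocab => fewer tokens" claim with tiktoken.
# cl100k_base has ~100k entries; o200k_base (GPT-4o) has ~200k.
import tiktoken

text = (
    "Scaling the tokenizer vocabulary lets a single token stand in for what "
    "used to be two or three tokens, so the same text needs fewer decoding steps."
)

for name in ["cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    print(f"{name}: vocab size = {enc.n_vocab}, tokens used = {len(ids)}")
```

Fewer tokens per sentence is where the speed gain would come from, since each decoding step then covers more text.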

Yeah, I was thinking the same thing. A larger vocabulary does improve the performance of smaller LLMs, and judging by GPT-4o, the same seems to hold for larger ones. Give it a try. I'm only doing this for small models, up to 3B parameters.
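If you do try it, a minimal sketch of the setup with transformers + peft might look like the code below. The repo id and the placeholder tokens are assumptions, and a real million-entry vocabulary would need a tokenizer retrained (BPE/Unigram) on your corpus first; this only shows the mechanics of growing the vocab and training with LoRA.

```python
# Hedged sketch: grow the vocabulary, resize the embedding table, then attach LoRA.
# The added tokens here are placeholders; real merges come from retraining the tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Meta-Llama-3-8B"  # assumed repo id; swap in a smaller model for quick tests

tokenizer = AutoTokenizer.from_pretrained(base_model)
new_tokens = ["for example", "in order to", "machine learning"]  # stand-ins for new vocab entries
tokenizer.add_tokens(new_tokens)

model = AutoModelForCausalLM.from_pretrained(base_model)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows start randomly initialized

# LoRA on the attention projections; the (now larger) embedding matrix and output head
# are trained in full via modules_to_save, since their new rows start from scratch.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

One caveat: with a million new rows, the embedding and output head dominate the parameter count, so "LoRA pre-training" ends up training far more than just the adapters.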
