Jaward 
posted an update 19 days ago
Build your own GPT-4 Tokenizer! - @karpathy's minbpe exercise.
Step 1: BasicTokenizer
Got "close" to beating minbpe's train speed :(
Step 2: RegexTokenizer coming soon.

Notes on lessons learned:
- tokenization is the assembly language of LLMs:)
It's not a healthy choice to code it lol.
- encoding can literally drive you mad.
- merging is where sh*t gets real - moment of truth:)
- training requires precision.
- decoding is trivial.
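
For anyone following along, here is a minimal sketch of the byte-level BPE steps the notes above refer to (training, merging, encoding, decoding). It is not my optimized code and not a copy of minbpe; the function names and signatures are illustrative assumptions, loosely modeled on the exercise.

```python
# Minimal byte-level BPE sketch. Names and signatures are illustrative
# assumptions, not the exact minbpe API.

def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge(ids, pair, idx):
    """Replace every non-overlapping occurrence of `pair` with `idx`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, vocab_size):
    """Learn up to (vocab_size - 256) merges on top of the 256 raw bytes."""
    assert vocab_size >= 256
    ids = list(text.encode("utf-8"))
    merges = {}                                   # (id, id) -> new id
    vocab = {i: bytes([i]) for i in range(256)}   # id -> bytes
    for k in range(vocab_size - 256):
        stats = get_stats(ids)
        if not stats:
            break                                 # nothing left to merge
        pair = max(stats, key=stats.get)          # most frequent pair
        idx = 256 + k
        ids = merge(ids, pair, idx)
        merges[pair] = idx
        vocab[idx] = vocab[pair[0]] + vocab[pair[1]]
    return merges, vocab


def encode(text, merges):
    """Apply learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        # Pick the pair that was merged earliest during training.
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                                 # no mergeable pair remains
        ids = merge(ids, pair, merges[pair])
    return ids


def decode(ids, vocab):
    """Concatenate the bytes of each token and decode as UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

Usage would look like `merges, vocab = train(text, 259)` followed by `decode(encode(text, merges), vocab)`, which should give back the original text.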

Okay, GPT-4o just helped me beat karpathy's minbpe train speed by 1.2x in one shot. I can finally agree that the "o" means "omni" :)

Improvements

  • efficient merging and getstats: got rid of redundant computation in merge and getstats
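
The exact change isn't shown here, so treat the following as one plausible way to cut that kind of redundancy, not the actual optimized code. The idea (an assumption on my part) is to do the merge and the pair counting in a single pass, so the merged sequence doesn't have to be scanned a second time to get stats for the next round. `merge_and_count` is a hypothetical helper name.

```python
# Hypothetical fused helper: merge a pair and count the pairs of the
# resulting sequence in one pass, instead of calling merge and then
# a separate stats function over the new list.

def merge_and_count(ids, pair, idx):
    out = []
    counts = {}
    i = 0
    n = len(ids)
    while i < n:
        # Decide the next output token: the merged id or the original one.
        if i < n - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            tok = idx
            i += 2
        else:
            tok = ids[i]
            i += 1
        # Count the pair formed with the previously emitted token.
        if out:
            p = (out[-1], tok)
            counts[p] = counts.get(p, 0) + 1
        out.append(tok)
    return out, counts
```

The returned `counts` can then be used directly to pick the next most frequent pair in the training loop.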
