Bilingual Language Model with BPE Tokenizer
Hey there! This is my cool project where I built a language model that can handle both English and Chinese. It's like teaching a computer to understand and generate text in two totally different languages!
What's in this project?
- BPE Tokenizer: I made a special tool called a Byte Pair Encoding (BPE) tokenizer. It's like a smart dictionary that breaks words into smaller pieces, which helps the model understand language better.
- LSTM Language Model: I used something called an LSTM (Long Short-Term Memory) neural network. It's a fancy way of saying the model can remember important stuff from earlier in a sentence to help predict what comes next.
- Bilingual Support: The coolest part is that this model can work with both English and Chinese! It's like having a brain that can switch between two languages.
How it works
- The BPE tokenizer learns to split words into subwords. For example, "playing" might become "play" + "ing".
- The LSTM model then learns patterns in these subwords to predict what comes next in a sentence.
- When we want to generate text, the model uses what it learned to create new sentences in either English or Chinese.
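To make the BPE step a bit more concrete, here's a tiny sketch of the idea in plain Python. It is not the code from this project, just a minimal illustration: the function names (`learn_bpe`, `merge_pair`, `encode`) and the toy corpus are made up for the example, and it starts from single characters rather than bytes. The same loop works for English letters and Chinese characters, because it only ever looks at which neighboring symbols show up together most often.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word starts as a tuple of single characters, with its count.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into subwords by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus mixing English and Chinese.
corpus = ["playing", "played", "plays", "singing", "walking", "你好", "你们", "好的"]
merges = learn_bpe(corpus, num_merges=10)
print(encode("playing", merges))
print(encode("你好", merges))
```

The exact pieces you get depend on the corpus and on how many merges you allow, so a split like "play" + "ing" only appears once those chunks are frequent enough in the training text.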
What I learned
- How to implement a BPE tokenizer from scratch (it was tricky but fun!)
- Training a neural network on a bilingual dataset
- Evaluating the model with perplexity (a measure of how well the model predicts text; lower is better)
- Generating text in two very different languages
Results
My model achieved a perplexity of 32.83. Roughly speaking, that means that at each step the model is about as unsure as if it were picking evenly among 33 possible tokens, so lower is better. I think that's pretty good for a bilingual model!
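If you're wondering where a number like that comes from, here's a small sketch of the standard perplexity formula: the exponential of the average negative log-probability the model gave to the correct tokens. The `perplexity` function and the example probabilities are made up for illustration; this is not the evaluation code from the project.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities the model assigned to each correct token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that always gives the right token a probability of 1/33
# ends up with a perplexity of about 33.
print(perplexity([1 / 33] * 100))

# A more confident (hypothetical) model scores lower.
print(perplexity([0.5, 0.25, 0.5, 0.25]))
```

In practice the per-token probabilities come from the model's softmax output on held-out text, and the average is usually computed from the cross-entropy loss directly.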
Next steps
I'm thinking of trying out more advanced models like Transformers to see if I can make it even better. Also, I want to experiment with more languages - maybe add Spanish or French to the mix!
Thanks for checking out my project! Feel free to play around with it and let me know what you think.