
# LSTM Language Model with BPE Tokenization

## Project Overview

This project implements a bidirectional LSTM-based language model designed to predict the next word in a sequence, using Byte-Pair Encoding (BPE) for tokenization. The model is trained on a dataset containing English and Lithuanian text for next-token prediction and text generation. The model's parameters are optimized via backpropagation, and checkpoints are saved periodically for recovery and evaluation.

Perplexity improves substantially during training, starting above 32,000 and falling to approximately 1.23 by the end. This indicates that the model has learned to predict the next token accurately on this dataset and is well suited to language-modeling tasks.

## Key Features

- **Bidirectional LSTM:** a stacked two-layer bidirectional LSTM captures both past and future dependencies in the text.
- **Byte-Pair Encoding (BPE) Tokenization:** BPE is applied to tokenize both English and Lithuanian text efficiently.
- **Training and Validation Loss Tracking:** both training and validation loss are monitored to help prevent overfitting and ensure the model generalizes well.
- **Checkpoint Saving:** model checkpoints are saved at regular intervals, and the best-performing model (by validation loss) is stored separately.
- **Google Drive Integration:** checkpoints and final models are saved directly to Google Drive for persistent storage.
- **Perplexity Monitoring:** perplexity is tracked and visualized to show how the model improves as training progresses.

## Dataset

The model was trained on a combined dataset of English and Lithuanian text, tokenized with a Byte-Level BPE tokenizer. 90% of the dataset is used for training and the remaining 10% for validation.
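The tokenizer training code is not included in this card; the sketch below shows one way the byte-level BPE tokenizer and the 90/10 split could be produced with the Hugging Face `tokenizers` library. The corpus path, vocabulary size, special tokens, and split method are illustrative assumptions, not the project's exact settings.

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the combined English/Lithuanian corpus.
# File name, vocab size, and special tokens are illustrative assumptions.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/english_lithuanian.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"],
)
os.makedirs("tokenizer", exist_ok=True)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt

# Encode the corpus and split the token IDs 90/10 into train and validation sets.
with open("data/english_lithuanian.txt", encoding="utf-8") as f:
    ids = tokenizer.encode(f.read()).ids
split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]
```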

## Model Architecture

The architecture consists of:

- **Embedding Layer:** converts input tokens into dense embeddings.
- **LSTM Layers:** two bidirectional LSTM layers capture context from both the forward and backward directions.
- **Layer Normalization and Dropout:** improve training stability and prevent overfitting.
- **Fully Connected Layer:** outputs logits over the vocabulary for predicting the next token.

## Training Process

The model is trained for 1500 epochs using the AdamW optimizer with learning-rate scheduling via OneCycleLR; a minimal sketch of the architecture and optimizer setup is shown below.
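The exact hyperparameters are not stated in this card. The following PyTorch sketch shows one plausible implementation of the described stack (embedding → two bidirectional LSTM layers → LayerNorm + dropout → vocabulary projection) together with the AdamW/OneCycleLR setup. The embedding size, hidden size, dropout rate, learning rate, and one-scheduler-step-per-epoch assumption are illustrative, not the original values.

```python
import torch
import torch.nn as nn

class BiLSTMLanguageModel(nn.Module):
    """Sketch of the architecture described above; hyperparameters are assumptions."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True,
                            dropout=dropout)
        self.norm = nn.LayerNorm(2 * hidden_dim)   # 2x for the two directions
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, x):
        emb = self.embedding(x)            # (batch, seq, embed_dim)
        out, _ = self.lstm(emb)            # (batch, seq, 2 * hidden_dim)
        out = self.dropout(self.norm(out))
        return self.fc(out)                # logits over the vocabulary

model = BiLSTMLanguageModel(vocab_size=30_000)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-4, total_steps=1500)  # assumes one scheduler step per epoch
```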

During training:

- **Training and Validation Loss:** cross-entropy loss is computed on both the training and validation sets at regular intervals.
- **Perplexity:** perplexity on the validation set is evaluated to measure next-token prediction quality.
- **Checkpointing:** model checkpoints are saved at every evaluation interval, and the best model based on validation loss is saved separately (a sketch of this evaluation-and-checkpoint step follows the list).
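Below is a hedged sketch of the evaluation-and-checkpointing step, building on the model and optimizer sketch above. It assumes training runs in Google Colab with Drive mounted for the Google Drive integration; the checkpoint directory and helper names are hypothetical.

```python
import math
import os

import torch
import torch.nn as nn
from google.colab import drive  # assumption: training runs in Google Colab

drive.mount("/content/drive")
CKPT_DIR = "/content/drive/MyDrive/lstm_checkpoints"  # illustrative path
os.makedirs(CKPT_DIR, exist_ok=True)

criterion = nn.CrossEntropyLoss()
best_val_loss = float("inf")

@torch.no_grad()
def evaluate(model, val_inputs, val_targets):
    """Mean cross-entropy on the validation data; perplexity is exp(loss)."""
    model.eval()
    logits = model(val_inputs)
    loss = criterion(logits.view(-1, logits.size(-1)), val_targets.view(-1))
    model.train()
    return loss.item()

def save_checkpoint(model, optimizer, epoch, val_loss):
    """Save a regular checkpoint and, if improved, the best model so far."""
    global best_val_loss
    state = {"epoch": epoch, "model": model.state_dict(),
             "optimizer": optimizer.state_dict(), "val_loss": val_loss}
    torch.save(state, os.path.join(CKPT_DIR, f"checkpoint_{epoch}.pt"))
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(state, os.path.join(CKPT_DIR, "best_model.pt"))

# At each evaluation interval inside the training loop:
# val_loss = evaluate(model, val_inputs, val_targets)
# perplexity = math.exp(val_loss)
# save_checkpoint(model, optimizer, epoch, val_loss)
```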

## Results

Both training and validation loss decrease markedly over the course of training. Perplexity, a measure of how well the model predicts the next word in a sequence, drops from over 32,000 at the start to about 1.23 at the end, indicating that the model has learned to predict the next word accurately on this dataset.

## Perplexity Results

| Epoch | Train Loss | Val Loss | Perplexity |
|------:|-----------:|---------:|-----------:|
| 0 | 10.3970 | 10.3764 | 32093.4629 |
| 300 | 1.4850 | 1.1671 | 3.2126 |
| 600 | 0.2610 | 0.2571 | 1.2932 |
| 900 | 0.2240 | 0.2210 | 1.2473 |
| 1200 | 0.2152 | 0.2092 | 1.2327 |
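The Perplexity column is consistent with perplexity being the exponential of the mean validation cross-entropy (in nats), which can be checked directly:

```python
import math

# Perplexity = exp(validation cross-entropy); these match the table above.
print(math.exp(10.3764))  # ≈ 32093   (epoch 0)
print(math.exp(0.2092))   # ≈ 1.2327  (epoch 1200)
```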