# NLP Model Training with Simple RNN

## Overview

This repository contains the implementation of a simple Recurrent Neural Network (RNN) model for natural language processing, trained on a cleaned corpus combining English and Aymara data. The model generates text from input sequences, demonstrating basic language-modeling capability.

## Features

- RNN Architecture: A straightforward RNN model for text generation.
- Custom Tokenizer: Uses the GPT-2 tokenizer for encoding and decoding text (see the sketch after this list).
- Data Processing: Combines and cleans datasets from multiple sources to create a robust training corpus.
- Checkpointing: Regularly saves model checkpoints during training for easy recovery.
- Performance Evaluation: Tracks training and validation losses and calculates perplexity scores.
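As a quick reference, here is a minimal sketch of encoding and decoding with the GPT-2 tokenizer via the `transformers` library (these are the library's standard calls; the example string is arbitrary):

```python
from transformers import GPT2Tokenizer

# Load the standard pretrained GPT-2 byte-pair-encoding tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode text to integer token IDs, then decode the IDs back to text
ids = tokenizer.encode('Once upon a time')
print(ids)                    # a short list of integer token IDs
print(tokenizer.decode(ids))  # "Once upon a time"
```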

## Getting Started

### Installation

Make sure to install the required libraries by running:

```bash
pip install torch torchvision transformers pandas matplotlib huggingface-hub
```

### Data Preparation
The datasets used in this project are listed below, with a sketch of combining them after the list:

- English Dataset: the Alpaca Cleaned dataset.
- Aymara Dataset: Aymara text data in JSON format, stored in Google Drive.
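A minimal sketch of combining the two sources into one training corpus (the file paths and JSON field names are assumptions, not the repository's actual layout):

```python
import json
import pandas as pd

# Paths and field names below are illustrative assumptions
alpaca = pd.read_json('alpaca_cleaned.json')  # English instruction data
with open('aymara_corpus.json', encoding='utf-8') as f:
    aymara = json.load(f)                     # list of Aymara text records

# Flatten both sources into one list of raw text strings
texts = alpaca['output'].dropna().tolist()
texts += [record['text'] for record in aymara]

# Basic cleaning: strip whitespace and drop empty entries
corpus = '\n'.join(t.strip() for t in texts if t.strip())
print(f'Corpus size: {len(corpus):,} characters')
```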

### Training the Model
To train the model, simply run the main training script. The model's hyperparameters can be modified in the script as needed.

```bash
# Start training
python train.py  # Adjust to your script's name
```

## Generated Text

After training, you can generate text with the trained model. Here's an example of generating a continuation from a context string:

```python
context = "Once upon a time"
generated_text = model.generate(context, max_new_tokens=200)
print(generated_text)
```
## Performance Metrics

- Training Loss: logged every 300 iterations.
- Validation Loss: also logged every 300 iterations.
- Perplexity: calculated from the validation loss to assess model performance, as shown in the sketch after this list.
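Since perplexity is the exponential of the mean cross-entropy loss, it can be computed directly from the logged validation loss. A minimal sketch (the loss value below is illustrative, not a reported result):

```python
import math

# Perplexity = exp(mean cross-entropy loss) on the validation set
val_loss = 3.2  # illustrative value; substitute your logged validation loss
perplexity = math.exp(val_loss)
print(f'Validation perplexity: {perplexity:.2f}')  # 24.53 for val_loss = 3.2
```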
## Results Visualization

Training and validation losses are plotted to visualize how training progresses.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Read the logged losses; the file name and column names below are illustrative
loss_data = pd.read_csv('losses.csv')  # columns: epoch/step, training_loss, val_loss

# Plot both losses against a log-scaled iteration axis
plt.plot(loss_data['epoch/step'], loss_data['training_loss'], label='Training Loss')
plt.plot(loss_data['epoch/step'], loss_data['val_loss'], label='Validation Loss')
plt.xscale('log')
plt.title('Training and Validation Loss Over Iterations')
plt.xlabel('Epoch/Step')
plt.ylabel('Loss')
plt.legend()
plt.grid()
plt.show()
```
## Checkpoints
The trained model checkpoints are saved periodically during training. You can load them to continue training or for inference.
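A minimal sketch of loading a checkpoint, assuming it was saved with `torch.save` as a dict of state dicts (the file name and dict keys are illustrative, and `model` and `optimizer` are assumed to be already constructed):

```python
import torch

# File name and dict keys here are assumptions, not the repository's actual format
checkpoint = torch.load('checkpoint.pt', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_iter = checkpoint.get('iteration', 0)

model.train()  # switch to model.eval() for inference
```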