
# Neural Network-Based Language Model for Next Token Prediction

## Overview

This project implements a neural network-based language model for next-token prediction in two languages: English and Icelandic. The model is built without transformer or encoder-decoder architectures, focusing instead on traditional recurrent neural network techniques.

## Table of Contents

- Installation
- Usage
- Model Architecture
- Training
- Text Generation
- License
- Results

## Installation

To run this project, you need Python installed along with the following libraries:

```bash
pip install torch numpy pandas huggingface_hub
```

## Usage

1. Upload or open the notebook in Google Colab.
2. Run all cells sequentially to load the models, configure the text generation process, and view the outputs.
3. Modify the seed text to generate different text sequences. You can provide your own input to see how the model generates text in response.

## Model Architecture

The model used in this notebook is based on a recurrent architecture (RNN/LSTM), a family of networks commonly used for sequence-prediction tasks like text generation. The architecture consists of:

- **Embedding layer:** converts input words into dense vectors of fixed size.
- **LSTM/GRU layers:** handle sequential data and maintain long-range dependencies between words.
- **Dense output layer:** produces a score for each candidate next word in the vocabulary.

This architecture lets the model learn from the preceding words and effectively predict the next one in the sequence.
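A minimal PyTorch sketch of this kind of architecture is shown below. The class name `NextTokenLSTM` and the layer sizes are illustrative assumptions, not values taken from the notebook:

```python
import torch
import torch.nn as nn

class NextTokenLSTM(nn.Module):
    """Embedding -> LSTM -> Linear, scoring the next token at each step."""

    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # tokens -> dense vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)           # scores over the vocabulary

    def forward(self, token_ids, hidden=None):
        emb = self.embedding(token_ids)       # (batch, seq_len, embed_dim)
        out, hidden = self.lstm(emb, hidden)  # (batch, seq_len, hidden_dim)
        logits = self.fc(out)                 # (batch, seq_len, vocab_size)
        return logits, hidden
```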

## Training

The model used in this notebook is pre-trained: it has already been trained on a large dataset for both English and Icelandic text generation. However, if you wish to re-train the model or fine-tune it on your own data, you can do so by adding a training loop to the notebook. Ensure you have a dataset and adjust the training parameters (such as batch size, number of epochs, and learning rate). A basic outline of how the training could be set up (a code sketch follows this list):

1. Preprocess your text data into sequences.
2. Split the data into training and validation sets.
3. Train the model on the sequences, optimizing the loss function.
4. Save the model after training for future use.
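As an illustration only, a training loop along those lines might look like the sketch below. The `train_loader` of `(inputs, targets)` token batches and all hyperparameter values are assumptions, not taken from the notebook:

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs: int = 10, lr: float = 1e-3, device: str = "cpu"):
    """Optimize next-token cross-entropy over batches of (inputs, targets)."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        total_loss = 0.0
        for inputs, targets in train_loader:  # each: (batch, seq_len) token ids; assumed loader
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            logits, _ = model(inputs)
            # Flatten so CrossEntropyLoss compares (N, vocab_size) scores with (N,) targets.
            loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"epoch {epoch + 1}: loss = {total_loss / len(train_loader):.4f}")

    torch.save(model.state_dict(), "model.pt")  # keep the trained weights for later use
```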

## Text Generation

In this notebook, the model is used for text generation. It works by taking an initial seed text (a starting sequence) and repeatedly predicting the next word to generate a longer sequence.

Steps for text generation (a code sketch follows this list):

1. Provide a seed text in English or Icelandic.
2. Run the code cell to generate text based on the provided input.
3. The output is displayed as a continuation of the seed text.
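A minimal sketch of a greedy generation loop of this kind, assuming the model interface from the architecture sketch above; the `encode`/`decode` helpers are hypothetical stand-ins for whatever word-to-index mapping the notebook uses:

```python
import torch

@torch.no_grad()
def generate(model, encode, decode, seed_text: str, max_new_tokens: int = 20):
    """Repeatedly feed the sequence back in and append the most likely next token."""
    model.eval()
    token_ids = encode(seed_text)           # list[int]; hypothetical helper
    for _ in range(max_new_tokens):
        inputs = torch.tensor([token_ids])  # (1, seq_len)
        logits, _ = model(inputs)
        next_id = int(logits[0, -1].argmax())  # greedy pick of the next token
        token_ids.append(next_id)
    return decode(token_ids)                # hypothetical helper
```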

Example:

- English seed text: "Today is a good day"
  Generated output: "Today is a good day to explore the new opportunities available."
- Icelandic seed text: "þetta mun auka" ("this will increase")
  Generated output: "þetta mun auka áberandi í utan eins og vieigandi..."

## License

This notebook is available for educational purposes. Feel free to modify and use it for your own experiments or projects. Note, however, that the pre-trained models and certain dependencies may carry their own licenses, so make sure you comply with their usage policies.

## Results

The training and validation loss curves are provided in the submission. The model's performance is evaluated on the quality of the generated text and the perplexity score tracked during training.
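For reference, perplexity can be computed as the exponential of the average per-token cross-entropy loss; a minimal sketch:

```python
import math

def perplexity(avg_cross_entropy_loss: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (natural-log base)."""
    return math.exp(avg_cross_entropy_loss)

print(perplexity(4.2))  # e.g. an average loss of 4.2 nats -> perplexity of about 66.7
```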