Model Card

Language Model for Next Token Prediction in English and Estonian

Model Overview

This is a neural network-based language model developed for next-token prediction in English and Estonian. The model uses RNN and LSTM architectures for sequence modeling rather than a transformer-based approach. It predicts the next word or token from the context provided by the preceding sequence.


Model Details

Architecture: The model uses a combination of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks to handle sequential data efficiently. Key features include:

Multiple hidden layers for enhanced feature representation.
Custom embedding layers for bilingual token representations.
Optimized hyperparameters for balanced training performance.

Training Data: The model is trained on a custom bilingual corpus that includes diverse text from English and Estonian language resources.
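The architecture described above (embedding layer, stacked recurrent layers, prediction over the vocabulary) could be sketched in PyTorch roughly as follows. This is a minimal illustration and not the released model: the class name `NextTokenLSTM`, the layer sizes, and the vocabulary size of 1000 are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn


class NextTokenLSTM(nn.Module):
    """Hypothetical sketch: embedding -> stacked LSTM -> projection over the vocabulary."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        # One embedding table shared by English and Estonian tokens (assumed design).
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # num_layers > 1 stacks LSTM layers, i.e. multiple hidden layers.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        hidden_states, _ = self.lstm(embedded)    # (batch, seq_len, hidden_dim)
        return self.head(hidden_states)           # (batch, seq_len, vocab_size)


lstm_model = NextTokenLSTM(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 12))          # 2 sequences of 12 token ids
logits = lstm_model(tokens)
next_token = logits[:, -1, :].argmax(dim=-1)      # greedy choice for the next token
print(logits.shape, next_token.shape)
```

The logits at the last position score every vocabulary entry as a candidate next token; taking the argmax is the simplest decoding choice, and sampling would be an alternative.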

Input and Output:

Input: A sequence of tokens in either English or Estonian.
Output: The predicted next token in the sequence.
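To make the input/output contract concrete, the sketch below turns one token sequence into (context, next token) pairs, which is the shape of example a next-token model learns from. The whitespace-style tokens, the helper name `next_token_pairs`, and the `context_size` of 3 are assumptions for illustration; the actual tokenizer and context length are not specified here.

```python
def next_token_pairs(tokens, context_size=3):
    """Return (context, next_token) pairs from one token sequence."""
    pairs = []
    for i in range(1, len(tokens)):
        # Keep at most `context_size` preceding tokens as the model input.
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```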

Step 1: Install Required Libraries

!pip install transformers datasets evaluate torch

!pip show transformers

!pip install --upgrade transformers

Step 2: Import Libraries

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import Dataset
import pandas as pd

Step 3: Load Pretrained Model and Tokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 has no padding token, so reuse the end-of-sequence token for padding.
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

Step : Tokenize Dataset

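def tokenize_function(examples):
    inputs = [str(text) for text in examples["input"]]
    outputs = [str(text) for text in examples["output"]]
    # Pad and truncate both sides to the same fixed length so the tensors align.
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
    targets = tokenizer(outputs, max_length=512, truncation=True, padding="max_length")
    # Set padded label positions to -100 so the loss ignores them.
    model_inputs["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(targets["input_ids"], targets["attention_mask"])
    ]
    return model_inputs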

train_tokenized = train_dataset.map(tokenize_function, batched=True)
val_tokenized = val_dataset.map(tokenize_function, batched=True)
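The calls above expect `train_dataset` and `val_dataset` to exist already, each with `input` and `output` columns. How they were built is not shown, so the following is only one plausible sketch: it shuffles a list of records and splits it into training and validation lists. The helper `split_records`, the 10% validation fraction, the seed, and the placeholder records are all assumptions.

```python
import random


def split_records(records, val_fraction=0.1, seed=42):
    """Shuffle {"input", "output"} records and split them into train/validation lists."""
    shuffled = list(records)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)         # seeded for a reproducible split
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder data; real records would come from the bilingual corpus.
records = [{"input": f"prompt {i}", "output": f"completion {i}"} for i in range(20)]
train_records, val_records = split_records(records)
print(len(train_records), len(val_records))

# With the `datasets` library installed, the lists can then be wrapped as:
#   train_dataset = Dataset.from_list(train_records)
#   val_dataset = Dataset.from_list(val_records)
```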

Step 6: Define Training Arguments

training_args = TrainingArguments(
    output_dir="./fine_tuned_gpt2",
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    save_total_limit=2,
    save_strategy="epoch",  # must match eval_strategy when load_best_model_at_end=True
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
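Two quantities help when reading these settings: roughly how many optimizer steps the run will take, and how `eval_loss` relates to perplexity. The sketch below estimates both, assuming a single device, no gradient accumulation, and a hypothetical training set of 1000 examples; the helper names are illustrative and not part of the `transformers` API.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Rough step count: batches per epoch (last partial batch kept) times epochs."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return batches_per_epoch * num_epochs


def perplexity(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(eval_loss)


# 1000 examples is a made-up dataset size; batch size and epochs match the settings above.
steps = total_optimizer_steps(num_examples=1000, per_device_batch_size=4, num_epochs=3)
log_entries = steps // 10                         # logging_steps=10
print(steps, log_entries, perplexity(0.0))
```

Since `metric_for_best_model` is `eval_loss`, the checkpoint kept at the end is the one with the lowest validation loss, which is also the one with the lowest perplexity.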
