Model Card
Language Model for Next Token Prediction in English and Estonian
Model Overview
This is a neural network-based language model developed for next-token prediction tasks in English and Estonian. The model leverages RNN and LSTM architectures to achieve effective sequence modeling without relying on transformer-based approaches. It is designed to predict the next word or token based on the context provided by the preceding sequence.
Model Details
Architecture: The model uses a combination of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks to handle sequential data efficiently. Key features include:
- Multiple hidden layers for enhanced feature representation.
- Custom embedding layers for bilingual token representations.
- Optimized hyperparameters for balanced training performance.
Training Data: The model is trained on a custom bilingual corpus that includes diverse text from English and Estonian language resources.
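To make the described architecture concrete, here is a minimal PyTorch sketch of an embedding layer feeding a stacked LSTM and a linear output head. It is an illustration under stated assumptions, not the released model: the class name `LSTMLanguageModel`, the vocabulary size, and all layer dimensions are hypothetical, since this card does not state them.

```python
import torch
from torch import nn


class LSTMLanguageModel(nn.Module):
    """Hypothetical next-token model: embedding -> stacked LSTM -> vocabulary logits."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_layers=2):
        super().__init__()
        # One shared embedding table; a bilingual vocabulary is assumed.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # num_layers > 1 gives the "multiple hidden layers" mentioned above.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        hidden_states, _ = self.lstm(embedded)    # (batch, seq_len, hidden_dim)
        return self.head(hidden_states)           # (batch, seq_len, vocab_size)


torch.manual_seed(0)
lstm_lm = LSTMLanguageModel(vocab_size=100)       # assumed toy vocabulary size
token_ids = torch.randint(0, 100, (2, 5))         # batch of 2 sequences, 5 tokens each
logits = lstm_lm(token_ids)
# The prediction for the next token comes from the last position of each sequence.
next_token_ids = logits[:, -1, :].argmax(dim=-1)
print(logits.shape, next_token_ids.shape)
```

The model emits one score vector per input position; only the final position is needed when predicting the token that follows the whole sequence.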
Input and Output:
Input: A sequence of tokens in either English or Estonian.
Output: The predicted next token in the sequence.
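The mapping from model scores to a predicted token can be sketched in plain Python. The four-word vocabulary and the score values below are made up for illustration; a real model would produce one score per entry of its full vocabulary.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next_token(scores, vocab):
    """Return the highest-probability token and its probability (greedy choice)."""
    probs = softmax(scores)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]


# Hypothetical bilingual vocabulary and scores for a single prediction step.
vocab = ["tere", "hello", "maailm", "world"]
scores = [0.1, 2.0, 0.3, 1.2]
probs = softmax(scores)
token, prob = predict_next_token(scores, vocab)
print(token, round(prob, 3))
```

Greedy selection is only one decoding choice; sampling from the probabilities is a common alternative.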
Step 1: Install Required Libraries
!pip install transformers datasets evaluate torch
!pip show transformers
!pip install --upgrade transformers
Step 2: Import Libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import Dataset
import pandas as pd
Step 3: Load Pretrained Model and Tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 has no padding token by default, so reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
Step : Tokenize Dataset
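def tokenize_function(examples):
    inputs = [str(text) for text in examples["input"]]
    outputs = [str(text) for text in examples["output"]]
    # Pad both sides to the same fixed length so inputs and labels align.
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
    targets = tokenizer(outputs, max_length=512, truncation=True, padding="max_length")
    # Set padded label positions to -100 so the loss ignores them.
    model_inputs["labels"] = [
        [tok if keep else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(targets["input_ids"], targets["attention_mask"])
    ]
    return model_inputs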
# train_dataset and val_dataset must provide "input" and "output" columns.
train_tokenized = train_dataset.map(tokenize_function, batched=True)
val_tokenized = val_dataset.map(tokenize_function, batched=True)
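The tokenization step pads every sequence to a fixed length. The toy sketch below shows what that does to a single sequence, together with a common practice of setting padded label positions to -100, the default `ignore_index` of PyTorch's cross-entropy loss. The helper name `pad_with_masked_labels`, the token IDs, and the use of right-padding are assumptions made for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_with_masked_labels(token_ids, max_length, pad_id):
    """Truncate or right-pad one sequence and build matching labels."""
    ids = list(token_ids[:max_length])            # truncation
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        # Padded positions carry no training signal, so they are masked out.
        "labels": ids + [IGNORE_INDEX] * n_pad,
    }


# Hypothetical token IDs; a real tokenizer would produce these.
example = pad_with_masked_labels([11, 22, 33], max_length=5, pad_id=0)
truncated = pad_with_masked_labels([1, 2, 3, 4, 5, 6], max_length=4, pad_id=0)
print(example)
print(truncated)
```

Masking matters most when sequences are much shorter than the maximum length, because otherwise most of the loss would be computed on padding.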
Step 6: Define Training Arguments
training_args = TrainingArguments(
    output_dir="./fine_tuned_gpt2",
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    save_total_limit=2,
    save_strategy="epoch",  # must match eval_strategy when load_best_model_at_end=True
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
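To get a feel for what these settings imply, the following sketch estimates how many optimizer steps, log events, and evaluations a run would involve. The dataset size of 1000 examples is hypothetical, and the calculation assumes a single device, no gradient accumulation, and that the last partial batch is kept; the helper name `training_schedule` is illustrative only.

```python
import math


def training_schedule(num_examples, batch_size, epochs, logging_steps, num_devices=1):
    """Estimate step counts for per-epoch evaluation and checkpointing."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_events": total_steps // logging_steps,
        "evaluations": epochs,  # one evaluation and one checkpoint per epoch
    }


# Values from the training arguments above; num_examples is an assumption.
schedule = training_schedule(num_examples=1000, batch_size=4, epochs=3, logging_steps=10)
print(schedule)
```

With a batch size of 4, the number of steps grows quickly with dataset size, which is worth checking before committing to three full epochs.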