
distilbert-base-uncased-finetuned-imdb-v2

This model is a fine-tuned version of distilbert-base-uncased on the imdb dataset. It achieves the following results on the evaluation set:

  • Loss: 2.3033

Model description

This model is a fine-tuned version of DistilBERT base uncased on the IMDb dataset. It was trained with a masked language modeling (MLM) objective, i.e. to predict randomly masked tokens within a sentence rather than the next word. Fine-tuning adapts the model to the language patterns and sentiment vocabulary of movie reviews.

Intended uses & limitations

This model is designed for the fill-mask task: given a sentence containing one or more [MASK] tokens, it predicts the most likely words for those positions from the surrounding context. This makes it useful for cloze-style sentence completion, ranking candidate words, and improving auto-completion in writing applications. It is a masked language model, not a generative or sentiment-classification model, so it does not produce free-form text. It may also struggle with domain-specific language or topics absent from the IMDb dataset, and it is unlikely to perform well on languages other than English.
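
For quick experimentation, the transformers fill-mask pipeline wraps the tokenize-predict-decode steps shown under "How to use" into a single call. A minimal sketch (the example sentence is arbitrary):

from transformers import pipeline

# The fill-mask pipeline returns the top predictions for the [MASK] token,
# each as a dict with "token_str" and "score" fields.
unmasker = pipeline("fill-mask", model="Francesco-A/distilbert-base-uncased-finetuned-imdb-v2")
for pred in unmasker("This movie is really [MASK]."):
    print(pred["token_str"], round(pred["score"], 4))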

Training and evaluation data

The model was fine-tuned on a 40,000-review subset of the IMDb dataset and evaluated on a separate held-out set of 6,000 reviews.
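
The exact sampling procedure is not recorded in this card. The following is only a sketch of how a 40,000/6,000 split could be drawn with the datasets library; the pooled split and the seed are assumptions, not the original recipe:

from datasets import load_dataset

# Hypothetical reconstruction: IMDb ships 25k train, 25k test, and 50k
# unsupervised reviews, so a 40k/6k split can be sampled from the
# combined pool. Split choice and seed are assumptions.
pool = load_dataset("imdb", split="train+test+unsupervised").shuffle(seed=42)
train_ds = pool.select(range(40_000))
eval_ds = pool.select(range(40_000, 46_000))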

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto TrainingArguments follows the list):

  • learning_rate: 2e-05
  • train_batch_size: 64
  • eval_batch_size: 64
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3
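
As a hedged sketch, these settings map onto transformers TrainingArguments as shown below; the output_dir and evaluation_strategy values are assumptions, since the original training script is not part of this card:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-base-uncased-finetuned-imdb-v2",  # assumed name
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    # Adam betas=(0.9, 0.999) and epsilon=1e-08 are the library defaults,
    # so no explicit override is needed.
    lr_scheduler_type="linear",
    num_train_epochs=3,
    evaluation_strategy="epoch",  # assumption: matches the per-epoch results below
)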

Training results

Training Loss   Epoch   Step   Validation Loss
2.4912          1.0     625    2.3564
2.4209          2.0     1250   2.3311
2.4000          3.0     1875   2.3038
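
For a masked language modeling objective, cross-entropy loss converts to perplexity via exp(loss), which is often easier to interpret. A quick check on the final evaluation loss:

import math

# Perplexity = exp(cross-entropy loss) for a language modeling objective.
eval_loss = 2.3033
print(f"Perplexity: {math.exp(eval_loss):.2f}")  # ~10.01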

Framework versions

  • Transformers 4.31.0
  • Pytorch 2.0.1+cu118
  • Datasets 2.14.4
  • Tokenizers 0.13.3

How to use

import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Francesco-A/distilbert-base-uncased-finetuned-imdb-v2")
model = AutoModelForMaskedLM.from_pretrained("Francesco-A/distilbert-base-uncased-finetuned-imdb-v2")

# Example sentence
sentence = "This movie is really [MASK]."

# Tokenize the sentence
inputs = tokenizer(sentence, return_tensors="pt")

# Get the model's predictions
with torch.no_grad():
    outputs = model(**inputs)

# Get the top-k predicted tokens and their probabilities
k = 5  # Number of top predictions to retrieve
masked_token_index = inputs["input_ids"].tolist()[0].index(tokenizer.mask_token_id)
predicted_token_logits = outputs.logits[0, masked_token_index]
topk_values, topk_indices = torch.topk(torch.softmax(predicted_token_logits, dim=-1), k)

# Convert top predicted token indices to words
predicted_tokens = [tokenizer.decode(idx.item()) for idx in topk_indices]
# Convert probabilities to Python floats
probs = topk_values.tolist()

# Create a DataFrame to display the top predicted words and probabilities
data = {
    "Predicted Words": predicted_tokens,
    "Probability": probs,
}

df = pd.DataFrame(data)

# Display the top predicted words and their probabilities
print(df)