DistilBERT Fine-Tuned on Goodreads Reviews for Genre Classification

Model Details

Model Name: distilbert-goodreads-genres

Base Model: DistilBERT (distilbert-base-cased)

Task: Multi-class Text Classification

Number of Classes: 8 genres

Language: English

Architecture: Transformer-based (DistilBERT)

Parameters: Approximately 66 million

Model Size: 260 MB (fp32)

Maximum Sequence Length: 512 tokens

Framework: PyTorch with Hugging Face Transformers

Training Platform: Kaggle Notebooks with GPU (Tesla T4)

Date Trained: 2024


Intended Use

This model is designed to classify English-language book reviews into one of eight predefined genres. It is intended for:

  • Automated genre prediction for book reviews
  • Content organization and categorization systems
  • Research on literary genre characteristics
  • Educational purposes in NLP and machine learning

Supported Genres

The model classifies reviews into these eight genres:

  1. Poetry
  2. Children
  3. Mystery
  4. Romance
  5. Science Fiction
  6. Fantasy
  7. Horror
  8. Historical Fiction

Model Performance

The model was evaluated on a test set of 1,600 reviews (200 per genre):

Metric                 Score
Accuracy               89.44%
F1 Score (Weighted)    89.43%
Evaluation Loss        0.3284

Dataset: UCSD Goodreads Reviews Dataset

Training Data: 6,400 reviews (800 per genre)

Test Data: 1,600 reviews (200 per genre)


How to Use

Quick Start with Pipeline

from transformers import pipeline

classifier = pipeline("text-classification", model="srajam696/distilbert-goodreads-genres")

review = "This book was absolutely captivating from start to finish. The mystery kept me guessing until the very end."
result = classifier(review)
print(result)
# Example output (the exact score will vary): [{'label': 'LABEL_2', 'score': 0.9876}]
# LABEL_2 is index 2 in the genre list, i.e. Mystery.
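
The example output above uses a generic `LABEL_n` identifier rather than a genre name. A minimal sketch of converting such labels, assuming the index order matches the Supported Genres list (the helper name `label_to_genre` is hypothetical and not part of the model or the Transformers library):

```python
# Genre names in the order given under "Supported Genres" (index 0 = Poetry).
GENRES = [
    "Poetry", "Children", "Mystery", "Romance",
    "Science Fiction", "Fantasy", "Horror", "Historical Fiction",
]

def label_to_genre(label: str) -> str:
    """Map a generic pipeline label such as 'LABEL_2' to a genre name."""
    index = int(label.rsplit("_", 1)[1])
    return GENRES[index]

# Applied to a result shaped like the pipeline output shown above.
result = [{"label": "LABEL_2", "score": 0.9876}]
for item in result:
    print(f"{label_to_genre(item['label'])} ({item['score']:.2%})")
```

If the hosted model configuration already defines an `id2label` mapping, this step is unnecessary.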

Using Model and Tokenizer Directly

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

model_name = "srajam696/distilbert-goodreads-genres"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)

# Genre mapping
genres = {
    0: "Poetry",
    1: "Children",
    2: "Mystery",
    3: "Romance",
    4: "Science Fiction",
    5: "Fantasy",
    6: "Horror",
    7: "Historical Fiction"
}

review = "A truly magical world filled with wonder and adventure."
inputs = tokenizer(
    review, 
    truncation=True, 
    padding=True, 
    max_length=512, 
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_id = logits.argmax(-1).item()

print(f"Predicted Genre: {genres[predicted_id]}")
print(f"Confidence: {torch.softmax(logits, dim=-1).max().item():.4f}")

Batch Processing

# Reuses torch, tokenizer, model, and genres from the previous example.
reviews = [
    "A mysterious tale that kept me on the edge of my seat.",
    "The perfect love story for a rainy afternoon.",
    "Futuristic technology and mind-bending concepts."
]

inputs = tokenizer(
    reviews, 
    truncation=True, 
    padding=True, 
    max_length=512, 
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits.argmax(-1)

for review, pred in zip(reviews, predictions):
    print(f"Review: {review[:60]}... -> {genres[pred.item()]}")
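
Tokenizing a very large list in one call pads every review to the longest one and holds all tensors in memory at once. One possible approach is to process the list in fixed-size chunks. The sketch below shows only the chunking step; the helper `chunked` is hypothetical, and the batch size of 32 simply mirrors the evaluation batch size in the training configuration.

```python
from typing import Iterator, List

def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 70 placeholder reviews split into batches of up to 32.
reviews = [f"review {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in chunked(reviews, 32)]
print(batch_sizes)  # [32, 32, 6]
```

Each chunk would then go through the tokenizer and model exactly as in the batch example above.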

Training Details

Dataset

Source: UCSD Goodreads Reviews Dataset by Mengting Wan and Julian McAuley

Link: https://mengtingwan.github.io/data/goodreads.html

Data Preparation:

  • Downloaded from remote servers using streaming decompression
  • Randomly sampled 1,000 reviews per genre
  • Split: 800 training (80%) and 200 test (20%) per genre
  • Total: 6,400 training samples and 1,600 test samples
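
The preparation steps above can be sketched for a single genre as follows. This is an illustration under stated assumptions, not the notebook's actual code: the field name `review_text`, the seed, and the helper names are assumed, and an in-memory synthetic file stands in for a downloaded genre file.

```python
import gzip
import io
import json
import random

def stream_reviews(fileobj):
    """Yield non-empty review texts one at a time from a gzipped JSON-lines stream."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            record = json.loads(line)
            text = record.get("review_text", "").strip()  # assumed field name
            if text:
                yield text

def sample_and_split(texts, n_sample=1000, n_train=800, seed=42):
    """Randomly pick n_sample texts, then split them into train and test lists."""
    rng = random.Random(seed)
    picked = rng.sample(list(texts), n_sample)  # materializes the texts in memory
    return picked[:n_train], picked[n_train:]

# Stand-in for one genre file: 1,500 synthetic records compressed in memory.
raw = "\n".join(json.dumps({"review_text": f"review {i}"}) for i in range(1500))
fake_file = io.BytesIO(gzip.compress(raw.encode("utf-8")))

train, test = sample_and_split(stream_reviews(fake_file))
print(len(train), len(test))  # 800 200
```

Repeating this for each of the eight genres gives the 6,400 training and 1,600 test samples listed above.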

Training Configuration

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
    learning_rate=3e-5,
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    run_name="distilbert-run-1"
)
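
For a sense of scale, the configuration above implies the step counts and learning-rate curve computed below. This is a sketch that assumes a single GPU, no gradient accumulation, and the Trainer's default linear schedule (the configuration does not set `lr_scheduler_type`); `lr_at` is an illustrative helper, not a library function.

```python
import math

TRAIN_SAMPLES = 6_400   # training set size from the data preparation section
BATCH_SIZE = 16         # per_device_train_batch_size
EPOCHS = 3              # num_train_epochs
WARMUP_STEPS = 100      # warmup_steps
PEAK_LR = 3e-5          # learning_rate

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero at the last step."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - WARMUP_STEPS)

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
print(f"warmup covers {WARMUP_STEPS / total_steps:.1%} of training")
for step in (0, 50, 100, 650, 1200):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions training runs for 1,200 optimizer steps, and the learning rate peaks at step 100 before decaying.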

Tokenization

  • Tokenizer: DistilBertTokenizerFast
  • Vocabulary: distilbert-base-cased (case-sensitive)
  • Truncation: Enabled (max 512 tokens)
  • Padding: Enabled
  • Special tokens: [CLS] and [SEP] added to each sequence

Training Setup

  • Framework: PyTorch with Hugging Face Transformers
  • Optimizer: AdamW
  • Loss Function: Cross-Entropy Loss
  • Metrics: Accuracy and Weighted F1 Score
  • Platform: Kaggle Notebooks
  • GPU: Tesla T4
  • Training Time: Approximately 30 minutes for 3 epochs

Experiment Tracking

All training runs were tracked using Weights and Biases (W&B):

Project: mlops-assignment2

Dashboard: https://wandb.ai/srajam696-charan/mlops-assignment2

Tracked Metrics:

  • Training loss (every 50 steps)
  • Validation loss (per epoch)
  • Accuracy (per epoch)
  • F1 score (per epoch)
  • Learning rate schedule
  • GPU/CPU utilization
  • All hyperparameters

Model Architecture

DistilBERT architecture consists of:

  • 6 transformer layers (reduced from BERT's 12)
  • Hidden size of 768
  • 12 attention heads per layer
  • Feed-forward inner dimension of 3,072
  • Sequence classification head with 8 output units
  • Total parameters: 66 million (40% smaller than BERT)
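
The parameter total can be checked by hand from the dimensions listed above. The sketch below assumes a vocabulary of 28,996 tokens for distilbert-base-cased and a classification head made of a 768-to-768 pre-classifier followed by the 8-way classifier; both are assumptions about the checkpoint rather than figures stated in this card.

```python
HIDDEN = 768
FFN = 3_072
LAYERS = 6
MAX_POSITIONS = 512
NUM_CLASSES = 8
VOCAB_SIZE = 28_996  # assumption: vocabulary size of distilbert-base-cased

def linear(n_in: int, n_out: int) -> int:
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

# Token embeddings, position embeddings, and the embedding LayerNorm.
embeddings = VOCAB_SIZE * HIDDEN + MAX_POSITIONS * HIDDEN + layer_norm
# Query, key, value, and output projections.
attention = 4 * linear(HIDDEN, HIDDEN)
feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN)
per_layer = attention + feed_forward + 2 * layer_norm
# Assumed head: pre-classifier followed by the 8-way classifier.
head = linear(HIDDEN, HIDDEN) + linear(HIDDEN, NUM_CLASSES)

total = embeddings + LAYERS * per_layer + head
print(f"{total:,} parameters, about {total * 4 / 1e6:.0f} MB in fp32")
```

Under these assumptions the count is 65,787,656, consistent with the approximately 66 million parameters and roughly 260 MB fp32 size stated in this card.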

Key improvements over BERT:

  • 40% fewer parameters
  • 60% faster inference
  • 97% of BERT's language-understanding performance retained, as reported in the DistilBERT paper

Limitations and Biases

Known Limitations

  1. Fixed Genre Set: The model can only predict the eight trained genres. Reviews describing other genres will be forced into one of these categories.

  2. Domain Specificity: Trained exclusively on Goodreads reviews. Performance may degrade on other book review sources.

  3. Language: The model is English-specific and is not expected to perform well on text in other languages.

  4. Genre Overlap: Some genres have inherent overlap. The model may struggle to distinguish between similar categories.

  5. Subjectivity: Genre classification is inherently subjective. Even human annotators can disagree on a book's genre, which limits the accuracy any model can reach.

Bias Considerations

The model's performance varies across genres, reflecting characteristics present in the training data. Goodreads reviews may not represent all reader populations equally. The model should not be used as the sole decision-maker for critical genre classification tasks without human oversight.

Recommendations

For production deployment:

  • Maintain human-in-the-loop review for critical applications
  • Monitor performance metrics over time
  • Regularly audit predictions for bias
  • Consider ensemble approaches for improved robustness
  • Implement confidence thresholding for uncertain predictions
  • Retrain periodically with new data
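
As an illustration of the confidence-thresholding recommendation, the sketch below withholds a prediction when the top softmax probability falls under a cutoff. The threshold of 0.7 and the helper names are assumptions chosen for the example; a real cutoff would need tuning on held-out data.

```python
import math
from typing import List, Optional, Tuple

GENRES = [
    "Poetry", "Children", "Mystery", "Romance",
    "Science Fiction", "Fantasy", "Horror", "Historical Fiction",
]

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_threshold(
    logits: List[float], threshold: float = 0.7
) -> Tuple[Optional[str], float]:
    """Return (genre, confidence); genre is None when confidence is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    genre = GENRES[best] if confidence >= threshold else None
    return genre, confidence

# One clearly peaked logit vector and one nearly flat vector.
confident = predict_with_threshold([0.1, 0.2, 5.0, 0.3, 0.1, 0.0, 0.2, 0.1])
uncertain = predict_with_threshold([1.0, 0.9, 1.1, 1.0, 0.8, 1.0, 0.9, 1.0])
print(confident)
print(uncertain)
```

Predictions returned as `None` could be routed to human review, in line with the human-in-the-loop recommendation above.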

Environmental Impact

Hardware: Tesla T4 GPU (Kaggle)

Training Duration: Approximately 30 minutes

GPU Utilization: Near-peak during training

Estimated Carbon: Minimal (single training run on shared infrastructure)

Inference: Low-resource (66M parameters, suitable for CPU or edge devices)
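
A rough upper bound on GPU energy follows from the figures above. The sketch uses the T4's 70 W board power and the stated 30-minute duration. The grid carbon intensity is an illustrative assumption, not a measured value for the Kaggle infrastructure, and the estimate ignores CPU, host, and cooling overhead.

```python
GPU_TDP_WATTS = 70          # NVIDIA T4 board power limit
TRAINING_HOURS = 0.5        # "approximately 30 minutes" from this card
GRID_KG_CO2E_PER_KWH = 0.4  # ASSUMPTION: illustrative grid intensity, not measured

# Energy if the GPU drew its full power limit for the whole run.
energy_kwh = GPU_TDP_WATTS * TRAINING_HOURS / 1000
carbon_kg = energy_kwh * GRID_KG_CO2E_PER_KWH

print(f"GPU energy upper bound: {energy_kwh * 1000:.0f} Wh")
print(f"Estimated emissions: {carbon_kg * 1000:.0f} g CO2e under the assumed intensity")
```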


Evaluation Results

Overall Performance

  • Test Accuracy: 89.44%
  • Weighted F1: 89.43%
  • Loss: 0.3284

Per-Class Performance

Performance varies across genres due to data characteristics and inherent genre distinctions:

  • Strong performance on Mystery, Romance, and Science Fiction
  • Moderate performance on Fantasy and Historical Fiction
  • Lower performance on Poetry, attributed to overlap with the characteristics of Literary Fiction reviews (a genre outside the eight labels)

Evaluation Metrics

Calculated using scikit-learn:

  • accuracy_score()
  • f1_score(average='weighted')
  • classification_report()
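
To make the weighted F1 metric concrete, the sketch below computes accuracy and support-weighted F1 by hand on a toy example. It is intended to mirror what `f1_score(average='weighted')` calculates, not to replace scikit-learn; the function names and toy labels are illustrative.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)

# Toy example with three classes and two samples per class.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} weighted_f1={f1:.4f}")
```

Because the test set here is balanced at 200 reviews per genre, weighted F1 and macro F1 coincide for this model.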

Resources and Links

Model Repository: https://huggingface.co/srajam696/distilbert-goodreads-genres

GitHub Repository: [Your GitHub link]

Kaggle Notebook: https://www.kaggle.com/code/omshivamnlr/mlops2/edit

Weights and Biases: https://wandb.ai/srajam696-charan/mlops-assignment2

Dataset: https://mengtingwan.github.io/data/goodreads.html

DistilBERT Paper: https://arxiv.org/abs/1910.01108

Hugging Face Documentation: https://huggingface.co/docs/transformers/


Citation

If you use this model in your research or applications, please cite:

@misc{distilbert_goodreads_genres_2024,
  author = {Srajam696},
  title = {DistilBERT Fine-Tuned for Goodreads Genre Classification},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/srajam696/distilbert-goodreads-genres}},
}

Dataset Citation

If using the UCSD Goodreads dataset, please cite:

@inproceedings{wan2018item,
  author = {Wan, Mengting and McAuley, Julian},
  title = {Item Recommendation on Monotonic Behavior Chains},
  booktitle = {Proceedings of the 12th ACM Conference on Recommender Systems (RecSys)},
  year = {2018},
  url = {https://mengtingwan.github.io/data/goodreads.html}
}

License

This model is released under the MIT License. See LICENSE file for details.


Acknowledgments

  • UCSD Goodreads Dataset by Mengting Wan and Julian McAuley
  • Hugging Face team for transformers library and model hub
  • Weights and Biases for experiment tracking
  • Kaggle for free GPU compute resources

Contact and Support

For questions, issues, or suggestions:

  • Check the model card and documentation
  • Review the GitHub repository
  • Consult the Kaggle notebook for implementation details
  • Access the W&B project for training metrics and logs

Last Updated: 2024

Model Version: 1.0
