DistilBERT Fine-Tuned on Goodreads Reviews for Genre Classification

Model Details

Model Name: distilbert-goodreads-genres

Base Model: DistilBERT (distilbert-base-cased)

Task: Multi-class Text Classification

Number of Classes: 8 genres

Language: English

Architecture: Transformer-based (DistilBERT)

Parameters: Approximately 66 million

Model Size: 260 MB (fp32)

Maximum Sequence Length: 512 tokens

Framework: PyTorch with Hugging Face Transformers

Training Platform: Kaggle Notebooks with GPU (Tesla T4)

Date Trained: 2024

Intended Use

This model is designed to classify English-language book reviews into one of eight predefined genres. It is intended for:

Automated genre prediction for book reviews
Content organization and categorization systems
Research on literary genre characteristics
Educational purposes in NLP and machine learning

Supported Genres

The model classifies reviews into these eight genres:

Poetry
Children
Mystery
Romance
Science Fiction
Fantasy
Horror
Historical Fiction

Model Performance

The model was evaluated on a test set of 1,600 reviews (200 per genre):

Metric	Score
Accuracy	89.44%
F1 Score (Weighted)	89.43%
Evaluation Loss	0.3284

Dataset: UCSD Goodreads Reviews Dataset

Training Data: 6,400 reviews (800 per genre)

Test Data: 1,600 reviews (200 per genre)
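
As a rough guide to how precise the reported accuracy is, the sketch below recovers the number of correct test predictions implied by the figures above and attaches a normal-approximation 95% interval. The interval is an assumption about how one might quantify uncertainty; the card itself reports only the point estimate.

```python
import math

n_test = 1600          # test set size stated in this card
accuracy = 0.8944      # reported test accuracy

# Whole number of correct predictions implied by the reported accuracy.
n_correct = round(accuracy * n_test)
p = n_correct / n_test

# Normal-approximation 95% interval for a proportion (illustrative choice,
# not something computed in the original evaluation).
std_err = math.sqrt(p * (1 - p) / n_test)
low, high = p - 1.96 * std_err, p + 1.96 * std_err

print(f"{n_correct} of {n_test} correct")
print(f"95% interval: {low:.3f} to {high:.3f}")
```

With a test set of this size the interval spans roughly three percentage points, so small accuracy differences between runs should be read with care.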

How to Use

Quick Start with Pipeline

from transformers import pipeline

classifier = pipeline("text-classification", model="srajam696/distilbert-goodreads-genres")

review = "This book was absolutely captivating from start to finish. The mystery kept me guessing until the very end."
result = classifier(review)
print(result)
# Example output (the score shown is illustrative): [{'label': 'LABEL_2', 'score': 0.9876}]
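
The pipeline returns generic names of the form LABEL_n rather than genre names. A minimal sketch of translating such a result, assuming the class indices follow the genre mapping used elsewhere in this card; `to_genre` and `GENRES` are hypothetical names for illustration and are not part of Transformers.

```python
# Index-to-genre order, copied from the genre mapping in this card.
GENRES = ["Poetry", "Children", "Mystery", "Romance",
          "Science Fiction", "Fantasy", "Horror", "Historical Fiction"]

def to_genre(prediction: dict) -> tuple[str, float]:
    """Convert one pipeline result such as {'label': 'LABEL_2', 'score': 0.98}
    into a (genre, score) pair. Hypothetical helper for illustration."""
    index = int(prediction["label"].rsplit("_", 1)[1])
    return GENRES[index], prediction["score"]

# Hand-written result dict in the same shape as the pipeline output.
genre, score = to_genre({"label": "LABEL_2", "score": 0.9876})
print(genre, score)
```

In practice the helper would be applied to each element of the list returned by `classifier(review)`.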

Using Model and Tokenizer Directly

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

model_name = "srajam696/distilbert-goodreads-genres"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)

# Genre mapping
genres = {
    0: "Poetry",
    1: "Children",
    2: "Mystery",
    3: "Romance",
    4: "Science Fiction",
    5: "Fantasy",
    6: "Horror",
    7: "Historical Fiction"
}

review = "A truly magical world filled with wonder and adventure."
inputs = tokenizer(
    review, 
    truncation=True, 
    padding=True, 
    max_length=512, 
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_id = logits.argmax(-1).item()

print(f"Predicted Genre: {genres[predicted_id]}")
print(f"Confidence: {torch.softmax(logits, dim=-1).max().item():.4f}")

Batch Processing

# Reuses tokenizer, model, and genres from the previous example.
reviews = [
    "A mysterious tale that kept me on the edge of my seat.",
    "The perfect love story for a rainy afternoon.",
    "Futuristic technology and mind-bending concepts."
]

inputs = tokenizer(
    reviews, 
    truncation=True, 
    padding=True, 
    max_length=512, 
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits.argmax(-1)

for review, pred in zip(reviews, predictions):
    print(f"Review: {review[:60]}... -> {genres[pred.item()]}")

Training Details

Dataset

Source: UCSD Goodreads Reviews Dataset by Mengting Wan and Julian McAuley

Link: https://mengtingwan.github.io/data/goodreads.html

Data Preparation:

Downloaded from remote servers using streaming decompression
Randomly sampled 1,000 reviews per genre
Split: 800 training (80%) and 200 test (20%) per genre
Total: 6,400 training samples and 1,600 test samples
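
The preparation steps above could be sketched roughly as follows. This is not the notebook's actual code: the `review_text` field name, the seed, the helper names, and the use of reservoir sampling to draw the random sample are all assumptions made for illustration. The reader accepts any binary file-like object, such as an open file or an HTTP response, and decompresses it line by line.

```python
import gzip
import io
import json
import random

def stream_reviews(fileobj, text_field="review_text"):
    """Yield review texts one at a time from a gzipped JSON Lines stream,
    so the full file never has to be decompressed in memory."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            record = json.loads(line)
            text = (record.get(text_field) or "").strip()
            if text:
                yield text

def sample_and_split(texts, n_sample=1000, n_train=800, seed=42):
    """Reservoir-sample n_sample texts from an iterator, then split them
    into training and test lists."""
    rng = random.Random(seed)
    reservoir = []
    for i, text in enumerate(texts):
        if i < n_sample:
            reservoir.append(text)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n_sample:
                reservoir[j] = text
    rng.shuffle(reservoir)
    return reservoir[:n_train], reservoir[n_train:]

# Tiny in-memory stand-in for one genre file (real files are far larger).
raw = "\n".join(json.dumps({"review_text": f"review {i}"}) for i in range(50))
fake_file = io.BytesIO(gzip.compress(raw.encode("utf-8")))
train, test = sample_and_split(stream_reviews(fake_file), n_sample=10, n_train=8)
print(len(train), len(test))
```

Running the same routine once per genre with the default arguments would give the 800/200 split per genre described above.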

Training Configuration

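from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
    learning_rate=3e-5,
    logging_steps=50,           # log training loss every 50 steps
    eval_strategy="epoch",      # named evaluation_strategy in older Transformers releases
    save_strategy="epoch",      # must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    report_to="wandb",
    run_name="distilbert-run-1"
)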
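
To make the configuration above more concrete, the sketch below works out the number of optimizer steps it implies and the learning rate at a few points in training. It assumes a single GPU, no gradient accumulation, and the Trainer's default linear schedule (linear warmup, then linear decay to zero); none of these is stated explicitly in this card. `linear_schedule_lr` is a hypothetical helper that mirrors that schedule.

```python
import math

train_samples = 6400      # training set size stated in this card
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs
warmup_steps = 100
peak_lr = 3e-5

# Assumes one GPU and no gradient accumulation.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

def linear_schedule_lr(step: int) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
for step in (0, 50, 100, 650, 1200):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

Under these assumptions the warmup covers the first quarter of the first epoch, and `logging_steps=50` yields 24 training-loss log points over the run.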

Tokenization

Tokenizer: DistilBertTokenizerFast
Model: distilbert-base-cased
Truncation: Enabled (max 512 tokens)
Padding: Enabled
Special tokens: Preserved

Training Setup

Framework: PyTorch with Hugging Face Transformers
Optimizer: AdamW
Loss Function: Cross-Entropy Loss
Metrics: Accuracy and Weighted F1 Score
Platform: Kaggle Notebooks
GPU: Tesla T4
Training Time: Approximately 30 minutes for 3 epochs
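
For context on the cross-entropy loss listed above, the sketch below computes the loss of a model that guesses uniformly over eight classes and compares it with the reported evaluation loss of 0.3284. It is written in plain Python for illustration; training itself used the loss built into the PyTorch model.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]

num_classes = 8
uniform_loss = cross_entropy([0.0] * num_classes, target=0)  # equals ln(8)
reported_loss = 0.3284

print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"exp(-reported loss): {math.exp(-reported_loss):.3f}")
```

The second figure is the geometric mean of the probability the model assigns to the correct genre, which gives a more intuitive reading of the loss value.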

Experiment Tracking

All training runs were tracked using Weights and Biases (W&B):

Project: mlops-assignment2

Dashboard: https://wandb.ai/srajam696-charan/mlops-assignment2

Tracked Metrics:

Training loss (every 50 steps)
Validation loss (per epoch)
Accuracy (per epoch)
F1 score (per epoch)
Learning rate schedule
GPU/CPU utilization
All hyperparameters

Model Architecture

The DistilBERT architecture, with the classification head added for this task, consists of:

6 transformer layers (reduced from BERT's 12)
768 hidden dimensions
12 attention heads
3,072 hidden dimensions in feed-forward layers
Sequence classification head with 8 output units
Total parameters: 66 million (40% smaller than BERT)
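
The parameter total can be reproduced from the dimensions listed above. The sketch below is a back-of-the-envelope count; the vocabulary size (28,996, the cased BERT WordPiece vocabulary) and the two-layer classification head (a 768-to-768 pre-classifier followed by a 768-to-8 classifier) are assumptions not stated in this card.

```python
vocab_size = 28_996   # assumption: size of the cased BERT WordPiece vocabulary
max_positions = 512
hidden = 768
ffn = 3072
layers = 6
num_labels = 8

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Word embeddings, position embeddings, and the embedding LayerNorm.
embeddings = vocab_size * hidden + max_positions * hidden + layer_norm
per_layer = (
    4 * linear(hidden, hidden)   # query, key, value, and output projections
    + linear(hidden, ffn)        # feed-forward expansion
    + linear(ffn, hidden)        # feed-forward projection
    + 2 * layer_norm             # one LayerNorm after attention, one after FFN
)
# Assumed head: pre-classifier layer followed by the 8-way classifier.
head = linear(hidden, hidden) + linear(hidden, num_labels)

total = embeddings + layers * per_layer + head
print(f"{total:,} parameters, about {total * 4 / 1e6:.0f} MB in fp32")
```

Under these assumptions the count comes to roughly 65.8 million parameters, consistent with the approximate figures of 66 million parameters and 260 MB given in this card.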

Key differences from BERT-base, as reported in the DistilBERT paper:

40% size reduction
60% faster inference
97% of BERT's language understanding performance retained

Limitations and Biases

Known Limitations

Fixed Genre Set: The model can only predict the eight trained genres. Reviews describing other genres will be forced into one of these categories.
Domain Specificity: Trained exclusively on Goodreads reviews. Performance may degrade on other book review sources.
Language: The model is English-specific and is not expected to perform well on other languages.
Genre Overlap: Some genres have inherent overlap. The model may struggle to distinguish between similar categories.
Subjectivity: Genre classification is inherently subjective. Disagreement between human annotators would limit model performance.

Bias Considerations

The model's performance varies across genres, reflecting characteristics present in the training data. Goodreads reviews may not represent all reader populations equally. The model should not be used as the sole decision-maker for critical genre classification tasks without human oversight.

Recommendations

For production deployment:

Maintain human-in-the-loop review for critical applications
Monitor performance metrics over time
Regularly audit predictions for bias
Consider ensemble approaches for improved robustness
Implement confidence thresholding for uncertain predictions
Retrain periodically with new data
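
The confidence-thresholding recommendation could look roughly like the sketch below, which returns no label when the top softmax probability falls under a cutoff so the review can be routed to a human. The 0.7 threshold and the `predict_or_abstain` helper are hypothetical; a suitable cutoff would need to be tuned on held-out data.

```python
import math

GENRES = ["Poetry", "Children", "Mystery", "Romance",
          "Science Fiction", "Fantasy", "Horror", "Historical Fiction"]

def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_abstain(logits, labels=GENRES, threshold=0.7):
    """Return (label, confidence), or (None, confidence) when the model
    is not confident enough. Hypothetical helper for illustration."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    return (labels[best] if confidence >= threshold else None), confidence

# Hand-written logits standing in for model outputs.
confident = predict_or_abstain([0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0])
ambiguous = predict_or_abstain([0.0, 0.0, 2.0, 0.0, 0.0, 2.0, 0.0, 0.0])
print(confident)
print(ambiguous)
```

The logits passed in would come from `outputs.logits[i].tolist()` in the usage examples earlier in this card.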

Environmental Impact

Hardware: Tesla T4 GPU (Kaggle)

Training Duration: Approximately 30 minutes

GPU Utilization: Near-peak during training

Estimated Carbon: Minimal (single training run on shared infrastructure)

Inference: Low-resource (66M parameters, suitable for CPU or edge devices)

Evaluation Results

Overall Performance

Test Accuracy: 89.44%
Weighted F1: 89.43%
Loss: 0.3284

Per-Class Performance

Performance varies across genres due to data characteristics and inherent genre distinctions:

Strong performance on Mystery, Romance, and Science Fiction
Moderate performance on Fantasy and Historical Fiction
Lower performance on Poetry due to genre overlap with Literary Fiction characteristics

Evaluation Metrics

Calculated using scikit-learn:

accuracy_score()
f1_score(average='weighted')
classification_report()
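
To make the two headline metrics explicit, the sketch below implements accuracy and support-weighted F1 in plain Python on a toy example. It is meant to illustrate the definitions behind `accuracy_score()` and `f1_score(average='weighted')`, not to replace the scikit-learn calls used for the reported numbers; the labels are invented for the example.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support
    (the number of true examples of that class)."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)

# Toy labels for three classes (not taken from the real test set).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(f"accuracy: {accuracy(y_true, y_pred):.4f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")
```

Because the test set here is balanced at 200 reviews per genre, the weighted F1 coincides with the unweighted (macro) average of the per-class F1 scores.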

Resources and Links

Model Repository: https://huggingface.co/srajam696/distilbert-goodreads-genres

GitHub Repository: [Your GitHub link]

Kaggle Notebook: https://www.kaggle.com/code/omshivamnlr/mlops2/edit

Weights and Biases: https://wandb.ai/srajam696-charan/mlops-assignment2

Dataset: https://mengtingwan.github.io/data/goodreads.html

DistilBERT Paper: https://arxiv.org/abs/1910.01108

Hugging Face Documentation: https://huggingface.co/docs/transformers/

Citation

If you use this model in your research or applications, please cite:

@misc{distilbert_goodreads_genres_2024,
  author = {Srajam696},
  title = {DistilBERT Fine-Tuned for Goodreads Genre Classification},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/srajam696/distilbert-goodreads-genres}}
}

Dataset Citation

If using the UCSD Goodreads dataset, please cite:

@inproceedings{wan2018item,
  author = {Wan, Mengting and McAuley, Julian},
  title = {Item Recommendation on Monotonic Behavior Chains},
  booktitle = {Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18)},
  year = {2018},
  url = {https://mengtingwan.github.io/data/goodreads.html}
}

License

This model is released under the MIT License. See LICENSE file for details.

Acknowledgments

UCSD Goodreads Dataset by Mengting Wan and Julian McAuley
Hugging Face team for transformers library and model hub
Weights and Biases for experiment tracking
Kaggle for free GPU compute resources

Contact and Support

For questions, issues, or suggestions:

Check the model card and documentation
Review the GitHub repository
Consult the Kaggle notebook for implementation details
Access the W&B project for training metrics and logs

Last Updated: 2024

Model Version: 1.0
