DistilBERT Fine-Tuned on Goodreads Reviews for Genre Classification
Model Details
Model Name: distilbert-goodreads-genres
Base Model: DistilBERT (distilbert-base-cased)
Task: Multi-class Text Classification
Number of Classes: 8 genres
Language: English
Architecture: Transformer-based (DistilBERT)
Parameters: Approximately 66 million
Model Size: 260 MB (fp32)
Maximum Sequence Length: 512 tokens
Framework: PyTorch with Hugging Face Transformers
Training Platform: Kaggle Notebooks with GPU (Tesla T4)
Date Trained: 2024
Intended Use
This model is designed to classify English-language book reviews into one of eight predefined genres. It is intended for:
- Automated genre prediction for book reviews
- Content organization and categorization systems
- Research on literary genre characteristics
- Educational purposes in NLP and machine learning
Supported Genres
The model classifies reviews into these eight genres:
- Poetry
- Children
- Mystery
- Romance
- Science Fiction
- Fantasy
- Horror
- Historical Fiction
Model Performance
The model was evaluated on a test set of 1,600 reviews (200 per genre):
| Metric | Score |
|---|---|
| Accuracy | 89.44% |
| F1 Score (Weighted) | 89.43% |
| Evaluation Loss | 0.3284 |
Dataset: UCSD Goodreads Reviews Dataset
Training Data: 6,400 reviews (800 per genre)
Test Data: 1,600 reviews (200 per genre)
How to Use
Quick Start with Pipeline
from transformers import pipeline
classifier = pipeline("text-classification", model="srajam696/distilbert-goodreads-genres")
review = "This book was absolutely captivating from start to finish. The mystery kept me guessing until the very end."
result = classifier(review)
print(result)
# Output: [{'label': 'LABEL_2', 'score': 0.9876}]
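The pipeline output above uses generic `LABEL_n` names rather than genre names. A minimal sketch of translating them, assuming the label index follows the genre mapping given in this card (0 = Poetry through 7 = Historical Fiction); `GENRES`, `to_genre`, and `relabel` are illustrative names, not part of the model repository:

```python
# Genre order taken from the mapping in this model card.
GENRES = [
    "Poetry", "Children", "Mystery", "Romance",
    "Science Fiction", "Fantasy", "Horror", "Historical Fiction",
]

def to_genre(label: str) -> str:
    """Convert a pipeline label such as 'LABEL_2' into a genre name."""
    index = int(label.rsplit("_", 1)[1])
    return GENRES[index]

def relabel(results):
    """Replace the 'label' field of each pipeline result with a genre name."""
    return [{"label": to_genre(r["label"]), "score": r["score"]} for r in results]

# Stand-in for the list returned by classifier(review)
example = [{"label": "LABEL_2", "score": 0.9876}]
print(relabel(example))
# [{'label': 'Mystery', 'score': 0.9876}]
```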
Using Model and Tokenizer Directly
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

model_name = "srajam696/distilbert-goodreads-genres"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)

# Genre mapping (class index -> genre name)
genres = {
    0: "Poetry",
    1: "Children",
    2: "Mystery",
    3: "Romance",
    4: "Science Fiction",
    5: "Fantasy",
    6: "Horror",
    7: "Historical Fiction"
}

review = "A truly magical world filled with wonder and adventure."

# Tokenize the review, truncating to the model's 512-token limit
inputs = tokenizer(
    review,
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt"
)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predicted_id = logits.argmax(-1).item()
print(f"Predicted Genre: {genres[predicted_id]}")
print(f"Confidence: {torch.softmax(logits, dim=-1).max().item():.4f}")
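When genres overlap, the runner-up classes can be as informative as the top prediction. The sketch below ranks all eight genres by softmax probability in plain Python; it assumes the logits for one review have been converted to a list (for example with `outputs.logits[0].tolist()`), and the logit values shown are made up for illustration.

```python
import math

GENRES = [
    "Poetry", "Children", "Mystery", "Romance",
    "Science Fiction", "Fantasy", "Horror", "Historical Fiction",
]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, k=3):
    """Return the k most probable (genre, probability) pairs, best first."""
    ranked = sorted(zip(GENRES, softmax(logits)), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits for a single review (not real model output)
example_logits = [0.1, -0.3, 0.4, 0.2, 1.1, 3.2, 0.5, 0.9]
for genre, prob in top_k(example_logits):
    print(f"{genre}: {prob:.3f}")
```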
Batch Processing
# Reuses tokenizer, model, and genres from the previous example
reviews = [
    "A mysterious tale that kept me on the edge of my seat.",
    "The perfect love story for a rainy afternoon.",
    "Futuristic technology and mind-bending concepts."
]

# Pad every review to the longest one in the batch
inputs = tokenizer(
    reviews,
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)

predictions = outputs.logits.argmax(-1)
for review, pred in zip(reviews, predictions):
    print(f"Review: {review[:60]}... -> {genres[pred.item()]}")
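Tokenizing a very large list in one call pads every review to the longest one and can exhaust memory. A minimal sketch of splitting the work into fixed-size batches; `chunks` and `classify_in_batches` are hypothetical helpers, `predict_batch` stands for any function that wraps the tokenizer and model calls shown above, and the batch size of 32 is an arbitrary default.

```python
def chunks(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def classify_in_batches(reviews, predict_batch, batch_size=32):
    """Apply predict_batch to each slice and concatenate the results in order."""
    predictions = []
    for batch in chunks(reviews, batch_size):
        predictions.extend(predict_batch(batch))
    return predictions

# Stand-in predictor so the sketch runs without downloading the model:
# it returns the character count of each review instead of a genre.
fake_predict = lambda batch: [len(text) for text in batch]

reviews = [f"review number {i}" for i in range(70)]
print([len(batch) for batch in chunks(reviews, 32)])  # [32, 32, 6]
print(len(classify_in_batches(reviews, fake_predict)))  # 70
```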
Training Details
Dataset
Source: UCSD Goodreads Reviews Dataset by Mengting Wan and Julian McAuley
Link: https://mengtingwan.github.io/data/goodreads.html
Data Preparation:
- Downloaded from remote servers using streaming decompression
- Randomly sampled 1,000 reviews per genre
- Split: 800 training (80%) and 200 test (20%) per genre
- Total: 6,400 training samples and 1,600 test samples
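The preparation steps above can be sketched as follows. This is not the author's notebook code: it assumes each genre file is gzip-compressed JSON lines with a `review_text` field, uses reservoir sampling as one way to draw 1,000 reviews without holding a whole file in memory, and picks an arbitrary seed. All function names are hypothetical.

```python
import gzip
import io
import json
import random

def stream_reviews(fileobj):
    """Yield one review dict per line from a gzip-compressed JSON-lines stream."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)

def reservoir_sample(items, k, rng):
    """Draw k items uniformly at random from an iterable of unknown length."""
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

def sample_and_split(fileobj, n_samples=1000, n_train=800, seed=42):
    """Sample n_samples reviews for one genre, then split into train and test."""
    rng = random.Random(seed)
    sample = reservoir_sample(stream_reviews(fileobj), n_samples, rng)
    rng.shuffle(sample)
    return sample[:n_train], sample[n_train:]

# Synthetic in-memory "genre file" so the sketch runs without a download
raw = "\n".join(json.dumps({"review_text": f"review {i}"}) for i in range(5000))
fake_file = io.BytesIO(gzip.compress(raw.encode("utf-8")))

train, test = sample_and_split(fake_file)
print(len(train), len(test))  # 800 200
```

Repeating this for each of the eight genres gives the 6,400 training and 1,600 test reviews quoted above.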
Training Configuration
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
    learning_rate=3e-5,
    logging_steps=50,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    run_name="distilbert-run-1"
)
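The configuration above implies a concrete step count and learning-rate curve. The sketch below works them out, assuming a single GPU, no gradient accumulation, and the Trainer's default linear schedule with warmup (`lr_scheduler_type="linear"`); none of these are stated explicitly in the card, and `linear_lr` is an illustrative name.

```python
import math

TRAIN_SAMPLES = 6400   # from the dataset section of this card
BATCH_SIZE = 16        # per_device_train_batch_size
EPOCHS = 3
WARMUP_STEPS = 100
PEAK_LR = 3e-5

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

def linear_lr(step):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = total_steps - step
    return PEAK_LR * max(0.0, remaining / (total_steps - WARMUP_STEPS))

print(steps_per_epoch, total_steps)  # 400 1200
for step in (0, 50, 100, 650, 1200):
    print(step, f"{linear_lr(step):.2e}")
```

Under these assumptions warmup covers the first 100 of 1,200 steps (about 8% of training), and `logging_steps=50` yields 24 training-loss points per run.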
Tokenization
- Tokenizer: DistilBertTokenizerFast
- Model: distilbert-base-cased
- Truncation: Enabled (max 512 tokens)
- Padding: Enabled
- Special tokens: Preserved
Training Setup
- Framework: PyTorch with Hugging Face Transformers
- Optimizer: AdamW
- Loss Function: Cross-Entropy Loss
- Metrics: Accuracy and Weighted F1 Score
- Platform: Kaggle Notebooks
- GPU: Tesla T4
- Training Time: Approximately 30 minutes for 3 epochs
Experiment Tracking
All training runs were tracked using Weights and Biases (W&B):
Project: mlops-assignment2
Dashboard: https://wandb.ai/srajam696-charan/mlops-assignment2
Tracked Metrics:
- Training loss (every 50 steps)
- Validation loss (per epoch)
- Accuracy (per epoch)
- F1 score (per epoch)
- Learning rate schedule
- GPU/CPU utilization
- All hyperparameters
Model Architecture
DistilBERT architecture consists of:
- 6 transformer layers (reduced from BERT's 12)
- 768 hidden dimensions
- 12 attention heads
- 3,072 hidden dimensions in feed-forward layers
- Sequence classification head with 8 output units
- Total parameters: approximately 66 million (about 40% smaller than BERT-base)
Key improvements over BERT:
- 40% size reduction
- 60% faster inference
- About 97% of BERT's language-understanding performance retained, as reported in the DistilBERT paper
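The architecture figures above are enough to reproduce the parameter count by hand. The sketch below assumes the 28,996-token vocabulary of distilbert-base-cased and the standard DistilBertForSequenceClassification head (a 768x768 pre-classifier layer followed by the 8-way classifier); the constant and helper names are illustrative.

```python
VOCAB_SIZE = 28_996    # assumption: distilbert-base-cased vocabulary size
MAX_POSITIONS = 512    # maximum sequence length
HIDDEN = 768
FFN = 3_072
LAYERS = 6
NUM_LABELS = 8

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def layer_norm(dim):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim

# Token and position embeddings plus their LayerNorm (DistilBERT has no segment embeddings)
embeddings = VOCAB_SIZE * HIDDEN + MAX_POSITIONS * HIDDEN + layer_norm(HIDDEN)

# Each transformer layer: Q, K, V, and output projections, two feed-forward layers, two LayerNorms
attention = 4 * linear(HIDDEN, HIDDEN)
feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN)
per_layer = attention + feed_forward + 2 * layer_norm(HIDDEN)

encoder = embeddings + LAYERS * per_layer
head = linear(HIDDEN, HIDDEN) + linear(HIDDEN, NUM_LABELS)
total = encoder + head

print(f"{total:,}")  # 65,787,656
```

Under these assumptions the total is about 65.8 million, consistent with the rounded 66 million figure. The number of attention heads does not change the count, and the token embedding matrix alone accounts for roughly a third of the weights.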
Limitations and Biases
Known Limitations
Fixed Genre Set: The model can only predict the eight trained genres. Reviews describing other genres will be forced into one of these categories.
Domain Specificity: Trained exclusively on Goodreads reviews. Performance may degrade on other book review sources.
Language: The model was trained on English text only and may not perform well on other languages.
Genre Overlap: Some genres have inherent overlap. The model may struggle to distinguish between similar categories.
Subjectivity: Genre classification is inherently subjective. Where human readers would disagree about a book's genre, the model's predictions cannot be expected to be more reliable than the labels it was trained on.
Bias Considerations
The model's performance varies across genres, reflecting characteristics present in the training data. Goodreads reviews may not represent all reader populations equally. The model should not be used as the sole decision-maker for critical genre classification tasks without human oversight.
Recommendations
For production deployment:
- Maintain human-in-the-loop review for critical applications
- Monitor performance metrics over time
- Regularly audit predictions for bias
- Consider ensemble approaches for improved robustness
- Implement confidence thresholding for uncertain predictions
- Retrain periodically with new data
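The confidence-thresholding recommendation above could look like the following sketch. The 0.7 cut-off is a hypothetical value that would need tuning on held-out data, `predict_with_threshold` is an illustrative name, and the probability vectors are made up rather than taken from the model.

```python
GENRES = [
    "Poetry", "Children", "Mystery", "Romance",
    "Science Fiction", "Fantasy", "Horror", "Historical Fiction",
]

def predict_with_threshold(probabilities, labels=GENRES, threshold=0.7):
    """Return (label, confidence), or (None, confidence) when the top
    probability falls below the threshold and a human should review it."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    confidence = probabilities[best]
    if confidence < threshold:
        return None, confidence
    return labels[best], confidence

# Hypothetical softmax outputs for two reviews
confident = [0.01, 0.01, 0.90, 0.02, 0.02, 0.02, 0.01, 0.01]
uncertain = [0.05, 0.05, 0.10, 0.05, 0.30, 0.35, 0.05, 0.05]

print(predict_with_threshold(confident))  # ('Mystery', 0.9)
print(predict_with_threshold(uncertain))  # (None, 0.35)
```

Returning `None` instead of a forced guess lets a downstream system route low-confidence reviews to the human-in-the-loop step.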
Environmental Impact
Hardware: Tesla T4 GPU (Kaggle)
Training Duration: Approximately 30 minutes
GPU Utilization: Near-peak during training
Estimated Carbon: Not measured; expected to be small (a single run of roughly 30 minutes on shared infrastructure)
Inference: Low-resource (66M parameters, suitable for CPU or edge devices)
Evaluation Results
Overall Performance
- Test Accuracy: 89.44%
- Weighted F1: 89.43%
- Loss: 0.3284
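For context, the reported figures can be turned into counts and compared with a trivial baseline. The sketch below uses only numbers from this card (1,600 balanced test reviews, 8 classes) and assumes the loss is natural-log cross-entropy, which is the PyTorch default.

```python
import math

TEST_SIZE = 1600
NUM_CLASSES = 8
REPORTED_ACCURACY = 0.8944

correct = round(REPORTED_ACCURACY * TEST_SIZE)  # reviews classified correctly
misclassified = TEST_SIZE - correct
chance_accuracy = 1 / NUM_CLASSES               # random guessing on a balanced test set
uniform_loss = math.log(NUM_CLASSES)            # loss of a uniform 1/8 prediction

print(correct, misclassified)                   # 1431 169
print(f"{correct / TEST_SIZE:.4%}")             # 89.4375%
print(f"{chance_accuracy:.1%}, {uniform_loss:.4f}")  # 12.5%, 2.0794
```

So 89.44% corresponds to 1,431 of 1,600 reviews, against 12.5% for random guessing, and the evaluation loss of 0.3284 sits well below the 2.0794 of an uninformed uniform prediction.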
Per-Class Performance
Performance varies across genres due to data characteristics and inherent genre distinctions:
- Strong performance on Mystery, Romance, and Science Fiction
- Moderate performance on Fantasy and Historical Fiction
- Lower performance on Poetry, attributed to stylistic overlap with literary fiction, which is not one of the eight target classes
Evaluation Metrics
Calculated using scikit-learn:
- accuracy_score()
- f1_score(average='weighted')
- classification_report()
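To make the weighted F1 definition concrete, here is a dependency-free sketch of what `accuracy_score()` and `f1_score(average='weighted')` compute: per-class F1 averaged with each class weighted by its number of true examples. It is an illustration on toy labels, not a replacement for scikit-learn, and the function names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)

# Toy example with three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(f"{accuracy(y_true, y_pred):.4f}")     # 0.6667
print(f"{weighted_f1(y_true, y_pred):.4f}")  # 0.6556
```

On a balanced test set such as the one described in this card, every class has the same support, so weighted F1 coincides with macro F1.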
Resources and Links
Model Repository: https://huggingface.co/srajam696/distilbert-goodreads-genres
GitHub Repository: [Your GitHub link]
Kaggle Notebook: https://www.kaggle.com/code/omshivamnlr/mlops2/edit
Weights and Biases: https://wandb.ai/srajam696-charan/mlops-assignment2
Dataset: https://mengtingwan.github.io/data/goodreads.html
DistilBERT Paper: https://arxiv.org/abs/1910.01108
Hugging Face Documentation: https://huggingface.co/docs/transformers/
Citation
If you use this model in your research or applications, please cite:
@misc{distilbert_goodreads_genres_2024,
  author = {Srajam696},
  title = {DistilBERT Fine-Tuned for Goodreads Genre Classification},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/srajam696/distilbert-goodreads-genres}},
}
Dataset Citation
If using the UCSD Goodreads dataset, please cite:
@inproceedings{wan2018goodreads,
  author = {Wan, Mengting and McAuley, Julian},
  title = {Item Recommendation on Monotonic Behavior Chains},
  booktitle = {Proceedings of the 12th ACM Conference on Recommender Systems},
  year = {2018},
  url = {https://mengtingwan.github.io/data/goodreads.html}
}
License
This model is released under the MIT License. See the LICENSE file for details.
Acknowledgments
- UCSD Goodreads Dataset by Mengting Wan and Julian McAuley
- Hugging Face team for transformers library and model hub
- Weights and Biases for experiment tracking
- Kaggle for free GPU compute resources
Contact and Support
For questions, issues, or suggestions:
- Check the model card and documentation
- Review the GitHub repository
- Consult the Kaggle notebook for implementation details
- Access the W&B project for training metrics and logs
Last Updated: 2024
Model Version: 1.0