DistilBERT Base Cased - Text Processing Model
This repository contains a Jupyter notebook demonstrating the use of DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers), for masked language modeling and text embedding generation.
Overview
DistilBERT is a distilled version of BERT that retains 97% of BERT's language-understanding performance while being 60% faster and 40% smaller. This project demonstrates both the cased and uncased variants of DistilBERT.
Features
- Fill-Mask Pipeline: Uses DistilBERT to predict masked tokens in sentences
- Word Embeddings: Generates contextual word embeddings for text processing
- GPU Support: Configured to run on CUDA-enabled GPUs for faster inference
- Easy Integration: Simple examples using Hugging Face Transformers library
Requirements
- Python 3.7+
- PyTorch
- Transformers library
- CUDA-compatible GPU (optional, but recommended)
Installation
Install the required dependencies:
pip install -U transformers
For GPU support, ensure you have PyTorch with CUDA installed:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
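Before running the notebook, you can verify that PyTorch actually sees the GPU; the following is a minimal check (not part of the notebook):

import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

When a GPU is available, the pipeline examples below can be placed on it by passing device=0 to pipeline(...), and the lower-level models by calling model.to("cuda").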
Usage
Fill-Mask Task
from transformers import pipeline
pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased")
result = pipe("Hello I'm a [MASK] model.")
for candidate in result:
    print(candidate)
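Each entry in result is a dictionary containing the predicted token and its probability, sorted from most to least likely, so the top prediction can be read out directly:

best = result[0]
print(f"{best['token_str']} ({best['score']:.2%})")
print(best["sequence"])  # the input sentence with [MASK] filled in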
Generating Word Embeddings
from transformers import DistilBertTokenizer, DistilBertModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# Access the token-level embeddings: shape (batch_size, sequence_length, hidden_size)
embeddings = output.last_hidden_state
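last_hidden_state holds one vector per token. If a single vector per sentence is needed (for example for the feature-extraction use cases listed below), one common approach is mean pooling over the non-padding tokens; a minimal sketch building on the variables above:

# Average token embeddings, masking out padding positions
mask = encoded_input["attention_mask"].unsqueeze(-1).float()   # (1, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])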
Direct Model Loading
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-cased")
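With the model and tokenizer loaded this way, the fill-mask prediction can be reproduced by hand, which is the starting point for custom implementations. A minimal sketch (the choice of five candidates is just for illustration):

import torch

text = "Hello I'm a [MASK] model."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, vocab_size)

# Find the [MASK] position and take the five most likely tokens there
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = logits[0, mask_index].softmax(dim=-1)    # (1, vocab_size)
top = probs.topk(5)
for token_id, score in zip(top.indices[0], top.values[0]):
    print(tokenizer.decode(token_id.item()), f"({score.item():.2%})")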
Notebook Contents
The Distilbert-base-cased.ipynb notebook includes:
- Installation: Setting up the Transformers library
- Pipeline Usage: High-level API for fill-mask tasks
- Direct Model Loading: Lower-level API for custom implementations
- Embedding Generation: Creating contextual word embeddings
- Token Visualization: Inspecting tokenization results (a short example follows this list)
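As a taste of the token-visualization step, the tokenizer output can be inspected directly; a minimal sketch (not copied verbatim from the notebook):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")

encoded = tokenizer("Hello I'm a [MASK] model.")
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # includes [CLS], [MASK], [SEP]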
Models Used
- distilbert-base-cased: DistilBERT model trained on cased English text
- distilbert-base-uncased: DistilBERT model trained on lowercased English text
Model pages:
- https://huggingface.co/distilbert/distilbert-base-cased
- https://huggingface.co/distilbert/distilbert-base-uncased
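The practical difference between the two checkpoints shows up at tokenization time: the cased tokenizer preserves capitalization, while the uncased one lowercases the input first. A quick comparison (illustrative only):

from transformers import AutoTokenizer

cased = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
uncased = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

print(cased.tokenize("Berlin is in Germany"))    # tokens keep their original casing
print(uncased.tokenize("Berlin is in Germany"))  # everything is lowercased first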
Example Output
When running the fill-mask task with "Hello I'm a [MASK] model.", the model predicts:
- fashion (15.75%)
- professional (6.04%)
- role (2.56%)
- celebrity (1.94%)
- model (1.73%)
Use Cases
- Text Classification: Sentiment analysis, topic classification (a fine-tuning sketch follows this list)
- Named Entity Recognition: Identifying entities in text
- Question Answering: Building QA systems
- Text Embeddings: Feature extraction for downstream tasks
- Language Understanding: Transfer learning for NLP tasks
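For tasks such as text classification, the same checkpoint can be loaded with a task-specific head and fine-tuned on labeled data. A minimal sketch of the setup (num_labels and the example sentence are placeholders, not part of the notebook):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-cased",
    num_labels=2,  # e.g. positive / negative sentiment
)

# The classification head is freshly initialized and still needs fine-tuning,
# e.g. with the Trainer API or a plain PyTorch training loop.
inputs = tokenizer("A very enjoyable read.", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, num_labels)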
Performance
DistilBERT offers an excellent trade-off between performance and efficiency:
- Speed: 60% faster than BERT
- Size: 40% smaller than BERT
- Performance: Retains 97% of BERT's capabilities
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Issues
If the code snippets do not work, please open an issue in this repository.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Hugging Face: For the Transformers library and pre-trained models
- DistilBERT Authors: Sanh et al. for the DistilBERT research and implementation
References
- Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.
- Hugging Face Transformers documentation: https://huggingface.co/docs/transformers
Contact
For questions or feedback, please open an issue in this repository.