DistilBERT Base Cased - Text Processing Model
This repository contains a Jupyter notebook demonstrating the use of DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers), for masked language modeling and text embedding generation.
Overview
DistilBERT is a distilled version of BERT that retains 97% of BERT's language-understanding performance while being 60% faster and 40% smaller. This project demonstrates both the cased and uncased variants of DistilBERT.
Features
- Fill-Mask Pipeline: Uses DistilBERT to predict masked tokens in sentences
- Word Embeddings: Generates contextual word embeddings for text processing
- GPU Support: Configured to run on CUDA-enabled GPUs for faster inference
- Easy Integration: Simple examples using Hugging Face Transformers library
Requirements
- Python 3.7+
- PyTorch
- Transformers library
- CUDA-compatible GPU (optional, but recommended)
Installation
Install the required dependencies:
pip install -U transformers
For GPU support, ensure you have PyTorch with CUDA installed:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
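Before running the notebook, you can verify that PyTorch actually sees the GPU; the following is a minimal check (not part of the notebook):

import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

When a GPU is available, the pipeline examples below can be placed on it by passing device=0 to pipeline(...), and the lower-level models by calling model.to("cuda").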
Usage
Fill-Mask Task
from transformers import pipeline
pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased")
result = pipe("Hello I'm a [MASK] model.")
for candidate in result:
    print(candidate)
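Each entry in result is a dictionary containing the predicted token and its probability, sorted from most to least likely, so the top prediction can be read out directly:

best = result[0]
print(f"{best['token_str']} ({best['score']:.2%})")
print(best["sequence"])  # the input sentence with [MASK] filled in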
Generating Word Embeddings
from transformers import DistilBertTokenizer, DistilBertModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# Access the token-level embeddings: shape (batch_size, sequence_length, hidden_size)
embeddings = output.last_hidden_state
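last_hidden_state holds one vector per token. If a single vector per sentence is needed (for example for the feature-extraction use cases listed below), one common approach is mean pooling over the non-padding tokens; a minimal sketch building on the variables above:

# Average token embeddings, masking out padding positions
mask = encoded_input["attention_mask"].unsqueeze(-1).float()   # (1, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])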
Direct Model Loading
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-cased")
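With the model and tokenizer loaded this way, the fill-mask prediction can be reproduced by hand, which is the starting point for custom implementations. A minimal sketch (the choice of five candidates is just for illustration):

import torch

text = "Hello I'm a [MASK] model."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, vocab_size)

# Find the [MASK] position and take the five most likely tokens there
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = logits[0, mask_index].softmax(dim=-1)    # (1, vocab_size)
top = probs.topk(5)
for token_id, score in zip(top.indices[0], top.values[0]):
    print(tokenizer.decode(token_id.item()), f"({score.item():.2%})")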
Notebook Contents
The Distilbert-base-cased.ipynb notebook includes:
- Installation: Setting up the Transformers library
- Pipeline Usage: High-level API for fill-mask tasks
- Direct Model Loading: Lower-level API for custom implementations
- Embedding Generation: Creating contextual word embeddings
- Token Visualization: Inspecting tokenization results (a short example follows this list)
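As a taste of the token-visualization step, the tokenizer output can be inspected directly; a minimal sketch (not copied verbatim from the notebook):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")

encoded = tokenizer("Hello I'm a [MASK] model.")
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # includes [CLS], [MASK], [SEP]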
Models Used
- distilbert-base-cased: DistilBERT model trained on cased English text
- distilbert-base-uncased: DistilBERT model trained on lowercased English text
Model pages:
- https://huggingface.co/distilbert/distilbert-base-cased
- https://huggingface.co/distilbert/distilbert-base-uncased
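The practical difference between the two checkpoints shows up at tokenization time: the cased tokenizer preserves capitalization, while the uncased one lowercases the input first. A quick comparison (illustrative only):

from transformers import AutoTokenizer

cased = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
uncased = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

print(cased.tokenize("Berlin is in Germany"))    # tokens keep their original casing
print(uncased.tokenize("Berlin is in Germany"))  # everything is lowercased first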
Example Output
When running the fill-mask task with "Hello I'm a [MASK] model.", the model predicts:
- fashion (15.75%)
- professional (6.04%)
- role (2.56%)
- celebrity (1.94%)
- model (1.73%)
Use Cases
- Text Classification: Sentiment analysis, topic classification (a fine-tuning sketch follows this list)
- Named Entity Recognition: Identifying entities in text
- Question Answering: Building QA systems
- Text Embeddings: Feature extraction for downstream tasks
- Language Understanding: Transfer learning for NLP tasks
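For tasks such as text classification, the same checkpoint can be loaded with a task-specific head and fine-tuned on labeled data. A minimal sketch of the setup (num_labels and the example sentence are placeholders, not part of the notebook):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-cased",
    num_labels=2,  # e.g. positive / negative sentiment
)

# The classification head is freshly initialized and still needs fine-tuning,
# e.g. with the Trainer API or a plain PyTorch training loop.
inputs = tokenizer("A very enjoyable read.", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, num_labels)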
Performance
DistilBERT offers an excellent trade-off between performance and efficiency:
- Speed: 60% faster than BERT
- Size: 40% smaller than BERT
- Performance: Retains 97% of BERT's capabilities
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Issues
If the code snippets do not work, please open an issue in this repository.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Hugging Face: For the Transformers library and pre-trained models
- DistilBERT Authors: Sanh et al. for the DistilBERT research and implementation
References
- Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.
- Hugging Face Transformers documentation: https://huggingface.co/docs/transformers
Contact
For questions or feedback, please open an issue in this repository.