
DistilBERT Base Cased - Text Processing Model

This repository contains a Jupyter notebook demonstrating the use of DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers), for masked language modeling and text embedding generation.

Overview

DistilBERT is a smaller, faster, and lighter version of BERT that retains 97% of BERT's language understanding while being 60% faster and 40% smaller in size. This project demonstrates both the cased and uncased variants of DistilBERT.

Features

  • Fill-Mask Pipeline: Uses DistilBERT to predict masked tokens in sentences
  • Word Embeddings: Generates contextual word embeddings for text processing
  • GPU Support: Configured to run on CUDA-enabled GPUs for faster inference
  • Easy Integration: Simple examples using Hugging Face Transformers library

Requirements

  • Python 3.7+
  • PyTorch
  • Transformers library
  • CUDA-compatible GPU (optional, but recommended)

Installation

Install the required dependencies:

pip install -U transformers

For GPU support, ensure you have PyTorch with CUDA installed:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
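
To confirm that PyTorch can actually see a GPU before running the notebook, a quick sanity check looks like this (optional; the models also run on CPU):

import torch

print(torch.__version__)
print(torch.cuda.is_available())      # True means a CUDA-capable GPU is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))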

Usage

Fill-Mask Task

from transformers import pipeline

# High-level fill-mask pipeline using the cased DistilBERT checkpoint
pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased")
result = pipe("Hello I'm a [MASK] model.")

# Each candidate is a dict with the predicted token, its score, and the completed sentence
for candidate in result:
    print(candidate)
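
As a variation on the snippet above (not part of the original notebook), the pipeline can be moved to a GPU via the device argument, and top_k controls how many candidates are returned:

from transformers import pipeline

# device=0 selects the first CUDA GPU; omit it (or pass device=-1) to stay on the CPU
pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased", device=0)

# top_k limits the number of candidates returned for the masked position
for candidate in pipe("Hello I'm a [MASK] model.", top_k=5):
    print(candidate["token_str"], round(candidate["score"], 4))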

Generating Word Embeddings

from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)

# Access the embeddings: one 768-dimensional vector per input token
embeddings = output.last_hidden_state
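
last_hidden_state has shape (batch_size, sequence_length, 768). If a single vector per sentence is needed, one common approach is mean pooling over the non-padding tokens. A minimal sketch (an illustration, not part of the notebook), continuing from the variables above:

import torch

# Average the token embeddings, ignoring padding positions via the attention mask
mask = encoded_input["attention_mask"].unsqueeze(-1)   # (batch, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
print(sentence_embedding.shape)                        # torch.Size([1, 768])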

Direct Model Loading

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-cased")
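
With the model loaded directly, you can score the masked position yourself instead of going through the pipeline. A minimal sketch, reusing the tokenizer and model just loaded:

import torch

inputs = tokenizer("Hello I'm a [MASK] model.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                    # (batch, seq_len, vocab_size)

# Find the [MASK] position and take the most likely token there
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))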

Notebook Contents

The Distilbert-base-cased.ipynb notebook includes:

  1. Installation: Setting up the Transformers library
  2. Pipeline Usage: High-level API for fill-mask tasks
  3. Direct Model Loading: Lower-level API for custom implementations
  4. Embedding Generation: Creating contextual word embeddings
  5. Token Visualization: Inspecting tokenization results (a short example follows this list)
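
For the token-inspection step, a minimal sketch of what the notebook does (the exact cells may differ):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")

encoded = tokenizer("Hello I'm a [MASK] model.")
print(encoded["input_ids"])                                   # token IDs, including [CLS] and [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the corresponding subword tokens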

Models Used

  • distilbert-base-cased: DistilBERT model trained on cased English text
  • distilbert-base-uncased: DistilBERT model trained on lowercased English text

Model pages:

  • https://huggingface.co/distilbert/distilbert-base-cased
  • https://huggingface.co/distilbert/distilbert-base-uncased
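
In practice, the difference between the two checkpoints shows up in tokenization: the cased model preserves capitalization, while the uncased model lowercases input first. A small comparison (the example sentence is arbitrary):

from transformers import AutoTokenizer

cased = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
uncased = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

text = "Paris is the capital of France."
print(cased.tokenize(text))     # keeps 'Paris' and 'France' capitalized
print(uncased.tokenize(text))   # lowercases everything before splitting into subwords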

Example Output

When running the fill-mask task on "Hello I'm a [MASK] model.", the top predictions are:

  1. fashion (15.75%)
  2. professional (6.04%)
  3. role (2.56%)
  4. celebrity (1.94%)
  5. model (1.73%)

Use Cases

  • Text Classification: Sentiment analysis, topic classification
  • Named Entity Recognition: Identifying entities in text
  • Question Answering: Building QA systems
  • Text Embeddings: Feature extraction for downstream tasks
  • Language Understanding: Transfer learning for NLP tasks
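
Most of these use cases involve fine-tuning DistilBERT with a task-specific head. As a rough starting point (num_labels and the placeholder sentence below are illustrative, not part of this repository), loading the model for sequence classification looks like this:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Adds a randomly initialized classification head on top of DistilBERT;
# the model must be fine-tuned on labeled data before its predictions are meaningful.
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-cased", num_labels=2
)

inputs = tokenizer("A placeholder sentence to classify.", return_tensors="pt")
logits = model(**inputs).logits    # shape (1, num_labels)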

Performance

DistilBERT offers an excellent trade-off between performance and efficiency:

  • Speed: 60% faster than BERT
  • Size: 40% smaller than BERT
  • Performance: Retains about 97% of BERT's language-understanding performance (as measured on the GLUE benchmark)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Issues

If the code snippets do not work, please open an issue in this repository.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Hugging Face: For the Transformers library and pre-trained models
  • DistilBERT Authors: Sanh et al. for the DistilBERT research and implementation

References

  • Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.

Contact

For questions or feedback, please open an issue in this repository.