Model Card for Fine-tuned DistilBERT on CoNLL-PP for Named Entity Recognition
Model Details
Model Description
This model is a fine-tuned version of distilbert-base-cased for Named Entity Recognition (NER) on the CoNLL-PP dataset. It labels spans of text as persons, organizations, locations, or miscellaneous entities.
- Developed by: shogun-the-great
- Model type: Token Classification (NER)
- Language(s): English
- License: Apache-2.0
- Finetuned from model: distilbert-base-cased
Model Sources
- Dataset: conllpp
Uses
Direct Use
This model can be used directly for Named Entity Recognition tasks to extract entities such as persons, organizations, and locations from text (a quick-start sketch follows this list). Typical use cases include:
- NER for document analysis.
- Text classification and entity extraction in information retrieval systems.
- Integration with chatbots to identify user-specific entities.
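For use cases like these, the transformers pipeline API is the quickest route. A minimal sketch; the aggregation_strategy choice and the example sentence are assumptions, not part of this model card:

from transformers import pipeline

# Group subword tokens back into whole entity spans
# ("simple" is one common aggregation choice)
ner = pipeline(
    "token-classification",
    model="shogun-the-great/finetuned-distilbert-connllp",
    aggregation_strategy="simple",
)
print(ner("Angela Merkel visited the Louvre in Paris."))
# -> list of dicts with entity_group, score, word, start, end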
Downstream Use
This model can be further fine-tuned for NER in specialized domains such as medical or legal text.
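A minimal fine-tuning sketch using the Trainer API, assuming a token-classification dataset with the usual tokens/ner_tags columns; the dataset name, split names, and hyperparameters below are placeholders:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

# Hypothetical domain dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("my-org/legal-ner")  # placeholder name
label_list = dataset["train"].features["ner_tags"].feature.names

model_name = "shogun-the-great/finetuned-distilbert-connllp"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list),
    ignore_mismatched_sizes=True,  # the new label set may differ from CoNLL-PP's
)

def tokenize_and_align(examples):
    # Re-tokenize pre-split words and align labels to subword tokens
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        prev, ids = None, []
        for wid in word_ids:
            if wid is None or wid == prev:
                ids.append(-100)  # ignore special tokens and trailing subwords
            else:
                ids.append(tags[wid])
            prev = wid
        labels.append(ids)
    tokenized["labels"] = labels
    return tokenized

tokenized = dataset.map(tokenize_and_align, batched=True)
trainer = Trainer(
    model=model,
    args=TrainingArguments("domain-ner", learning_rate=2e-5, num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()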
Out-of-Scope Use
This model might not perform well for:
- Non-English text.
- Extremely noisy or unstructured text where entities are not clearly defined.
Bias, Risks, and Limitations
Bias
The model’s predictions are influenced by the dataset used during fine-tuning. If the dataset contains biases, they may be reflected in the predictions.
Risks
- False positives: Incorrectly identified entities (e.g., non-person names identified as persons).
- False negatives: Important entities missed.
- Limited generalization to non-CoNLL-PP entity types or domains.
Recommendations
- Regularly update the model with new data for better generalization.
- Review and monitor predictions to ensure the accuracy of identified entities (one confidence-based approach is sketched below).
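One way to support the review step above is to surface per-token confidence scores and flag low-confidence predictions for human review. A minimal sketch; the 0.9 threshold and the example sentence are arbitrary assumptions:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "shogun-the-great/finetuned-distilbert-connllp"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

inputs = tokenizer("Acme Corp hired Jane Doe in Berlin.", return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
confidence, predictions = probs.max(dim=-1)

# Flag tokens whose top label falls below a review threshold
for token, tag_id, score in zip(inputs.tokens(), predictions[0], confidence[0]):
    tag = model.config.id2label[tag_id.item()]
    flag = "  <-- review" if score < 0.9 else ""
    print(f"{token}: {tag} ({score:.2f}){flag}")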
How to Get Started with the Model
You can load the fine-tuned model directly from the Hugging Face Hub:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the tokenizer and model from the Hugging Face Hub
model_name = "shogun-the-great/finetuned-distilbert-connllp"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example usage for NER
text = "Barack Obama was born in Hawaii."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label id for each token
predictions = outputs.logits.argmax(dim=-1)
tokens = inputs.tokens()

# Map label ids to entity tag names
predicted_tags = [model.config.id2label[p.item()] for p in predictions[0]]
for token, tag in zip(tokens, predicted_tags):
    print(f"{token}: {tag}")
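Note that these tags are assigned per WordPiece subword token, so a single word may be split into several pieces, each with its own tag. The pipeline sketch under Direct Use merges subwords back into entity-level spans via its aggregation strategy.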