Model Card for Fine-tuned DistilBERT on CoNLL-PP for Named Entity Recognition
Model Details
Model Description
This model is a fine-tuned version of distilbert-base-cased for Named Entity Recognition (NER) on the CoNLL-PP dataset. It labels spans of text as persons, organizations, locations, or miscellaneous entities.
- Developed by: shogun-the-great
- Model type: Token Classification (NER)
- Language(s): English
- License: Apache-2.0
- Finetuned from model: distilbert-base-cased
Model Sources
- Dataset: conllpp
Uses
Direct Use
This model can be used directly for Named Entity Recognition tasks to extract entities such as persons, organizations, and locations from text (a quick-start sketch follows this list). Typical use cases include:
- NER for document analysis.
- Text classification and entity extraction in information retrieval systems.
- Integration with chatbots to identify user-specific entities.
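For use cases like these, the transformers pipeline API is the quickest route. A minimal sketch; the aggregation_strategy choice and the example sentence are assumptions, not part of this model card:

from transformers import pipeline

# Group subword tokens back into whole entity spans
# ("simple" is one common aggregation choice)
ner = pipeline(
    "token-classification",
    model="shogun-the-great/finetuned-distilbert-connllp",
    aggregation_strategy="simple",
)
print(ner("Angela Merkel visited the Louvre in Paris."))
# -> list of dicts with entity_group, score, word, start, end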
Downstream Use
This model can be further fine-tuned for NER in specialized domains such as medical or legal text.
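A minimal fine-tuning sketch using the Trainer API, assuming a token-classification dataset with the usual tokens/ner_tags columns; the dataset name, split names, and hyperparameters below are placeholders:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

# Hypothetical domain dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("my-org/legal-ner")  # placeholder name
label_list = dataset["train"].features["ner_tags"].feature.names

model_name = "shogun-the-great/finetuned-distilbert-connllp"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list),
    ignore_mismatched_sizes=True,  # the new label set may differ from CoNLL-PP's
)

def tokenize_and_align(examples):
    # Re-tokenize pre-split words and align labels to subword tokens
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        prev, ids = None, []
        for wid in word_ids:
            if wid is None or wid == prev:
                ids.append(-100)  # ignore special tokens and trailing subwords
            else:
                ids.append(tags[wid])
            prev = wid
        labels.append(ids)
    tokenized["labels"] = labels
    return tokenized

tokenized = dataset.map(tokenize_and_align, batched=True)
trainer = Trainer(
    model=model,
    args=TrainingArguments("domain-ner", learning_rate=2e-5, num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()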
Out-of-Scope Use
This model might not perform well for:
- Non-English text.
- Extremely noisy or unstructured text where entities are not clearly defined.
Bias, Risks, and Limitations
Bias
The model’s predictions are influenced by the dataset used during fine-tuning. If the dataset contains biases, they may be reflected in the predictions.
Risks
- False positives: Incorrectly identified entities (e.g., non-person names identified as persons).
- False negatives: Important entities missed.
- Limited generalization to non-CoNLL-PP entity types or domains.
Recommendations
- Regularly update the model with new data for better generalization.
- Review and monitor predictions to ensure the accuracy of identified entities (one confidence-based approach is sketched below).
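One way to support the review step above is to surface per-token confidence scores and flag low-confidence predictions for human review. A minimal sketch; the 0.9 threshold and the example sentence are arbitrary assumptions:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "shogun-the-great/finetuned-distilbert-connllp"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

inputs = tokenizer("Acme Corp hired Jane Doe in Berlin.", return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
confidence, predictions = probs.max(dim=-1)

# Flag tokens whose top label falls below a review threshold
for token, tag_id, score in zip(inputs.tokens(), predictions[0], confidence[0]):
    tag = model.config.id2label[tag_id.item()]
    flag = "  <-- review" if score < 0.9 else ""
    print(f"{token}: {tag} ({score:.2f}){flag}")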
How to Get Started with the Model
You can load the fine-tuned model directly from the Hugging Face Hub:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the tokenizer and model from the Hugging Face Hub
model_name = "shogun-the-great/finetuned-distilbert-connllp"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example usage for NER
text = "Barack Obama was born in Hawaii."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label id for each token
predictions = outputs.logits.argmax(dim=-1)
tokens = inputs.tokens()

# Map label ids to entity tag names
predicted_tags = [model.config.id2label[p.item()] for p in predictions[0]]
for token, tag in zip(tokens, predicted_tags):
    print(f"{token}: {tag}")
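Note that these tags are assigned per WordPiece subword token, so a single word may be split into several pieces, each with its own tag. The pipeline sketch under Direct Use merges subwords back into entity-level spans via its aggregation strategy.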