# Extended BERT-base-NER

## Model Description
Extended BERT-base-NER is a fine-tuned BERT model that extends the original bert-base-NER with 10 additional entity types for comprehensive named entity recognition.
## Entity Types (14 total)

**Original (4):**
- PER (Person) - Names of people
- ORG (Organization) - Company names, institutions
- LOC (Location) - Places, cities, countries
- MISC (Miscellaneous) - Other named entities
**New (10):**
- MED (Medicine) - Medicine names, drug names
- ZIP (Zip Code) - Postal codes, ZIP codes
- COUNTRY_CODE - Country codes (US, UK, CA, etc.)
- STATE - States, provinces, regions
- ETHNICITY - Ethnic groups, cultural backgrounds
- RACE - Racial categories
- CONTINENT - Continents (North America, Europe, etc.)
- TERRITORY - Territories, dependencies
- PHONE - Phone numbers
- EMAIL - Email addresses
## Usage

### Using Transformers Pipeline
```python
from transformers import pipeline

# Load the model
nlp = pipeline("ner", model="BikashML/extended-bert-base-ner", aggregation_strategy="simple")

# Example text
text = "Dr. Maria Garcia prescribed Aspirin for the patient from California, USA. Contact her at maria.garcia@hospital.com or call 555-123-4567."

# Get predictions
results = nlp(text)

# Print results
for entity in results:
    print(f"{entity['word']} -> {entity['entity_group']} (confidence: {entity['score']:.3f})")
```
### Expected Output

```
Dr. Maria Garcia -> PER (confidence: 0.660)
Aspirin -> MED (confidence: 0.401)
California -> LOC (confidence: 0.261)
USA -> STATE (confidence: 0.372)
maria.garcia@hospital.com -> EMAIL (confidence: 0.700)
555-123-4567 -> PHONE (confidence: 0.713)
```
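Several of the confidence scores in the expected output are fairly low (e.g. 0.261 for California), so filtering predictions by score can be useful downstream. The sketch below assumes the standard pipeline output format; the sample results and the 0.4 threshold are illustrative, not values prescribed by this model:

```python
# Illustrative pipeline results (same dict format the NER pipeline returns
# with aggregation_strategy="simple"); scores mirror the expected output above.
results = [
    {"word": "Aspirin", "entity_group": "MED", "score": 0.401},
    {"word": "maria.garcia@hospital.com", "entity_group": "EMAIL", "score": 0.700},
    {"word": "California", "entity_group": "LOC", "score": 0.261},
]

# Drop low-confidence entities; the threshold is an assumption to tune per use case.
THRESHOLD = 0.4
kept = [e for e in results if e["score"] >= THRESHOLD]

for e in kept:
    print(f"{e['word']} -> {e['entity_group']}")
```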
## Model Architecture
- Base Model: bert-base-cased
- Architecture: BertForTokenClassification
- Parameters: 110M
- Total Labels: 29 (BIO tagging scheme)
- Max Sequence Length: 512 tokens
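The 29-label count follows directly from the BIO scheme: each of the 14 entity types gets a B- (begin) and an I- (inside) tag, plus a single O tag for non-entity tokens. A quick sketch (the label ordering here is an assumption; the authoritative mapping is in the model's config):

```python
# Entity types from the card: 4 original + 10 new.
ENTITY_TYPES = [
    "PER", "ORG", "LOC", "MISC",
    "MED", "ZIP", "COUNTRY_CODE", "STATE", "ETHNICITY",
    "RACE", "CONTINENT", "TERRITORY", "PHONE", "EMAIL",
]

# BIO scheme: "O" plus B-/I- variants of every entity type -> 1 + 14*2 = 29 labels.
labels = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]
print(len(labels))  # 29
```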
## Training Data
This model was trained on:
- Base Dataset: CoNLL-2003 Named Entity Recognition dataset
- Extended Data: 69 custom annotated examples
- Entity Types: All 14 entity types with diverse examples
- Training Approach: Fine-tuning from bert-base-NER
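When fine-tuning BERT for token classification, word-level BIO labels must be aligned to WordPiece sub-tokens: conventionally only the first sub-token of each word keeps the label, and continuation pieces are set to -100 so the loss ignores them. A minimal sketch of that convention, with tokenization mocked rather than produced by a real tokenizer (the example words and label ids are hypothetical):

```python
# Word-level annotations (hypothetical example sentence).
words = ["Maria", "Garcia", "took", "Aspirin"]
word_labels = ["B-PER", "I-PER", "O", "B-MED"]
label2id = {"O": 0, "B-PER": 1, "I-PER": 2, "B-MED": 3}

# Mocked WordPiece splits; a real pipeline would get these from the tokenizer.
subtokens = [["Maria"], ["Gar", "##cia"], ["took"], ["As", "##pir", "##in"]]

# First sub-token keeps the label id; continuations get -100 (ignored by the loss).
aligned = []
for pieces, label in zip(subtokens, word_labels):
    aligned.append(label2id[label])
    aligned.extend([-100] * (len(pieces) - 1))

print(aligned)  # [1, 2, -100, 0, 3, -100, -100]
```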
## Use Cases
- Medical Records: Extract patient information, medications, contact details
- Business Documents: Identify companies, locations, contact information
- Personal Data: Extract names, addresses, phone numbers, emails
- Geographic Data: Identify locations, states, countries, territories
- Demographic Analysis: Extract ethnicity, race, geographic information
## Limitations

- Language: English only
- Domain: Performance may degrade on text that differs from the training domains
- Entity Boundaries: Entity boundaries may occasionally be misidentified
## Citation

```bibtex
@misc{extended-bert-base-ner,
  title={Extended BERT-base-NER: Multi-domain Named Entity Recognition},
  author={BikashML},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/BikashML/extended-bert-base-ner}
}
```
## License
This model is licensed under the MIT License.