library_name: transformers
tags:
- Persian
- Named Entity Recognition
- NER
- Albert
Model Card for Behpoyan-NER
Behpoyan-NER is a fine-tuned Albert model for Named Entity Recognition (NER) in the Persian language. It is based on the HooshvareLab/albert-fa-zwnj-base-v2-ner model and identifies ten types of entities: Date (DAT), Event (EVE), Facility (FAC), Location (LOC), Money (MON), Organization (ORG), Percent (PCT), Person (PER), Product (PRO), and Time (TIM).
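As a rough illustration of how the ten entity types translate into token-level tags, the sketch below builds a BIO label set (one `B-` and one `I-` tag per type, plus `O`). The helper name `build_bio_labels` and the label order are assumptions made for illustration; the authoritative mapping is whatever the loaded model exposes in `model.config.id2label`.

```python
# Entity types listed in this model card.
ENTITY_TYPES = ["DAT", "EVE", "FAC", "LOC", "MON", "ORG", "PCT", "PER", "PRO", "TIM"]


def build_bio_labels(entity_types):
    """Return 'O' plus a B-/I- tag pair for every entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")  # first token of an entity
        labels.append(f"I-{entity_type}")  # any following token of the same entity
    return labels


# Hypothetical ordering; check model.config.id2label for the real one.
LABELS = build_bio_labels(ENTITY_TYPES)
print(len(LABELS), LABELS[:5])
```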
Model Details
Model Description
Behpoyan-NER is designed to recognize named entities in Persian text, improving upon the capabilities of its base model, HooshvareLab/albert-fa-zwnj-base-v2-ner. It was fine-tuned on a combination of the ARMAN, PEYMA, and WikiANN datasets, which are widely used for NER in the Persian language.
- Developed by: Behpoyan
- Model type: Albert for Token Classification
- Language(s) (NLP): Persian (fa)
- License: MIT
Model Sources
- Repository: Behpoyan/Behpoyan-NER
- Base Model Repository: HooshvareLab/albert-fa-zwnj-base-v2-ner
Direct Use
This model can be directly used for Named Entity Recognition tasks in Persian text. Example applications include text analysis, information extraction, and Persian-language NLP applications.
Downstream Use
The model can be fine-tuned further for domain-specific NER tasks or combined with other models for complex NLP pipelines.
Out-of-Scope Use
The model is not designed for languages other than Persian or for tasks other than token classification. Using it to produce biased or harmful outputs is discouraged.
Recommendations
While the model performs well for general-purpose NER in Persian, users should validate its performance on their specific datasets. Be cautious of biases in the training data, especially in identifying less-represented entities.
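One way to validate the model on your own data is to compute entity-level precision, recall, and F1 from gold and predicted BIO tag sequences. The following is a minimal sketch, not the evaluation used by the authors; the function names `extract_spans` and `entity_prf` and the toy tag sequences are hypothetical, and it assumes predictions are already aligned to the same tokens as the gold labels.

```python
def extract_spans(tags):
    """Convert a BIO tag sequence into a set of (type, start, end) spans (end exclusive)."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes the last span
        begins = tag.startswith("B-")
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.add((current, start, i))
            start, current = None, None
        # A stray I- tag with no open span is treated as the start of a new span.
        if begins or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    return spans


def entity_prf(gold_sequences, pred_sequences):
    """Exact-match, entity-level precision, recall, and F1 over paired sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: the prediction finds the person but misses the location.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
precision, recall, f1 = entity_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting these numbers per entity type (by filtering the spans on their type) would make it easier to spot weak performance on the less-represented entities.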
How to Get Started with the Model
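The example below loads the model and tokenizer, runs the NER pipeline on a Persian sentence, and merges subword tokens into whole entities: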
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("Behpouyan/Behpouyan-NER")
model = AutoModelForTokenClassification.from_pretrained("Behpouyan/Behpouyan-NER")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
# Input example
example = '''
"در سال ۱۴۰۱، شرکت علیبابا اعلام کرد که با همکاری بانک ملت، یک پروژه بزرگ برای توسعه زیرساختهای تجارت الکترونیک در ایران آغاز خواهد کرد.
این پروژه در تهران و اصفهان اجرا میشود و پیشبینی میشود تا پایان سال ۱۴۰۲ تکمیل شود."
'''
# Get NER results
ner_results = nlp(example)
# Function to merge subword tokens into whole entities
def merge_entities(entities):
    merged_results = []
    current_entity = None
    for entity in entities:
        # The SentencePiece tokenizer marks word starts with "▁"; turn it into a space
        piece = entity['word'].replace("▁", " ")
        if entity['entity'].startswith("B-") or current_entity is None:
            # Start a new entity
            if current_entity:
                merged_results.append(current_entity)
            current_entity = {
                "word": piece.strip(),
                "entity": entity['entity'][2:],  # Remove the "B-"/"I-" prefix
                "score": entity['score'],
                "start": entity['start'],
                "end": entity['end'],
            }
        elif entity['entity'].startswith("I-") and current_entity:
            # Continue the current entity
            current_entity['word'] += piece
            current_entity['score'] = min(current_entity['score'], entity['score'])  # Use the lowest score
            current_entity['end'] = entity['end']
    # Add the last entity if any
    if current_entity:
        merged_results.append(current_entity)
    return merged_results
# Merge the entities
merged_results = merge_entities(ner_results)
# Display the merged results
print("Named Entity Recognition Results:")
for entity in merged_results:
    print(f"- Entity: {entity['word']}")
    print(f" Type: {entity['entity']}")
    print(f" Score: {entity['score']:.2f}")
    print(f" Start: {entity['start']}, End: {entity['end']}")
    print("-" * 40)
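An alternative to concatenating token strings is to recover each entity's surface form from the `start` and `end` character offsets, which avoids any dependence on how the tokenizer marks word boundaries. The sketch below assumes entries shaped like the pipeline output used above (`entity`, `score`, `start`, `end`); the function name `merge_by_offsets` and the toy input are hypothetical, and entries tagged `O` are assumed to have been filtered out already.

```python
def merge_by_offsets(text, entities):
    """Group B-/I- tagged tokens and slice each entity's text from the original string."""
    merged = []
    current = None
    for ent in entities:
        prefix, _, label = ent["entity"].partition("-")
        # An I- tag extends the open entity only when the types match.
        continues = prefix == "I" and current is not None and current["entity"] == label
        if continues:
            current["end"] = ent["end"]
            current["score"] = min(current["score"], ent["score"])
        else:
            if current is not None:
                merged.append(current)
            current = {
                "entity": label,
                "score": ent["score"],
                "start": ent["start"],
                "end": ent["end"],
            }
    if current is not None:
        merged.append(current)
    for item in merged:
        item["word"] = text[item["start"]:item["end"]].strip()
    return merged


# Toy input mimicking token-level pipeline output (values are made up).
toy_text = "Ali Rezaei lives in Tehran"
toy_entities = [
    {"entity": "B-PER", "score": 0.99, "start": 0, "end": 3},
    {"entity": "I-PER", "score": 0.95, "start": 4, "end": 10},
    {"entity": "B-LOC", "score": 0.98, "start": 20, "end": 26},
]
toy_merged = merge_by_offsets(toy_text, toy_entities)
for item in toy_merged:
    print(item["entity"], item["word"], round(item["score"], 2))
```

The `pipeline` function also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens into entities; it is worth comparing its output with a manual merge on your own text before relying on either.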