NER in Urdu

muril_base_cased_urdu_ner_2.0

This model uses the same base model and NER dataset as muril_base_cased_urdu_ner. In addition, I added a new politics NER dataset translated from CrossNER. Because the additional dataset is small, the new labels may not be recognized reliably; however, overall performance on the original 22 labels has improved compared to muril_base_cased_urdu_ner.

The base model is google/muril-base-cased, a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. The main Urdu NER dataset was translated from HiNER, a Hindi NER dataset.

Usage

Example:

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("MichaelHuang/muril_base_cased_urdu_ner_2.0")
tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")

# Define the labels dictionary
labels_dict = {
    0: "B-FESTIVAL",
    1: "B-GAME",
    2: "B-LANGUAGE",
    3: "B-LITERATURE",
    4: "B-LOCATION",
    5: "B-MISC",
    6: "B-NUMEX",
    7: "B-ORGANIZATION",
    8: "B-PERSON",
    9: "B-RELIGION",
    10: "B-TIMEX",
    11: "I-FESTIVAL",
    12: "I-GAME",
    13: "I-LANGUAGE",
    14: "I-LITERATURE",
    15: "I-LOCATION",
    16: "I-MISC",
    17: "I-NUMEX",
    18: "I-ORGANIZATION",
    19: "I-PERSON",
    20: "I-RELIGION",
    21: "I-TIMEX",
    22: "O",
    23: "B-ELECTION",
    24: "B-POLITICALPARTY",
    25: "B-POLITICIAN",
    26: "B-EVENT",
    27: "B-COUNTRY",
    28: "I-ELECTION",
    29: "I-POLITICALPARTY",
    30: "I-POLITICIAN",
    31: "I-EVENT",
    32: "I-COUNTRY"
}

def ner_predict(sentence, model, tokenizer, labels_dict):
    # Tokenize the input sentence
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=128)

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the highest-scoring label id for each token
    predicted_ids = torch.argmax(outputs.logits, dim=-1)

    # Convert tokens and label ids to lists (batch index 0: a single sentence)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    label_ids = predicted_ids[0].tolist()

    # Map numeric label ids to string labels
    predicted_labels = [labels_dict[label_id] for label_id in label_ids]

    # Combine tokens and labels; tokens are subword pieces and include [CLS] and [SEP]
    result = list(zip(tokens, predicted_labels))

    return result

test_sentence = "امیتابھ اور ریکھا کی فلم 'گنگا کی سوگندھ' 10 فروری سنہ 1978 کو ریلیز ہوئی تھی۔ اس کے بعد راکھی، رندھیر کپور اور نیتو سنگھ کے ساتھ 'قسمے وعدے' 21 اپریل 1978 کو ریلیز ہوئی۔"
predictions = ner_predict(test_sentence, model, tokenizer, labels_dict)

for token, label in predictions:
    print(f"{token}: {label}")
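
The loop above prints one label per subword token, including the special [CLS] and [SEP] tokens. If you want whole entities instead, the pairs can be post-processed. The following is a minimal sketch, not part of the original model card: group_entities is a hypothetical helper, it assumes continuation pieces carry the BERT-style "##" prefix, and it takes the label of a word's first piece as the label of the word. The sample pairs are hand-written for illustration and are not real model output.

```python
def group_entities(pairs, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Merge subword pieces into words, then group BIO tags into entity spans.

    Hypothetical helper: assumes "##" marks a continuation piece and that
    labels look like "B-TYPE", "I-TYPE", or "O".
    """
    # Step 1: rebuild words; each word keeps the label of its first piece.
    words = []
    for token, label in pairs:
        if token in special_tokens:
            continue
        if token.startswith("##") and words:
            words[-1][0] += token[2:]
        else:
            words.append([token, label])

    # Step 2: group a B- tag and the following I- tags of the same type.
    entities = []
    current_type, current_words = None, []
    for word, label in words:
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            current_words.append(word)
            continue
        # Any other tag closes the span that was open, if there was one.
        if current_words:
            entities.append((" ".join(current_words), current_type))
        if prefix in ("B", "I"):
            # A stray I- tag is treated leniently as the start of a new span.
            current_type, current_words = entity_type, [word]
        else:
            current_type, current_words = None, []
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hand-written sample pairs in the same (token, label) shape as ner_predict returns.
sample_pairs = [
    ("[CLS]", "O"),
    ("رندھیر", "B-PERSON"),
    ("کپور", "I-PERSON"),
    ("اور", "O"),
    ("نیتو", "B-PERSON"),
    ("سنگھ", "I-PERSON"),
    ("[SEP]", "O"),
]

for text, entity_type in group_entities(sample_pairs):
    print(f"{text}: {entity_type}")
```

To apply it to real output, pass the list returned by ner_predict, for example group_entities(predictions). Taking the first piece's label is only one possible convention; a majority vote over the pieces of a word would be an alternative.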