MichaelHuang committed on
Commit
0e7e791
1 Parent(s): ccd4d34

Create README.md

Files changed (1)
  1. README.md +91 -0
README.md ADDED
@@ -0,0 +1,91 @@
+ ---
+ language:
+ - ur
+ tags:
+ - ner
+ ---
+
+ # NER in Urdu
+ ## muril_base_cased_urdu_ner_2.0
+
+ This model uses the same base model and NER dataset as muril_base_cased_urdu_ner, plus a new politics NER dataset translated from [CrossNER](https://github.com/zliucr/CrossNER/tree/main).
+ Because the additional dataset is small, the new politics labels may not be recognized reliably; however, overall performance on the original 22 labels has improved compared to muril_base_cased_urdu_ner.
+
+ The base model is [google/muril-base-cased](https://huggingface.co/google/muril-base-cased), a BERT model pre-trained on 17 Indian languages and their transliterated counterparts.
+ The main Urdu NER dataset is translated from the Hindi NER dataset [HiNER](https://github.com/cfiltnlp/HiNER).
+
+ ## Usage
+ ### Example
+ ```python
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
+ import torch
+
+ model = AutoModelForTokenClassification.from_pretrained("MichaelHuang/muril_base_cased_urdu_ner_2.0")
+ tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
+
+ # Define the labels dictionary
+ labels_dict = {
+     0: "B-FESTIVAL",
+     1: "B-GAME",
+     2: "B-LANGUAGE",
+     3: "B-LITERATURE",
+     4: "B-LOCATION",
+     5: "B-MISC",
+     6: "B-NUMEX",
+     7: "B-ORGANIZATION",
+     8: "B-PERSON",
+     9: "B-RELIGION",
+     10: "B-TIMEX",
+     11: "I-FESTIVAL",
+     12: "I-GAME",
+     13: "I-LANGUAGE",
+     14: "I-LITERATURE",
+     15: "I-LOCATION",
+     16: "I-MISC",
+     17: "I-NUMEX",
+     18: "I-ORGANIZATION",
+     19: "I-PERSON",
+     20: "I-RELIGION",
+     21: "I-TIMEX",
+     22: "O",
+     23: "B-ELECTION",
+     24: "B-POLITICALPARTY",
+     25: "B-POLITICIAN",
+     26: "B-EVENT",
+     27: "B-COUNTRY",
+     28: "I-ELECTION",
+     29: "I-POLITICALPARTY",
+     30: "I-POLITICIAN",
+     31: "I-EVENT",
+     32: "I-COUNTRY"
+ }
+
+ def ner_predict(sentence, model, tokenizer, labels_dict):
+     # Tokenize the input sentence (inputs longer than 128 tokens are truncated)
+     inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=128)
+
+     # Perform inference without tracking gradients
+     with torch.no_grad():
+         outputs = model(**inputs)
+
+     # Get the highest-scoring label id for every token
+     predicted_labels = torch.argmax(outputs.logits, dim=2)
+
+     # Convert tokens and labels of the single input sentence to lists
+     tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+     labels = predicted_labels[0].tolist()
+
+     # Map numeric labels to string labels
+     predicted_labels = [labels_dict[label] for label in labels]
+
+     # Combine tokens and labels
+     result = list(zip(tokens, predicted_labels))
+
+     return result
+
+ test_sentence = "امیتابھ اور ریکھا کی فلم 'گنگا کی سوگندھ' 10 فروری سنہ 1978 کو ریلیز ہوئی تھی۔ اس کے بعد راکھی، رندھیر کپور اور نیتو سنگھ کے ساتھ 'قسمے وعدے' 21 اپریل 1978 کو ریلیز ہوئی۔"
+ predictions = ner_predict(test_sentence, model, tokenizer, labels_dict)
+
+ for token, label in predictions:
+     print(f"{token}: {label}")
+ ```
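
The example above prints one label per sub-word token, including the special tokens the tokenizer adds. A minimal sketch of how those `(token, label)` pairs could be folded into entity spans is shown below. It assumes continuation pieces carry a `##` prefix and that the special tokens are `[CLS]`, `[SEP]`, and `[PAD]`, which is the usual BERT WordPiece convention but is not confirmed by this README for MuRIL. `merge_wordpieces` and `group_entities` are hypothetical helper names, and the sample pairs are hand-written Latin-script placeholders, not real model output.

```python
# Assumed special tokens (standard BERT convention; an assumption for MuRIL).
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(pairs):
    """Join "##" continuation pieces onto the preceding word.

    The merged word keeps the label predicted for its first piece.
    """
    words = []
    for token, label in pairs:
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


def group_entities(words):
    """Collapse BIO-tagged words into (entity_text, entity_type) spans."""
    entities = []
    current_words, current_type = [], None
    for word, label in words:
        prefix, _, entity_type = label.partition("-")
        continues = prefix == "I" and entity_type == current_type
        if not continues:
            # Close the open entity (if any) before starting something new.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if prefix in ("B", "I"):
            # A stray "I-" tag with no open entity is treated as a new entity.
            current_words.append(word)
            current_type = entity_type
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hand-written illustrative pairs in the shape returned by ner_predict.
sample_pairs = [
    ("[CLS]", "O"),
    ("Amit", "B-PERSON"),
    ("##abh", "I-PERSON"),
    ("aur", "O"),
    ("10", "B-TIMEX"),
    ("February", "I-TIMEX"),
    ("1978", "I-TIMEX"),
    ("[SEP]", "O"),
]

for text, entity_type in group_entities(merge_wordpieces(sample_pairs)):
    print(f"{entity_type}: {text}")
```

With real predictions, the same two calls would be applied to the list returned by `ner_predict`. Keeping only the first piece's label is one of several reasonable choices; a majority vote over the pieces would be another.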