---
datasets:
- yeajinmin/NER-News-BIDataset
language:
- ko
---

## Model Details  

### Model Description
NER-NewsBI-150142-e3b4 recognizes named entities in an input sentence and assigns each one a label from a set of 150 entity types.    
Because it was trained on a news dataset, it is best suited to news articles.   

- base model: https://huggingface.co/xlm-roberta-large-finetuned-conll03-english    
- tokenizer: "xlm-roberta-large-finetuned-conll03-english"   
- dataset: https://huggingface.co/datasets/yeajinmin/NER-News-BIDataset   

Because the base model is multilingual, this model can also tag entities in other languages with the same 150 labels, even though it was fine-tuned only on Korean data.   
The supported languages are listed on the base model's page linked above.   

### Training scores
| Epoch | Training Loss | Validation Loss | F1       |
|-------|---------------|------------------|----------|
| 1     | 0.237400      | 0.213017         | 0.791144 |
| 2     | 0.177400      | 0.174727         | 0.839951 |
| 3     | 0.119500      | 0.157669         | 0.862055 |

TrainOutput(global_step=90087, training_loss=0.19955111364530848, metrics={'train_runtime': 11692.8865, 'train_samples_per_second': 30.817, 'train_steps_per_second': 7.704, 'total_flos': 4.889673580336036e+16, 'train_loss': 0.19955111364530848, 'epoch': 3.0})   
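
As a rough consistency check, the figures in the `TrainOutput` above can be combined to recover the number of steps per epoch and the effective batch size. The sketch below only redoes that arithmetic. Reading the `e3b4` suffix of the model name as "3 epochs, batch size 4" is an assumption that the numbers happen to agree with, not something the output states.

```python
# Values copied from the TrainOutput above
global_step = 90087
epochs = 3.0
train_runtime = 11692.8865      # seconds
samples_per_second = 30.817
steps_per_second = 7.704

# Optimizer steps taken in each epoch
steps_per_epoch = global_step / epochs
# Samples processed per optimizer step, i.e. the effective batch size
effective_batch_size = samples_per_second / steps_per_second
# Approximate number of training samples seen in each epoch
samples_per_epoch = samples_per_second * train_runtime / epochs

print(f"steps per epoch:      {steps_per_epoch:.0f}")
print(f"effective batch size: {effective_batch_size:.2f}")
print(f"samples per epoch:    {samples_per_epoch:.0f}")
```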


## Uses

### Main Use  

The 151 labels the model uses (the non-entity label `O` plus 150 entity types) are listed in the table below.   
|index|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92|93|94|95|96|97|98|99|100|101|102|103|104|105|106|107|108|109|110|111|112|113|114|115|116|117|118|119|120|121|122|123|124|125|126|127|128|129|130|131|132|133|134|135|136|137|138|139|140|141|142|143|144|145|146|147|148|149|150|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Label|O|PS\_NAME|PS\_CHARACTER|PS\_PET|FD\_SCIENCE|FD\_SOCIAL\_SCIENCE|FD\_MEDICINE|FD\_ART|FD\_HUMANITIES|FD\_OTHERS|TR\_SCIENCE|TR\_SOCIAL\_SCIENCE|TR\_MEDICINE|TR\_ART|TR\_HUMANITIES|TR\_OTHERS|AF\_BUILDING|AF\_CULTURAL\_ASSET|AF\_ROAD|AF\_TRANSPORT|AF\_MUSICAL\_INSTRUMENT|AF\_WEAPON|AFA\_DOCUMENT|AFA\_PERFORMANCE|AFA\_VIDEO|AFA\_ART\_CRAFT|AFA\_MUSIC|AFW\_SERVICE\_PRODUCTS|AFW\_OTHER\_PRODUCTS|OGG\_ECONOMY|OGG\_EDUCATION|OGG\_MILITARY|OGG\_MEDIA|OGG\_SPORTS|OGG\_ART|OGG\_MEDICINE|OGG\_RELIGION|OGG\_SCIENCE|OGG\_LIBRARY|OGG\_LAW|OGG\_POLITICS|OGG\_FOOD|OGG\_HOTEL|OGG\_OTHERS|LCP\_COUNTRY|LCP\_PROVINCE|LCP\_COUNTY|LCP\_CITY|LCP\_CAPITALCITY|LCG\_RIVER|LCG\_OCEAN|LCG\_BAY|LCG\_MOUNTAIN|LCG\_ISLAND|LCG\_CONTINENT|LC\_SPACE|LC\_OTHERS|CV\_CULTURE|CV\_TRIBE|CV\_LANGUAGE|CV\_POLICY|CV\_LAW|CV\_CURRENCY|CV\_TAX|CV\_FUNDS|CV\_ART|CV\_SPORTS|CV\_SPORTS\_POSITION|CV\_SPORTS\_INST|CV\_PRIZE|CV\_RELATION|CV\_OCCUPATION|CV\_POSITION|CV\_FOOD|CV\_DRINK|CV\_FOOD\_STYLE|CV\_CLOTHING|CV\_BUILDING\_TYPE|DT\_DURATION|DT\_DAY|DT\_WEEK|DT\_MONTH|DT\_YEAR|DT\_SEASON|DT\_GEOAGE|DT\_DYNASTY|DT\_OTHERS|TI\_DURATION|TI\_HOUR|TI\_MINUTE|TI\_SECOND|TI\_OTHERS|QT\_AGE|QT\_SIZE|QT\_LENGTH|QT\_COUNT|QT\_MAN\_COUNT|QT\_WEIGHT|QT\_PERCENTAGE|QT\_SPEED|QT\_TEMPERATURE|QT\_VOLUME|QT\_ORDER|QT\_PRICE|QT\_PHONE|QT\_SPORTS|QT\_CHANNEL|QT\_ALBUM|QT\_ADDRESS|QT\_OTHERS|EV\_ACTIVITY|EV\_WAR\_REVOLUTION|EV\_SPORTS|EV\_FESTIVAL|EV\_OTHERS|AM\_INSECT|AM\_BIRD|AM\_FISH|AM\_MAMMALIA|AM\_AMPHIBIA|AM\_REPTILIA|AM\_TYPE|AM\_PART|AM\_OTHERS|PT\_FRUIT|PT\_FLOWER|PT\_TREE|PT\_GRASS|PT\_TYPE|PT\_PART|PT\_OTHERS|MT\_ELEMENT|MT\_METAL|MT\_ROCK|MT\_CHEMICAL|TM\_COLOR|TM\_DIRECTION|TM\_CLIMATE|TM\_SHAPE|TM\_CELL\_TISSUE\_ORGAN|TMM\_DISEASE|TMM\_DRUG|TMI\_HW|TMI\_SW|TMI\_SITE|TMI\_EMAIL|TMI\_MODEL|TMI\_SERVICE|TMI\_PROJECT|TMIG\_GENRE|TM\_SPORTS|    

### How to Use  

```python
import pandas as pd
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned tokenizer and token-classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("yeajinmin/NER-NewsBI-150142-e3b4")
model = AutoModelForTokenClassification.from_pretrained("yeajinmin/NER-NewsBI-150142-e3b4")

# Build a token-level NER pipeline
nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer)

# "I told my American friend Lisa that Seoul subway Line 1 alone cannot take you to Daegu."
text = "미국인 친구 Lisa에게 서울의 지하철 1호선만으로는 대구에 갈 수 없다고 알려주었다."
results = nlp_ner(text)

print(results)

# For tabular output (DataFrame.to_markdown requires the `tabulate` package)
df = pd.DataFrame([(result['word'], result['entity']) for result in results], columns=["Word", "Entity"])

print(df.to_markdown(index=False))
```
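
The pipeline above returns one dictionary per token, so a multi-token entity is spread over several entries. A minimal sketch of merging them into whole entity spans follows. It assumes the `entity` field carries `B-`/`I-` prefixed tag names and that each entry has the `index`, `start` and `end` keys the token-classification pipeline normally emits. `group_entities` and the sample entries are hypothetical and hand-written for illustration, not real output of this model.

```python
def group_entities(text, results):
    """Merge consecutive B-/I- tagged tokens into (entity text, label) pairs."""
    spans = []
    current = None  # span being built: label, char offsets, last token index
    for tok in results:
        if tok["entity"] == "O":
            continue
        prefix, _, label = tok["entity"].partition("-")
        extends = (
            current is not None
            and prefix == "I"
            and label == current["label"]
            and tok["index"] == current["last_index"] + 1
        )
        if extends:
            # Same entity continues: stretch the span to this token's end
            current["end"] = tok["end"]
            current["last_index"] = tok["index"]
        else:
            # A new entity starts: store the previous span, open a new one
            if current is not None:
                spans.append(current)
            current = {"label": label, "start": tok["start"],
                       "end": tok["end"], "last_index": tok["index"]}
    if current is not None:
        spans.append(current)
    return [(text[s["start"]:s["end"]], s["label"]) for s in spans]


# Hypothetical token-level results for an English sentence
sample_text = "Lisa visited Seoul Station"
sample_results = [
    {"entity": "B-PS_NAME", "index": 1, "word": "Lisa", "start": 0, "end": 4},
    {"entity": "B-AF_BUILDING", "index": 3, "word": "Seoul", "start": 13, "end": 18},
    {"entity": "I-AF_BUILDING", "index": 4, "word": "Station", "start": 19, "end": 26},
]

print(group_entities(sample_text, sample_results))
```

Slicing the original text by character offsets, rather than joining the `word` fields, avoids having to undo subword markers added by the tokenizer.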

The `index2tag` and `tag2index` mappings are needed to convert between the model's class indices and the NER tag names. They can be defined as follows:  
```python
label_mapping = {'O': 0, 'PS_NAME': 1, 'PS_CHARACTER': 2, 'PS_PET': 3,
                 'FD_SCIENCE': 4, 'FD_SOCIAL_SCIENCE': 5, 'FD_MEDICINE': 6, 'FD_ART':7, 'FD_HUMANITIES': 8, 'FD_OTHERS': 9,
                 'TR_SCIENCE': 10, 'TR_SOCIAL_SCIENCE': 11, 'TR_MEDICINE': 12, 'TR_ART': 13, 'TR_HUMANITIES': 14, 'TR_OTHERS': 15,
                 'AF_BUILDING': 16, 'AF_CULTURAL_ASSET': 17, 'AF_ROAD': 18, 'AF_TRANSPORT': 19, 'AF_MUSICAL_INSTRUMENT': 20,
                 'AF_WEAPON': 21, 'AFA_DOCUMENT': 22, 'AFA_PERFORMANCE': 23, 'AFA_VIDEO': 24, 'AFA_ART_CRAFT': 25, 'AFA_MUSIC': 26, "AFW_SERVICE_PRODUCTS": 27, 'AFW_OTHER_PRODUCTS': 28,
                 'OGG_ECONOMY': 29, 'OGG_EDUCATION': 30, 'OGG_MILITARY': 31, 'OGG_MEDIA': 32, 'OGG_SPORTS': 33, 'OGG_ART': 34, 'OGG_MEDICINE': 35, 'OGG_RELIGION': 36, 'OGG_SCIENCE': 37, 'OGG_LIBRARY':38,
                 'OGG_LAW': 39, 'OGG_POLITICS': 40, 'OGG_FOOD': 41, 'OGG_HOTEL': 42, 'OGG_OTHERS': 43,
                 'LCP_COUNTRY': 44, 'LCP_PROVINCE': 45, 'LCP_COUNTY':46, 'LCP_CITY': 47, 'LCP_CAPITALCITY': 48, 'LCG_RIVER': 49, 'LCG_OCEAN': 50,
                 'LCG_BAY': 51, 'LCG_MOUNTAIN':52, 'LCG_ISLAND': 53, 'LCG_CONTINENT': 54, 'LC_SPACE': 55, 'LC_OTHERS': 56,
                 'CV_CULTURE': 57, 'CV_TRIBE': 58, 'CV_LANGUAGE': 59, 'CV_POLICY': 60,
                 'CV_LAW': 61, 'CV_CURRENCY': 62, 'CV_TAX': 63, 'CV_FUNDS': 64, 'CV_ART': 65, 'CV_SPORTS': 66, 'CV_SPORTS_POSITION': 67, 'CV_SPORTS_INST': 68, 'CV_PRIZE': 69, 'CV_RELATION': 70,
                 'CV_OCCUPATION': 71, 'CV_POSITION': 72, 'CV_FOOD': 73, 'CV_DRINK': 74, 'CV_FOOD_STYLE': 75, 'CV_CLOTHING': 76, 'CV_BUILDING_TYPE': 77,
                 'DT_DURATION': 78, 'DT_DAY': 79, 'DT_WEEK':80, 'DT_MONTH': 81, 'DT_YEAR': 82, 'DT_SEASON': 83, 'DT_GEOAGE': 84, 'DT_DYNASTY': 85, 'DT_OTHERS': 86,
                 'TI_DURATION': 87, 'TI_HOUR':88, 'TI_MINUTE': 89, 'TI_SECOND': 90, 'TI_OTHERS': 91,
                 'QT_AGE': 92, 'QT_SIZE': 93, 'QT_LENGTH': 94, 'QT_COUNT': 95, 'QT_MAN_COUNT': 96, 'QT_WEIGHT': 97, 'QT_PERCENTAGE': 98, 'QT_SPEED': 99, 'QT_TEMPERATURE': 100,
                 'QT_VOLUME': 101, 'QT_ORDER': 102, 'QT_PRICE': 103, 'QT_PHONE': 104, 'QT_SPORTS': 105, 'QT_CHANNEL': 106, 'QT_ALBUM': 107, 'QT_ADDRESS': 108, 'QT_OTHERS': 109,
                 'EV_ACTIVITY': 110, 'EV_WAR_REVOLUTION': 111, 'EV_SPORTS': 112, 'EV_FESTIVAL': 113, 'EV_OTHERS': 114,
                 'AM_INSECT': 115, 'AM_BIRD': 116, 'AM_FISH': 117, 'AM_MAMMALIA': 118, 'AM_AMPHIBIA': 119, 'AM_REPTILIA': 120, 'AM_TYPE': 121, 'AM_PART': 122, 'AM_OTHERS': 123,
                 'PT_FRUIT': 124, 'PT_FLOWER': 125, 'PT_TREE': 126, 'PT_GRASS': 127, 'PT_TYPE': 128, 'PT_PART': 129, 'PT_OTHERS': 130,
                 'MT_ELEMENT': 131, 'MT_METAL': 132, 'MT_ROCK':133, 'MT_CHEMICAL': 134,
                 'TM_COLOR': 135, 'TM_DIRECTION': 136, 'TM_CLIMATE': 137, 'TM_SHAPE': 138, 'TM_CELL_TISSUE_ORGAN': 139, 'TMM_DISEASE': 140, 'TMM_DRUG': 141, 'TMI_HW':142, 'TMI_SW': 143, 'TMI_SITE': 144, 'TMI_EMAIL': 145,
                 'TMI_MODEL': 146, 'TMI_SERVICE': 147, 'TMI_PROJECT': 148, 'TMIG_GENRE': 149, 'TM_SPORTS': 150}

# Add B- and I- tags for every entity label: B-<label> keeps the original index, I-<label> is offset by 150
new_label_mapping = {}
for key, value in label_mapping.items():
    if key == 'O':
        new_label_mapping[key] = value
        continue
    new_key_b = 'B-' + key
    new_key_i = 'I-' + key
    new_label_mapping[new_key_b] = value
    new_label_mapping[new_key_i] = value + 150

# Sort the new_label_mapping by values
new_label_mapping = {k: v for k, v in sorted(new_label_mapping.items(), key=lambda item: item[1])}

from datasets import Features, ClassLabel

features = Features({'label': ClassLabel(num_classes=301, names=list(new_label_mapping.keys()))})

tags = features['label']

index2tag = {idx:tag for idx, tag in enumerate(tags.names)}
tag2index = {tag:idx for idx, tag in enumerate(tags.names)}
```
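
With the numbering above, `B-<label>` keeps the base index and `I-<label>` sits 150 positions later. If the model configuration does not carry tag names, the pipeline reports classes in the generic `LABEL_<n>` form that `transformers` uses by default; whether that applies to this checkpoint is an assumption. The sketch below shows how such a name could be mapped back to a tag. `decode_label`, `split_tag` and `index2tag_demo` are hypothetical helpers, and the small dictionary stands in for the full `index2tag`.

```python
NUM_ENTITY_TYPES = 150  # offset between B- and I- indices in the mapping above


def decode_label(entity, index2tag):
    """Turn a generic name such as 'LABEL_47' into its tag name.

    Names that do not follow the 'LABEL_<n>' pattern are returned unchanged.
    """
    prefix = "LABEL_"
    if entity.startswith(prefix) and entity[len(prefix):].isdigit():
        return index2tag[int(entity[len(prefix):])]
    return entity


def split_tag(tag):
    """Split 'B-LCP_CITY' into ('B', 'LCP_CITY'); 'O' becomes ('O', None)."""
    if tag == "O":
        return "O", None
    prefix, _, label = tag.partition("-")
    return prefix, label


# Tiny stand-in for the full index2tag (same numbering scheme)
index2tag_demo = {
    0: "O",
    47: "B-LCP_CITY",
    47 + NUM_ENTITY_TYPES: "I-LCP_CITY",
}

print(decode_label("LABEL_47", index2tag_demo))
print(decode_label("LABEL_197", index2tag_demo))
print(split_tag(decode_label("LABEL_197", index2tag_demo)))
```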

### Extended Usage Idea  
Because it was trained on a news dataset, this model can also be used to search news articles.     
This is especially useful when the user does not know the exact name of the entity they are looking for.    
A query that combines an entity label with a predicate can find matching cases without naming any specific entity.     
For example, to find cases where a man-made building burned down, search for 'AF_BUILDING' + 'burned down' to see the actual cases and the names of the buildings.    
A predicate-only search for 'burned' would also return non-building cases such as forest fires.     
Likewise, to find cases where two countries signed an agreement, a query such as 'LCP_COUNTRY' + 'entered into an agreement' returns the actual cases together with the country names. This lets users search for articles by 'context' even when they know nothing about the countries involved.    
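
The 'entity label' + 'predicate' query described above can be sketched as a simple filter over articles that have already been tagged. Everything in the block is illustrative: `search_articles`, the article dictionaries and the place names are hypothetical, and plain substring matching stands in for whatever predicate matching a real search system would use.

```python
def search_articles(articles, label, predicate):
    """Return (article id, entity names) for articles that contain the
    predicate text and at least one entity tagged with the given label."""
    hits = []
    for article in articles:
        if predicate not in article["text"]:
            continue
        names = [name for name, ent_label in article["entities"]
                 if ent_label == label]
        if names:
            hits.append((article["id"], names))
    return hits


# Hypothetical pre-tagged articles; entities are (entity text, label) pairs
articles = [
    {"id": 1, "text": "The Hanbit Tower burned down overnight.",
     "entities": [("Hanbit Tower", "AF_BUILDING")]},
    {"id": 2, "text": "A forest near the Seorim valley burned down last week.",
     "entities": [("Seorim valley", "LC_OTHERS")]},
    {"id": 3, "text": "The Hanbit Tower reopened to visitors.",
     "entities": [("Hanbit Tower", "AF_BUILDING")]},
]

# Only article 1 has both a building entity and the predicate
print(search_articles(articles, "AF_BUILDING", "burned down"))
```

Article 2 matches the predicate but has no building entity, and article 3 has a building but not the predicate, so both are filtered out.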

## Performance  

Dataset used for evaluation  
The first 10,000 examples of the `test` split of each dataset linked below were used.    

```python
# `dataset` is the DatasetDict loaded from one of the dataset links below
ds = dataset['test']
# Keep the first 10,000 test examples
sliceds = ds.select(range(10000))
```

- NER-NewsBI-150142-e3b4: https://huggingface.co/datasets/yeajinmin/NER-News-BIDataset
- KcBert: https://huggingface.co/datasets/yeajinmin/News-NER-dataset-ForKCBERT
- KoGPT2: https://huggingface.co/datasets/yeajinmin/News-NER-dataset-ForKoGPT2

|Model|Precision|Recall|F1 score|
|--------|----|----|----|
|**NER-NewsBI-150142-e3b4**|**0.9208**|**0.9243**|**0.9225**|
|KcBERT|0.9105|0.9197|0.9151|
|KoGPT2|0.8032|0.8224|0.8127|
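
The F1 column in the table is consistent with the harmonic mean of the precision and recall columns, which the short check below reproduces. How precision and recall were aggregated across labels (for example micro or macro averaging) is not stated here, so the sketch only works from the reported numbers.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above
reported = {
    "NER-NewsBI-150142-e3b4": (0.9208, 0.9243),
    "KcBERT": (0.9105, 0.9197),
    "KoGPT2": (0.8032, 0.8224),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.4f}")
```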

The other models trained for this comparison are available at the links below:  
- KcBert: https://huggingface.co/yeajinmin/NER-News-kcbert-150142-e3b4
- KoGPT2: https://huggingface.co/yeajinmin/NER-News-KoGPT2-150142-e3b4