---
datasets:
- yeajinmin/NER-News-BIDataset
language:
- ko
---

## Model Details

### Model Description

NER-NewsBI-150142-e3b4 recognizes named entities in input sentences and assigns each one a label from a set of 150 entity labels.

Because it was trained on a news dataset, it is specialized for news articles.

- base model: https://huggingface.co/xlm-roberta-large-finetuned-conll03-english
- tokenizer: "xlm-roberta-large-finetuned-conll03-english"
- dataset: https://huggingface.co/datasets/yeajinmin/NER-News-BIDataset

Because the base model is multilingual, the model can also tag entities with the 150 labels in languages other than Korean, even though it was fine-tuned only on Korean data.

The supported languages are listed on the base model's page linked above.


### Training scores

| Epoch | Training Loss | Validation Loss | F1       |
|-------|---------------|-----------------|----------|
| 1     | 0.237400      | 0.213017        | 0.791144 |
| 2     | 0.177400      | 0.174727        | 0.839951 |
| 3     | 0.119500      | 0.157669        | 0.862055 |

`TrainOutput(global_step=90087, training_loss=0.19955111364530848, metrics={'train_runtime': 11692.8865, 'train_samples_per_second': 30.817, 'train_steps_per_second': 7.704, 'total_flos': 4.889673580336036e+16, 'train_loss': 0.19955111364530848, 'epoch': 3.0})`
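
As a rough sanity check, the counters in the trainer output above can be related to one another. The sketch below uses only numbers printed in `TrainOutput`; the derived per-epoch step count and samples-per-step figure are estimates computed here, not values stated by the author.

```python
# Values copied from the TrainOutput line above.
global_step = 90087
epochs = 3.0
samples_per_second = 30.817
steps_per_second = 7.704
train_runtime = 11692.8865  # seconds

# Optimizer steps taken in one epoch.
steps_per_epoch = global_step / epochs
# Samples processed per optimizer step (an estimate of the effective batch size).
samples_per_step = samples_per_second / steps_per_second

print(f"steps per epoch:  {steps_per_epoch:.0f}")
print(f"samples per step: {samples_per_step:.2f}")
print(f"runtime (hours):  {train_runtime / 3600:.2f}")
```

The ratio of samples per second to steps per second works out to roughly 4 samples per step, and the run took a little over three hours.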

## Uses

### Main Use

The table below lists the 151 labels the model uses: the `O` (non-entity) label at index 0 plus the 150 entity labels at indices 1 to 150.

|index|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92|93|94|95|96|97|98|99|100|101|102|103|104|105|106|107|108|109|110|111|112|113|114|115|116|117|118|119|120|121|122|123|124|125|126|127|128|129|130|131|132|133|134|135|136|137|138|139|140|141|142|143|144|145|146|147|148|149|150| |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| |
|Label|O|PS\_NAME|PS\_CHARACTER|PS\_PET|FD\_SCIENCE|FD\_SOCIAL\_SCIENCE|FD\_MEDICINE|FD\_ART|FD\_HUMANITIES|FD\_OTHERS|TR\_SCIENCE|TR\_SOCIAL\_SCIENCE|TR\_MEDICINE|TR\_ART|TR\_HUMANITIES|TR\_OTHERS|AF\_BUILDING|AF\_CULTURAL\_ASSET|AF\_ROAD|AF\_TRANSPORT|AF\_MUSICAL\_INSTRUMENT|AF\_WEAPON|AFA\_DOCUMENT|AFA\_PERFORMANCE|AFA\_VIDEO|AFA\_ART\_CRAFT|AFA\_MUSIC|AFW\_SERVICE\_PRODUCTS|AFW\_OTHER\_PRODUCTS|OGG\_ECONOMY|OGG\_EDUCATION|OGG\_MILITARY|OGG\_MEDIA|OGG\_SPORTS|OGG\_ART|OGG\_MEDICINE|OGG\_RELIGION|OGG\_SCIENCE|OGG\_LIBRARY|OGG\_LAW|OGG\_POLITICS|OGG\_FOOD|OGG\_HOTEL|OGG\_OTHERS|LCP\_COUNTRY|LCP\_PROVINCE|LCP\_COUNTY|LCP\_CITY|LCP\_CAPITALCITY|LCG\_RIVER|LCG\_OCEAN|LCG\_BAY|LCG\_MOUNTAIN|LCG\_ISLAND|LCG\_CONTINENT|LC\_SPACE|LC\_OTHERS|CV\_CULTURE|CV\_TRIBE|CV\_LANGUAGE|CV\_POLICY|CV\_LAW|CV\_CURRENCY|CV\_TAX|CV\_FUNDS|CV\_ART|CV\_SPORTS|CV\_SPORTS\_POSITION|CV\_SPORTS\_INST|CV\_PRIZE|CV\_RELATION|CV\_OCCUPATION|CV\_POSITION|CV\_FOOD|CV\_DRINK|CV\_FOOD\_STYLE|CV\_CLOTHING|CV\_BUILDING\_TYPE|DT\_DURATION|DT\_DAY|DT\_WEEK|DT\_MONTH|DT\_YEAR|DT\_SEASON|DT\_GEOAGE|DT\_DYNASTY|DT\_OTHERS|TI\_DURATION|TI\_HOUR|TI\_MINUTE|TI\_SECOND|TI\_OTHERS|QT\_AGE|QT\_SIZE|QT\_LENGTH|QT\_COUNT|QT\_MAN\_COUNT|QT\_WEIGHT|QT\_PERCENTAGE|QT\_SPEED|QT\_TEMPERATURE|QT\_VOLUME|QT\_ORDER|QT\_PRICE|QT\_PHONE|QT\_SPORTS|QT\_CHANNEL|QT\_ALBUM|QT\_ADDRESS|QT\_OTHERS|EV\_ACTIVITY|EV\_WAR\_REVOLUTION|EV\_SPORTS|EV\_FESTIVAL|EV\_OTHERS|AM\_INSECT|AM\_BIRD|AM\_FISH|AM\_MAMMALIA|AM\_AMPHIBIA|AM\_REPTILIA|AM\_TYPE|AM\_PART|AM\_OTHERS|PT\_FRUIT|PT\_FLOWER|PT\_TREE|PT\_GRASS|PT\_TYPE|PT\_PART|PT\_OTHERS|MT\_ELEMENT|MT\_METAL|MT\_ROCK|MT\_CHEMICAL|TM\_COLOR|TM\_DIRECTION|TM\_CLIMATE|TM\_SHAPE|TM\_CELL\_TISSUE\_ORGAN|TMM\_DISEASE|TMM\_DRUG|TMI\_HW|TMI\_SW|TMI\_SITE|TMI\_EMAIL|TMI\_MODEL|TMI\_SERVICE|TMI\_PROJECT|TMIG\_GENRE|TM\_SPORTS| |

### How to Use

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_id = "yeajinmin/NER-NewsBI-150142-e3b4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Token-classification pipeline; returns one dict per tagged token
nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer)

# "I told my American friend Lisa that Seoul subway Line 1 alone cannot take you to Daegu."
text = "미국인 친구 Lisa에게 서울의 지하철 1호선만으로는 대구에 갈 수 없다고 알려주었다."
results = nlp_ner(text)
print(results)

# For tabular output (DataFrame.to_markdown requires the `tabulate` package)
import pandas as pd

df = pd.DataFrame(
    [(result["word"], result["entity"]) for result in results],
    columns=["Word", "Entity"],
)
print(df.to_markdown(index=False))
```

Each of the 150 entity labels is split into a `B-` (beginning) and an `I-` (inside) tag, which together with `O` gives 301 classes. Converting between class indices and tag names requires the `index2tag` and `tag2index` mappings, defined as follows:

```python
from datasets import ClassLabel, Features

# 'O' (index 0) plus the 150 entity labels (indices 1-150)
label_mapping = {
    'O': 0, 'PS_NAME': 1, 'PS_CHARACTER': 2, 'PS_PET': 3,
    'FD_SCIENCE': 4, 'FD_SOCIAL_SCIENCE': 5, 'FD_MEDICINE': 6, 'FD_ART': 7, 'FD_HUMANITIES': 8, 'FD_OTHERS': 9,
    'TR_SCIENCE': 10, 'TR_SOCIAL_SCIENCE': 11, 'TR_MEDICINE': 12, 'TR_ART': 13, 'TR_HUMANITIES': 14, 'TR_OTHERS': 15,
    'AF_BUILDING': 16, 'AF_CULTURAL_ASSET': 17, 'AF_ROAD': 18, 'AF_TRANSPORT': 19, 'AF_MUSICAL_INSTRUMENT': 20,
    'AF_WEAPON': 21, 'AFA_DOCUMENT': 22, 'AFA_PERFORMANCE': 23, 'AFA_VIDEO': 24, 'AFA_ART_CRAFT': 25, 'AFA_MUSIC': 26,
    'AFW_SERVICE_PRODUCTS': 27, 'AFW_OTHER_PRODUCTS': 28,
    'OGG_ECONOMY': 29, 'OGG_EDUCATION': 30, 'OGG_MILITARY': 31, 'OGG_MEDIA': 32, 'OGG_SPORTS': 33, 'OGG_ART': 34,
    'OGG_MEDICINE': 35, 'OGG_RELIGION': 36, 'OGG_SCIENCE': 37, 'OGG_LIBRARY': 38,
    'OGG_LAW': 39, 'OGG_POLITICS': 40, 'OGG_FOOD': 41, 'OGG_HOTEL': 42, 'OGG_OTHERS': 43,
    'LCP_COUNTRY': 44, 'LCP_PROVINCE': 45, 'LCP_COUNTY': 46, 'LCP_CITY': 47, 'LCP_CAPITALCITY': 48,
    'LCG_RIVER': 49, 'LCG_OCEAN': 50, 'LCG_BAY': 51, 'LCG_MOUNTAIN': 52, 'LCG_ISLAND': 53, 'LCG_CONTINENT': 54,
    'LC_SPACE': 55, 'LC_OTHERS': 56,
    'CV_CULTURE': 57, 'CV_TRIBE': 58, 'CV_LANGUAGE': 59, 'CV_POLICY': 60,
    'CV_LAW': 61, 'CV_CURRENCY': 62, 'CV_TAX': 63, 'CV_FUNDS': 64, 'CV_ART': 65, 'CV_SPORTS': 66,
    'CV_SPORTS_POSITION': 67, 'CV_SPORTS_INST': 68, 'CV_PRIZE': 69, 'CV_RELATION': 70,
    'CV_OCCUPATION': 71, 'CV_POSITION': 72, 'CV_FOOD': 73, 'CV_DRINK': 74, 'CV_FOOD_STYLE': 75,
    'CV_CLOTHING': 76, 'CV_BUILDING_TYPE': 77,
    'DT_DURATION': 78, 'DT_DAY': 79, 'DT_WEEK': 80, 'DT_MONTH': 81, 'DT_YEAR': 82, 'DT_SEASON': 83,
    'DT_GEOAGE': 84, 'DT_DYNASTY': 85, 'DT_OTHERS': 86,
    'TI_DURATION': 87, 'TI_HOUR': 88, 'TI_MINUTE': 89, 'TI_SECOND': 90, 'TI_OTHERS': 91,
    'QT_AGE': 92, 'QT_SIZE': 93, 'QT_LENGTH': 94, 'QT_COUNT': 95, 'QT_MAN_COUNT': 96, 'QT_WEIGHT': 97,
    'QT_PERCENTAGE': 98, 'QT_SPEED': 99, 'QT_TEMPERATURE': 100,
    'QT_VOLUME': 101, 'QT_ORDER': 102, 'QT_PRICE': 103, 'QT_PHONE': 104, 'QT_SPORTS': 105,
    'QT_CHANNEL': 106, 'QT_ALBUM': 107, 'QT_ADDRESS': 108, 'QT_OTHERS': 109,
    'EV_ACTIVITY': 110, 'EV_WAR_REVOLUTION': 111, 'EV_SPORTS': 112, 'EV_FESTIVAL': 113, 'EV_OTHERS': 114,
    'AM_INSECT': 115, 'AM_BIRD': 116, 'AM_FISH': 117, 'AM_MAMMALIA': 118, 'AM_AMPHIBIA': 119,
    'AM_REPTILIA': 120, 'AM_TYPE': 121, 'AM_PART': 122, 'AM_OTHERS': 123,
    'PT_FRUIT': 124, 'PT_FLOWER': 125, 'PT_TREE': 126, 'PT_GRASS': 127, 'PT_TYPE': 128, 'PT_PART': 129, 'PT_OTHERS': 130,
    'MT_ELEMENT': 131, 'MT_METAL': 132, 'MT_ROCK': 133, 'MT_CHEMICAL': 134,
    'TM_COLOR': 135, 'TM_DIRECTION': 136, 'TM_CLIMATE': 137, 'TM_SHAPE': 138, 'TM_CELL_TISSUE_ORGAN': 139,
    'TMM_DISEASE': 140, 'TMM_DRUG': 141, 'TMI_HW': 142, 'TMI_SW': 143, 'TMI_SITE': 144, 'TMI_EMAIL': 145,
    'TMI_MODEL': 146, 'TMI_SERVICE': 147, 'TMI_PROJECT': 148, 'TMIG_GENRE': 149, 'TM_SPORTS': 150,
}

# Expand every entity label into B- and I- tags:
# B- tags keep indices 1-150, I- tags take indices 151-300.
new_label_mapping = {}
for key, value in label_mapping.items():
    if key == 'O':
        new_label_mapping[key] = value
        continue
    new_label_mapping['B-' + key] = value
    new_label_mapping['I-' + key] = value + 150

# Sort by index so that list position matches class index
new_label_mapping = dict(sorted(new_label_mapping.items(), key=lambda item: item[1]))

features = Features({'label': ClassLabel(num_classes=301, names=list(new_label_mapping.keys()))})
tags = features['label']

index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}
```
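
Once an index-to-tag mapping exists, raw class indices can be turned into tag names. The sketch below assumes the pipeline reports labels in the generic `LABEL_<index>` form, which is the Transformers default when a checkpoint's config carries no label names; if the pipeline already returns names such as `B-PS_NAME`, this step is unnecessary. The `decode_entities` helper and the sample `results` list are hypothetical, and the four-entry `index2tag` stands in for the full 301-entry mapping.

```python
def decode_entities(results, index2tag):
    """Replace 'LABEL_<n>' strings with tag names; leave other labels untouched."""
    decoded = []
    for item in results:
        label = item["entity"]
        if label.startswith("LABEL_"):
            label = index2tag[int(label.split("_", 1)[1])]
        decoded.append({**item, "entity": label})
    return decoded


# Tiny stand-in for the full mapping (indices follow the same B-/I- scheme).
index2tag = {0: "O", 1: "B-PS_NAME", 47: "B-LCP_CITY", 151: "I-PS_NAME"}

# Hypothetical pipeline output, for illustration only.
results = [
    {"word": "▁Lisa", "entity": "LABEL_1", "score": 0.99},
    {"word": "▁대구", "entity": "LABEL_47", "score": 0.98},
]

decoded = decode_entities(results, index2tag)
for item in decoded:
    print(item["word"], item["entity"])
```

Returning new dicts rather than modifying the pipeline output in place keeps the raw predictions available for debugging.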

### Extended Usage Idea

Because this model was trained on a news dataset, it can support news article search.

This is especially useful when the user does not know the exact name of the entity they are looking for.

A query that combines an entity label with a predicate retrieves matching cases without requiring any entity name.

For example, to find cases where a man-made building burned down, search for 'AF_BUILDING' + 'burned down' to retrieve the actual cases along with the names of the buildings.

A predicate-only search for 'burned' would also return non-building cases such as forest fires.

Similarly, to find cases where two countries signed an agreement, a query such as 'LCP_COUNTRY' + 'entered into an agreement' returns the actual cases and the country names. This lets users find articles by context even when they know nothing about the countries involved.
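
The 'entity label + predicate' query described above can be sketched as a simple filter. The example assumes articles have already been tagged by the model and stored as sentence text plus a list of `(entity text, label)` pairs. The `search` function, the record layout, and the sample sentences are all hypothetical, and predicate matching is reduced to a plain substring test.

```python
def search(records, label, predicate):
    """Return (sentence, entity names) for sentences that contain the
    predicate and at least one entity tagged with the given label."""
    hits = []
    for record in records:
        if predicate not in record["text"]:
            continue
        names = [name for name, tag in record["entities"] if tag == label]
        if names:
            hits.append((record["text"], names))
    return hits


# Hypothetical pre-tagged sentences, for illustration only.
records = [
    {"text": "The old town hall burned down on Tuesday.",
     "entities": [("town hall", "AF_BUILDING"), ("Tuesday", "DT_DAY")]},
    {"text": "A large area of forest burned down last summer.",
     "entities": [("last summer", "DT_SEASON")]},
]

hits = search(records, "AF_BUILDING", "burned down")
for text, names in hits:
    print(names, "->", text)
```

Only the first sentence is returned: the forest-fire sentence matches the predicate but contains no `AF_BUILDING` entity, which is the filtering effect the label adds over a predicate-only search.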
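
## Performance

Evaluation used the first 10,000 examples of the `test` split of each model's dataset, linked below.

```python
from datasets import load_dataset

# Shown for NER-NewsBI-150142-e3b4; the other models use the datasets listed below.
dataset = load_dataset("yeajinmin/NER-News-BIDataset")
ds = dataset['test']
sliceds = ds.select(range(10000))  # first 10,000 test examples
```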

- NER-NewsBI-150142-e3b4: https://huggingface.co/datasets/yeajinmin/NER-News-BIDataset
- KcBERT: https://huggingface.co/datasets/yeajinmin/News-NER-dataset-ForKCBERT
- KoGPT2: https://huggingface.co/datasets/yeajinmin/News-NER-dataset-ForKoGPT2

| Model | Precision | Recall | F1 score |
|--------|----|----|----|
|**NER-NewsBI-150142-e3b4**|**0.9208**|**0.9243**|**0.9225**|
|KcBERT|0.9105|0.9197|0.9151|
|KoGPT2|0.8032|0.8224|0.8127|
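
The F1 column can be cross-checked against the precision and recall columns. The sketch below assumes F1 is the usual harmonic mean of the two; the card does not state how the scores were averaged, so this is a consistency check on the reported figures rather than a reproduction of the evaluation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "NER-NewsBI-150142-e3b4": (0.9208, 0.9243, 0.9225),
    "KcBERT": (0.9105, 0.9197, 0.9151),
    "KoGPT2": (0.8032, 0.8224, 0.8127),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed {f1_score(p, r):.4f}, reported {f1}")
```

All three rows agree with the reported F1 to four decimal places.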

The other models trained for this comparison are available at the links below:

- KcBERT: https://huggingface.co/yeajinmin/NER-News-kcbert-150142-e3b4
- KoGPT2: https://huggingface.co/yeajinmin/NER-News-KoGPT2-150142-e3b4