---
datasets:
- yeajinmin/NER-News-BIDataset
language:
- ko
---

## Model Details

### Model Description

NER-NewsBI-150142-e3b4 recognizes named entities in input sentences and assigns each one a label from a set of 150 entity labels. Because it was trained on a news dataset, it is particularly suited to news articles.

- base model: https://huggingface.co/xlm-roberta-large-finetuned-conll03-english
- tokenizer: "xlm-roberta-large-finetuned-conll03-english"
- dataset: https://huggingface.co/datasets/yeajinmin/NER-News-BIDataset

Because the base model is multilingual, the model can also recognize entities with the same 150 labels in other languages, even though it was fine-tuned only on Korean data. The supported languages are listed on the base model page linked above.

### Training scores

| Epoch | Training Loss | Validation Loss | F1 |
|-------|---------------|------------------|----------|
| 1 | 0.237400 | 0.213017 | 0.791144 |
| 2 | 0.177400 | 0.174727 | 0.839951 |
| 3 | 0.119500 | 0.157669 | 0.862055 |

`TrainOutput(global_step=90087, training_loss=0.19955111364530848, metrics={'train_runtime': 11692.8865, 'train_samples_per_second': 30.817, 'train_steps_per_second': 7.704, 'total_flos': 4.889673580336036e+16, 'train_loss': 0.19955111364530848, 'epoch': 3.0})`

## Uses

### Main Use

The table below lists the 151 labels the model can assign to tokens in a sentence: the 150 entity labels plus `O` for tokens outside any entity.

|Index|Label|
|---|---|
|0|O|
|1|PS\_NAME|
|2|PS\_CHARACTER|
|3|PS\_PET|
|4|FD\_SCIENCE|
|5|FD\_SOCIAL\_SCIENCE|
|6|FD\_MEDICINE|
|7|FD\_ART|
|8|FD\_HUMANITIES|
|9|FD\_OTHERS|
|10|TR\_SCIENCE|
|11|TR\_SOCIAL\_SCIENCE|
|12|TR\_MEDICINE|
|13|TR\_ART|
|14|TR\_HUMANITIES|
|15|TR\_OTHERS|
|16|AF\_BUILDING|
|17|AF\_CULTURAL\_ASSET|
|18|AF\_ROAD|
|19|AF\_TRANSPORT|
|20|AF\_MUSICAL\_INSTRUMENT|
|21|AF\_WEAPON|
|22|AFA\_DOCUMENT|
|23|AFA\_PERFORMANCE|
|24|AFA\_VIDEO|
|25|AFA\_ART\_CRAFT|
|26|AFA\_MUSIC|
|27|AFW\_SERVICE\_PRODUCTS|
|28|AFW\_OTHER\_PRODUCTS|
|29|OGG\_ECONOMY|
|30|OGG\_EDUCATION|
|31|OGG\_MILITARY|
|32|OGG\_MEDIA|
|33|OGG\_SPORTS|
|34|OGG\_ART|
|35|OGG\_MEDICINE|
|36|OGG\_RELIGION|
|37|OGG\_SCIENCE|
|38|OGG\_LIBRARY|
|39|OGG\_LAW|
|40|OGG\_POLITICS|
|41|OGG\_FOOD|
|42|OGG\_HOTEL|
|43|OGG\_OTHERS|
|44|LCP\_COUNTRY|
|45|LCP\_PROVINCE|
|46|LCP\_COUNTY|
|47|LCP\_CITY|
|48|LCP\_CAPITALCITY|
|49|LCG\_RIVER|
|50|LCG\_OCEAN|
|51|LCG\_BAY|
|52|LCG\_MOUNTAIN|
|53|LCG\_ISLAND|
|54|LCG\_CONTINENT|
|55|LC\_SPACE|
|56|LC\_OTHERS|
|57|CV\_CULTURE|
|58|CV\_TRIBE|
|59|CV\_LANGUAGE|
|60|CV\_POLICY|
|61|CV\_LAW|
|62|CV\_CURRENCY|
|63|CV\_TAX|
|64|CV\_FUNDS|
|65|CV\_ART|
|66|CV\_SPORTS|
|67|CV\_SPORTS\_POSITION|
|68|CV\_SPORTS\_INST|
|69|CV\_PRIZE|
|70|CV\_RELATION|
|71|CV\_OCCUPATION|
|72|CV\_POSITION|
|73|CV\_FOOD|
|74|CV\_DRINK|
|75|CV\_FOOD\_STYLE|
|76|CV\_CLOTHING|
|77|CV\_BUILDING\_TYPE|
|78|DT\_DURATION|
|79|DT\_DAY|
|80|DT\_WEEK|
|81|DT\_MONTH|
|82|DT\_YEAR|
|83|DT\_SEASON|
|84|DT\_GEOAGE|
|85|DT\_DYNASTY|
|86|DT\_OTHERS|
|87|TI\_DURATION|
|88|TI\_HOUR|
|89|TI\_MINUTE|
|90|TI\_SECOND|
|91|TI\_OTHERS|
|92|QT\_AGE|
|93|QT\_SIZE|
|94|QT\_LENGTH|
|95|QT\_COUNT|
|96|QT\_MAN\_COUNT|
|97|QT\_WEIGHT|
|98|QT\_PERCENTAGE|
|99|QT\_SPEED|
|100|QT\_TEMPERATURE|
|101|QT\_VOLUME|
|102|QT\_ORDER|
|103|QT\_PRICE|
|104|QT\_PHONE|
|105|QT\_SPORTS|
|106|QT\_CHANNEL|
|107|QT\_ALBUM|
|108|QT\_ADDRESS|
|109|QT\_OTHERS|
|110|EV\_ACTIVITY|
|111|EV\_WAR\_REVOLUTION|
|112|EV\_SPORTS|
|113|EV\_FESTIVAL|
|114|EV\_OTHERS|
|115|AM\_INSECT|
|116|AM\_BIRD|
|117|AM\_FISH|
|118|AM\_MAMMALIA|
|119|AM\_AMPHIBIA|
|120|AM\_REPTILIA|
|121|AM\_TYPE|
|122|AM\_PART|
|123|AM\_OTHERS|
|124|PT\_FRUIT|
|125|PT\_FLOWER|
|126|PT\_TREE|
|127|PT\_GRASS|
|128|PT\_TYPE|
|129|PT\_PART|
|130|PT\_OTHERS|
|131|MT\_ELEMENT|
|132|MT\_METAL|
|133|MT\_ROCK|
|134|MT\_CHEMICAL|
|135|TM\_COLOR|
|136|TM\_DIRECTION|
|137|TM\_CLIMATE|
|138|TM\_SHAPE|
|139|TM\_CELL\_TISSUE\_ORGAN|
|140|TMM\_DISEASE|
|141|TMM\_DRUG|
|142|TMI\_HW|
|143|TMI\_SW|
|144|TMI\_SITE|
|145|TMI\_EMAIL|
|146|TMI\_MODEL|
|147|TMI\_SERVICE|
|148|TMI\_PROJECT|
|149|TMIG\_GENRE|
|150|TM\_SPORTS|

### How to Use

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("yeajinmin/NER-NewsBI-150142-e3b4")
model = AutoModelForTokenClassification.from_pretrained("yeajinmin/NER-NewsBI-150142-e3b4")

# Build a token-classification pipeline from the loaded model and tokenizer
nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer)

# "I told my American friend Lisa that Seoul subway Line 1 alone cannot take you to Daegu."
text = "미국인 친구 Lisa에게 서울의 지하철 1호선만으로는 대구에 갈 수 없다고 알려주었다."
results = nlp_ner(text)
print(results)

# For tabular output (DataFrame.to_markdown requires the tabulate package)
import pandas as pd

df = pd.DataFrame(
    [(result['word'], result['entity']) for result in results],
    columns=["word", "entity"],
)
print(df.to_markdown(index=False))
```

Classifying the 150 NER labels requires the `index2tag` and `tag2index` mappings, which are defined as follows:

```python
from datasets import ClassLabel, Features

label_mapping = {
    'O': 0,
    'PS_NAME': 1, 'PS_CHARACTER': 2, 'PS_PET': 3,
    'FD_SCIENCE': 4, 'FD_SOCIAL_SCIENCE': 5, 'FD_MEDICINE': 6, 'FD_ART': 7,
    'FD_HUMANITIES': 8, 'FD_OTHERS': 9,
    'TR_SCIENCE': 10, 'TR_SOCIAL_SCIENCE': 11, 'TR_MEDICINE': 12, 'TR_ART': 13,
    'TR_HUMANITIES': 14, 'TR_OTHERS': 15,
    'AF_BUILDING': 16, 'AF_CULTURAL_ASSET': 17, 'AF_ROAD': 18, 'AF_TRANSPORT': 19,
    'AF_MUSICAL_INSTRUMENT': 20, 'AF_WEAPON': 21,
    'AFA_DOCUMENT': 22, 'AFA_PERFORMANCE': 23, 'AFA_VIDEO': 24, 'AFA_ART_CRAFT': 25,
    'AFA_MUSIC': 26,
    'AFW_SERVICE_PRODUCTS': 27, 'AFW_OTHER_PRODUCTS': 28,
    'OGG_ECONOMY': 29, 'OGG_EDUCATION': 30, 'OGG_MILITARY': 31, 'OGG_MEDIA': 32,
    'OGG_SPORTS': 33, 'OGG_ART': 34, 'OGG_MEDICINE': 35, 'OGG_RELIGION': 36,
    'OGG_SCIENCE': 37, 'OGG_LIBRARY': 38, 'OGG_LAW': 39, 'OGG_POLITICS': 40,
    'OGG_FOOD': 41, 'OGG_HOTEL': 42, 'OGG_OTHERS': 43,
    'LCP_COUNTRY': 44, 'LCP_PROVINCE': 45, 'LCP_COUNTY': 46, 'LCP_CITY': 47,
    'LCP_CAPITALCITY': 48,
    'LCG_RIVER': 49, 'LCG_OCEAN': 50, 'LCG_BAY': 51, 'LCG_MOUNTAIN': 52,
    'LCG_ISLAND': 53, 'LCG_CONTINENT': 54,
    'LC_SPACE': 55, 'LC_OTHERS': 56,
    'CV_CULTURE': 57, 'CV_TRIBE': 58, 'CV_LANGUAGE': 59, 'CV_POLICY': 60,
    'CV_LAW': 61, 'CV_CURRENCY': 62, 'CV_TAX': 63, 'CV_FUNDS': 64, 'CV_ART': 65,
    'CV_SPORTS': 66, 'CV_SPORTS_POSITION': 67, 'CV_SPORTS_INST': 68, 'CV_PRIZE': 69,
    'CV_RELATION': 70, 'CV_OCCUPATION': 71, 'CV_POSITION': 72, 'CV_FOOD': 73,
    'CV_DRINK': 74, 'CV_FOOD_STYLE': 75, 'CV_CLOTHING': 76, 'CV_BUILDING_TYPE': 77,
    'DT_DURATION': 78, 'DT_DAY': 79, 'DT_WEEK': 80, 'DT_MONTH': 81, 'DT_YEAR': 82,
    'DT_SEASON': 83, 'DT_GEOAGE': 84, 'DT_DYNASTY': 85, 'DT_OTHERS': 86,
    'TI_DURATION': 87, 'TI_HOUR': 88, 'TI_MINUTE': 89, 'TI_SECOND': 90, 'TI_OTHERS': 91,
    'QT_AGE': 92, 'QT_SIZE': 93, 'QT_LENGTH': 94, 'QT_COUNT': 95, 'QT_MAN_COUNT': 96,
    'QT_WEIGHT': 97, 'QT_PERCENTAGE': 98, 'QT_SPEED': 99, 'QT_TEMPERATURE': 100,
    'QT_VOLUME': 101, 'QT_ORDER': 102, 'QT_PRICE': 103, 'QT_PHONE': 104,
    'QT_SPORTS': 105, 'QT_CHANNEL': 106, 'QT_ALBUM': 107, 'QT_ADDRESS': 108,
    'QT_OTHERS': 109,
    'EV_ACTIVITY': 110, 'EV_WAR_REVOLUTION': 111, 'EV_SPORTS': 112, 'EV_FESTIVAL': 113,
    'EV_OTHERS': 114,
    'AM_INSECT': 115, 'AM_BIRD': 116, 'AM_FISH': 117, 'AM_MAMMALIA': 118,
    'AM_AMPHIBIA': 119, 'AM_REPTILIA': 120, 'AM_TYPE': 121, 'AM_PART': 122,
    'AM_OTHERS': 123,
    'PT_FRUIT': 124, 'PT_FLOWER': 125, 'PT_TREE': 126, 'PT_GRASS': 127, 'PT_TYPE': 128,
    'PT_PART': 129, 'PT_OTHERS': 130,
    'MT_ELEMENT': 131, 'MT_METAL': 132, 'MT_ROCK': 133, 'MT_CHEMICAL': 134,
    'TM_COLOR': 135, 'TM_DIRECTION': 136, 'TM_CLIMATE': 137, 'TM_SHAPE': 138,
    'TM_CELL_TISSUE_ORGAN': 139,
    'TMM_DISEASE': 140, 'TMM_DRUG': 141,
    'TMI_HW': 142, 'TMI_SW': 143, 'TMI_SITE': 144, 'TMI_EMAIL': 145, 'TMI_MODEL': 146,
    'TMI_SERVICE': 147, 'TMI_PROJECT': 148,
    'TMIG_GENRE': 149, 'TM_SPORTS': 150,
}

# Expand each entity label into B- (beginning) and I- (inside) tags.
# 'O' keeps index 0, B- tags take indices 1-150, and I- tags take 151-300.
new_label_mapping = {}
for key, value in label_mapping.items():
    if key == 'O':
        new_label_mapping[key] = value
        continue
    new_label_mapping['B-' + key] = value
    new_label_mapping['I-' + key] = value + 150

# Sort new_label_mapping by index so the key order matches the class ids
new_label_mapping = dict(sorted(new_label_mapping.items(), key=lambda item: item[1]))

features = Features({'label': ClassLabel(num_classes=301, names=list(new_label_mapping.keys()))})
tags = features['label']
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}
```

### Extended Usage Idea

Because this model was trained on a news dataset, it can be used to search news articles. This is especially useful when the user does not know the exact name of an entity. A query that combines an entity label with a predicate finds relevant cases without knowing any entity name in advance.

For example, to find cases where a man-made building burned down, search for 'AF_BUILDING' + 'burned down' to see the actual cases along with the names of the buildings. A predicate-only search for 'burned' would also return non-building cases such as forest fires.

Similarly, to find cases where two countries signed an agreement, a query such as 'LCP_COUNTRY' + 'entered into an agreement' returns the actual cases together with the country names. Users can therefore search for articles by context, even without knowing which countries were involved.

## Performance

Evaluation used the first 10,000 examples of the 'test' split of each dataset linked below.

```python
from datasets import load_dataset

# Dataset for this model; the other models use their own datasets listed below
dataset = load_dataset("yeajinmin/NER-News-BIDataset")
ds = dataset['test']
sliceds = ds.select(range(10000))  # first 10,000 test examples
```

- NER-NewsBI-150142-e3b4: https://huggingface.co/datasets/yeajinmin/NER-News-BIDataset
- KcBERT: https://huggingface.co/datasets/yeajinmin/News-NER-dataset-ForKCBERT
- KoGPT2: https://huggingface.co/datasets/yeajinmin/News-NER-dataset-ForKoGPT2

|Model|Precision|Recall|F1 score|
|--------|----|----|----|
|**NER-NewsBI-150142-e3b4**|**0.9208**|**0.9243**|**0.9225**|
|KcBERT|0.9105|0.9197|0.9151|
|KoGPT2|0.8032|0.8224|0.8127|

The other models trained for this comparison are available at the links below:

- KcBERT: https://huggingface.co/yeajinmin/NER-News-kcbert-150142-e3b4
- KoGPT2: https://huggingface.co/yeajinmin/NER-News-KoGPT2-150142-e3b4
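
The label-plus-predicate search described under Extended Usage Idea can be sketched in a few lines. This is a minimal illustration, not the author's implementation. It assumes articles have already been tagged and stored as (word, tag) pairs that use the B-/I- tags from `index2tag`. The `articles` data and the helper names `entities_with_label` and `search_articles` are hypothetical.

```python
def entities_with_label(tagged_tokens, label):
    """Join consecutive B-/I- tokens of one label into entity strings."""
    entities, current = [], []
    for word, tag in tagged_tokens:
        if tag == f"B-{label}":
            # A new entity starts; flush any entity still being built
            if current:
                entities.append(" ".join(current))
            current = [word]
        elif tag == f"I-{label}" and current:
            # Continuation of the entity being built
            current.append(word)
        else:
            # Any other tag ends the current entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


def search_articles(articles, label, predicate):
    """Return (text, entities) for articles that contain the predicate
    and at least one entity tagged with the given label."""
    hits = []
    for article in articles:
        if predicate not in article["text"]:
            continue
        found = entities_with_label(article["tokens"], label)
        if found:
            hits.append((article["text"], found))
    return hits


# Hypothetical, hand-tagged toy data (not model output)
articles = [
    {
        "text": "Hanbit Tower burned down on Monday.",
        "tokens": [("Hanbit", "B-AF_BUILDING"), ("Tower", "I-AF_BUILDING"),
                   ("burned", "O"), ("down", "O"), ("on", "O"),
                   ("Monday", "O"), (".", "O")],
    },
    {
        "text": "A forest burned down after a lightning strike.",
        "tokens": [(w, "O") for w in
                   "A forest burned down after a lightning strike .".split()],
    },
]

hits = search_articles(articles, "AF_BUILDING", "burned down")
for text, entities in hits:
    print(entities, "<-", text)
```

Only the first toy article is returned, because the forest-fire article contains the predicate but no `AF_BUILDING` entity. Plain substring matching stands in for a real search index here; Korean text would likely need morphological normalization of the predicate before matching.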