
Model Details

Model Description

NER-NewsBI-150142-e3b4 recognizes named entities in input sentences and assigns each one a label from a set of 150 entity types.
Because it was trained on a news dataset, it is best suited to news articles.

Because the base model is multilingual, this model can also tag entities with the same 150 labels in other languages, even though it was fine-tuned only on Korean data.
The supported languages are those listed for the base model.

Training scores

Epoch Training Loss Validation Loss F1
1 0.237400 0.213017 0.791144
2 0.177400 0.174727 0.839951
3 0.119500 0.157669 0.862055

TrainOutput(global_step=90087, training_loss=0.19955111364530848, metrics={'train_runtime': 11692.8865, 'train_samples_per_second': 30.817, 'train_steps_per_second': 7.704, 'total_flos': 4.889673580336036e+16, 'train_loss': 0.19955111364530848, 'epoch': 3.0})
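
The logged throughput figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers printed above and assumes a single device with no gradient accumulation, which the log does not confirm. Under that assumption, the ratio of samples per second to steps per second gives the per-step batch size.

```python
# Values copied from the TrainOutput log above.
global_step = 90087
epochs = 3.0
train_runtime = 11692.8865          # seconds
samples_per_second = 30.817
steps_per_second = 7.704

# Assumption: one device and no gradient accumulation,
# so samples per step equals the batch size.
batch_size = round(samples_per_second / steps_per_second)
steps_per_epoch = global_step / epochs
approx_samples_per_epoch = steps_per_epoch * batch_size

# The logged step rate should roughly reproduce the step count.
steps_from_rate = steps_per_second * train_runtime

print(f"batch size        ~ {batch_size}")
print(f"steps per epoch   = {steps_per_epoch:.0f}")
print(f"samples per epoch ~ {approx_samples_per_epoch:.0f}")
print(f"runtime           ~ {train_runtime / 3600:.2f} h")
```

The sample count per epoch is only approximate, since the last batch of an epoch may be smaller than the others.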

Uses

Main Use

The 151 labels the model can assign (150 entity types plus O for tokens outside any entity) are listed in the table below.

index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
Label O PS_NAME PS_CHARACTER PS_PET FD_SCIENCE FD_SOCIAL_SCIENCE FD_MEDICINE FD_ART FD_HUMANITIES FD_OTHERS TR_SCIENCE TR_SOCIAL_SCIENCE TR_MEDICINE TR_ART TR_HUMANITIES TR_OTHERS AF_BUILDING AF_CULTURAL_ASSET AF_ROAD AF_TRANSPORT AF_MUSICAL_INSTRUMENT AF_WEAPON AFA_DOCUMENT AFA_PERFORMANCE AFA_VIDEO AFA_ART_CRAFT AFA_MUSIC AFW_SERVICE_PRODUCTS AFW_OTHER_PRODUCTS OGG_ECONOMY OGG_EDUCATION OGG_MILITARY OGG_MEDIA OGG_SPORTS OGG_ART OGG_MEDICINE OGG_RELIGION OGG_SCIENCE OGG_LIBRARY OGG_LAW OGG_POLITICS OGG_FOOD OGG_HOTEL OGG_OTHERS LCP_COUNTRY LCP_PROVINCE LCP_COUNTY LCP_CITY LCP_CAPITALCITY LCG_RIVER LCG_OCEAN LCG_BAY LCG_MOUNTAIN LCG_ISLAND LCG_CONTINENT LC_SPACE LC_OTHERS CV_CULTURE CV_TRIBE CV_LANGUAGE CV_POLICY CV_LAW CV_CURRENCY CV_TAX CV_FUNDS CV_ART CV_SPORTS CV_SPORTS_POSITION CV_SPORTS_INST CV_PRIZE CV_RELATION CV_OCCUPATION CV_POSITION CV_FOOD CV_DRINK CV_FOOD_STYLE CV_CLOTHING CV_BUILDING_TYPE DT_DURATION DT_DAY DT_WEEK DT_MONTH DT_YEAR DT_SEASON DT_GEOAGE DT_DYNASTY DT_OTHERS TI_DURATION TI_HOUR TI_MINUTE TI_SECOND TI_OTHERS QT_AGE QT_SIZE QT_LENGTH QT_COUNT QT_MAN_COUNT QT_WEIGHT QT_PERCENTAGE QT_SPEED QT_TEMPERATURE QT_VOLUME QT_ORDER QT_PRICE QT_PHONE QT_SPORTS QT_CHANNEL QT_ALBUM QT_ADDRESS QT_OTHERS EV_ACTIVITY EV_WAR_REVOLUTION EV_SPORTS EV_FESTIVAL EV_OTHERS AM_INSECT AM_BIRD AM_FISH AM_MAMMALIA AM_AMPHIBIA AM_REPTILIA AM_TYPE AM_PART AM_OTHERS PT_FRUIT PT_FLOWER PT_TREE PT_GRASS PT_TYPE PT_PART PT_OTHERS MT_ELEMENT MT_METAL MT_ROCK MT_CHEMICAL TM_COLOR TM_DIRECTION TM_CLIMATE TM_SHAPE TM_CELL_TISSUE_ORGAN TMM_DISEASE TMM_DRUG TMI_HW TMI_SW TMI_SITE TMI_EMAIL TMI_MODEL TMI_SERVICE TMI_PROJECT TMIG_GENRE TM_SPORTS

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("yeajinmin/NER-NewsBI-150142-e3b4")
model = AutoModelForTokenClassification.from_pretrained("yeajinmin/NER-NewsBI-150142-e3b4")

from transformers import pipeline

nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer)

text = "미국인 친구 Lisa에게 서울의 지하철 1호선만으로는 대구에 갈 수 없다고 알려주었다."
results = nlp_ner(text)

print(results)

# For tabular output (DataFrame.to_markdown requires the tabulate package)
import pandas as pd

df = pd.DataFrame([(result['word'], result['entity']) for result in results], columns=["Word", "Entity"])

print(df.to_markdown(index=False))

Mapping predictions to the 150 NER labels requires index2tag and tag2index, which are defined as follows:

label_mapping = {'O': 0, 'PS_NAME': 1, 'PS_CHARACTER': 2, 'PS_PET': 3,
                 'FD_SCIENCE': 4, 'FD_SOCIAL_SCIENCE': 5, 'FD_MEDICINE': 6, 'FD_ART':7, 'FD_HUMANITIES': 8, 'FD_OTHERS': 9,
                 'TR_SCIENCE': 10, 'TR_SOCIAL_SCIENCE': 11, 'TR_MEDICINE': 12, 'TR_ART': 13, 'TR_HUMANITIES': 14, 'TR_OTHERS': 15,
                 'AF_BUILDING': 16, 'AF_CULTURAL_ASSET': 17, 'AF_ROAD': 18, 'AF_TRANSPORT': 19, 'AF_MUSICAL_INSTRUMENT': 20,
                 'AF_WEAPON': 21, 'AFA_DOCUMENT': 22, 'AFA_PERFORMANCE': 23, 'AFA_VIDEO': 24, 'AFA_ART_CRAFT': 25, 'AFA_MUSIC': 26, "AFW_SERVICE_PRODUCTS": 27, 'AFW_OTHER_PRODUCTS': 28,
                 'OGG_ECONOMY': 29, 'OGG_EDUCATION': 30, 'OGG_MILITARY': 31, 'OGG_MEDIA': 32, 'OGG_SPORTS': 33, 'OGG_ART': 34, 'OGG_MEDICINE': 35, 'OGG_RELIGION': 36, 'OGG_SCIENCE': 37, 'OGG_LIBRARY':38,
                 'OGG_LAW': 39, 'OGG_POLITICS': 40, 'OGG_FOOD': 41, 'OGG_HOTEL': 42, 'OGG_OTHERS': 43,
                 'LCP_COUNTRY': 44, 'LCP_PROVINCE': 45, 'LCP_COUNTY':46, 'LCP_CITY': 47, 'LCP_CAPITALCITY': 48, 'LCG_RIVER': 49, 'LCG_OCEAN': 50,
                 'LCG_BAY': 51, 'LCG_MOUNTAIN':52, 'LCG_ISLAND': 53, 'LCG_CONTINENT': 54, 'LC_SPACE': 55, 'LC_OTHERS': 56,
                 'CV_CULTURE': 57, 'CV_TRIBE': 58, 'CV_LANGUAGE': 59, 'CV_POLICY': 60,
                 'CV_LAW': 61, 'CV_CURRENCY': 62, 'CV_TAX': 63, 'CV_FUNDS': 64, 'CV_ART': 65, 'CV_SPORTS': 66, 'CV_SPORTS_POSITION': 67, 'CV_SPORTS_INST': 68, 'CV_PRIZE': 69, 'CV_RELATION': 70,
                 'CV_OCCUPATION': 71, 'CV_POSITION': 72, 'CV_FOOD': 73, 'CV_DRINK': 74, 'CV_FOOD_STYLE': 75, 'CV_CLOTHING': 76, 'CV_BUILDING_TYPE': 77,
                 'DT_DURATION': 78, 'DT_DAY': 79, 'DT_WEEK':80, 'DT_MONTH': 81, 'DT_YEAR': 82, 'DT_SEASON': 83, 'DT_GEOAGE': 84, 'DT_DYNASTY': 85, 'DT_OTHERS': 86,
                 'TI_DURATION': 87, 'TI_HOUR':88, 'TI_MINUTE': 89, 'TI_SECOND': 90, 'TI_OTHERS': 91,
                 'QT_AGE': 92, 'QT_SIZE': 93, 'QT_LENGTH': 94, 'QT_COUNT': 95, 'QT_MAN_COUNT': 96, 'QT_WEIGHT': 97, 'QT_PERCENTAGE': 98, 'QT_SPEED': 99, 'QT_TEMPERATURE': 100,
                 'QT_VOLUME': 101, 'QT_ORDER': 102, 'QT_PRICE': 103, 'QT_PHONE': 104, 'QT_SPORTS': 105, 'QT_CHANNEL': 106, 'QT_ALBUM': 107, 'QT_ADDRESS': 108, 'QT_OTHERS': 109,
                 'EV_ACTIVITY': 110, 'EV_WAR_REVOLUTION': 111, 'EV_SPORTS': 112, 'EV_FESTIVAL': 113, 'EV_OTHERS': 114,
                 'AM_INSECT': 115, 'AM_BIRD': 116, 'AM_FISH': 117, 'AM_MAMMALIA': 118, 'AM_AMPHIBIA': 119, 'AM_REPTILIA': 120, 'AM_TYPE': 121, 'AM_PART': 122, 'AM_OTHERS': 123,
                 'PT_FRUIT': 124, 'PT_FLOWER': 125, 'PT_TREE': 126, 'PT_GRASS': 127, 'PT_TYPE': 128, 'PT_PART': 129, 'PT_OTHERS': 130,
                 'MT_ELEMENT': 131, 'MT_METAL': 132, 'MT_ROCK':133, 'MT_CHEMICAL': 134,
                 'TM_COLOR': 135, 'TM_DIRECTION': 136, 'TM_CLIMATE': 137, 'TM_SHAPE': 138, 'TM_CELL_TISSUE_ORGAN': 139, 'TMM_DISEASE': 140, 'TMM_DRUG': 141, 'TMI_HW':142, 'TMI_SW': 143, 'TMI_SITE': 144, 'TMI_EMAIL': 145,
                 'TMI_MODEL': 146, 'TMI_SERVICE': 147, 'TMI_PROJECT': 148, 'TMIG_GENRE': 149, 'TM_SPORTS': 150}

# Expand each entity label into B- (beginning) and I- (inside) variants; 'O' stays as is
new_label_mapping = {}
for key, value in label_mapping.items():
    if key == 'O':
        new_label_mapping[key] = value
        continue
    new_key_b = 'B-' + key
    new_key_i = 'I-' + key
    new_label_mapping[new_key_b] = value
    new_label_mapping[new_key_i] = value + 150

# Sort the new_label_mapping by values
new_label_mapping = {k: v for k, v in sorted(new_label_mapping.items(), key=lambda item: item[1])}

from datasets import Features, ClassLabel

features = Features({'label': ClassLabel(num_classes=301, names=list(new_label_mapping.keys()))})

tags = features['label']

index2tag = {idx:tag for idx, tag in enumerate(tags.names)}
tag2index = {tag:idx for idx, tag in enumerate(tags.names)}
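
If the checkpoint's config does not carry these tag names, the pipeline may report generic ids such as `LABEL_151` in the `entity` field. The sketch below assumes that naming, and that the ids follow the layout built above (0 for O, then the B- tags, then the I- tags). `build_index2tag` and `decode_entity` are hypothetical helpers, and the three-label mapping is a cut-down stand-in for the full table.

```python
def build_index2tag(label_mapping):
    """Rebuild the BIO id layout used above: O, then all B- tags, then all I- tags."""
    n_entities = len(label_mapping) - 1          # every label except 'O'
    index2tag = {0: "O"}
    for name, idx in label_mapping.items():
        if name == "O":
            continue
        index2tag[idx] = "B-" + name
        index2tag[idx + n_entities] = "I-" + name
    return index2tag


def decode_entity(entity, index2tag):
    """Map a generic 'LABEL_<n>' string to its tag; pass other strings through."""
    if entity.startswith("LABEL_"):
        return index2tag[int(entity[len("LABEL_"):])]
    return entity


# Cut-down stand-in for the full 151-entry label_mapping.
small_mapping = {"O": 0, "PS_NAME": 1, "LCP_CITY": 2}
small_index2tag = build_index2tag(small_mapping)

# Hypothetical pipeline output, reduced to the two fields used here.
fake_results = [
    {"word": "Lisa", "entity": "LABEL_1"},
    {"word": "서울", "entity": "LABEL_2"},
    {"word": "의", "entity": "LABEL_0"},
]
decoded = [(r["word"], decode_entity(r["entity"], small_index2tag)) for r in fake_results]
print(decoded)
```

With the full `label_mapping`, `n_entities` is 150, so the I- tags start at id 151, matching the `value + 150` offset used above.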

Extended Usage Idea

Because it was trained on a news dataset, this model can also support news article search.
This is especially useful when the user does not know the exact name of the entity they are looking for.
A query that combines an entity label with a predicate finds relevant cases without requiring the entity's name.
For example, to find cases where a man-made building burned down, search for 'AF_BUILDING' + 'burned down' to see the actual cases and the names of the buildings.
A predicate-only search for 'burned' would also return non-building cases such as forest fires.
Likewise, to find cases where two countries signed an agreement, a query such as 'LCP_COUNTRY' + 'entered into an agreement' returns the actual cases along with the country names. Users can thus search for articles by 'context' even without knowing which countries were involved.
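
The 'entity label' + 'predicate' query can be sketched as a filter over sentences that have already been tagged. The example below is a toy illustration, not the author's search system: `search`, the `tagged_sentences` structure, and the English sample sentences are all assumptions made for the sketch, and predicate matching is a plain substring test.

```python
def search(tagged_sentences, label, predicate):
    """Return (sentence, matching entity words) pairs for sentences that
    contain the predicate text and at least one entity with the given label."""
    hits = []
    for sentence, entities in tagged_sentences:
        if predicate not in sentence:
            continue
        names = [word for word, tag in entities if tag == label]
        if names:
            hits.append((sentence, names))
    return hits


# Hypothetical, hand-tagged sentences standing in for model output.
tagged_sentences = [
    ("The old library burned down overnight.", [("library", "AF_BUILDING")]),
    ("A forest on the ridge burned down last week.", [("ridge", "LC_OTHERS")]),
    ("The new stadium opened on Friday.", [("stadium", "AF_BUILDING"), ("Friday", "DT_DAY")]),
]

hits = search(tagged_sentences, "AF_BUILDING", "burned down")
for sentence, names in hits:
    print(names, "->", sentence)
```

Two of the sample sentences contain the predicate, but only the one that also carries an AF_BUILDING entity is returned, which is the filtering effect described above.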

Performance

Dataset used for evaluation
The first 10,000 examples of the 'test' split of the dataset in the link below were used.

ds = dataset['test']
# Keep the first 10,000 test examples for evaluation
sliceds = ds.select(range(10000))

Model Precision Recall F1 score
NER-NewsBI-150142-e3b4 0.9208 0.9243 0.9225
KcBERT 0.9105 0.9197 0.9151
KoGPT2 0.8032 0.8224 0.8127
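
The F1 column can be cross-checked against the other two columns, since F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal check using only the figures in the table:

```python
# (precision, recall, reported F1) copied from the table above.
reported = {
    "NER-NewsBI-150142-e3b4": (0.9208, 0.9243, 0.9225),
    "KcBERT": (0.9105, 0.9197, 0.9151),
    "KoGPT2": (0.8032, 0.8224, 0.8127),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, (p, r, f1) in reported.items():
    print(f"{name}: computed {f1_score(p, r):.4f}, reported {f1:.4f}")
```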

If you would like to check other models trained for evaluation, check the link below:

Model size: 559M params (Safetensors, tensor type F32)

Dataset used to train yeajinmin/NER-NewsBI-150142-e3b4