license: mit
base_model:
- google-bert/bert-base-multilingual-uncased
tags:
- ner
- indonesian
- bert
language:
- id
library_name: transformers
ner-bert-indonesian-v1
Model Description
ner-bert-indonesian-v1 is google-bert/bert-base-multilingual-uncased fine-tuned for named-entity recognition (NER) in Indonesian. Version 1 of the model labels each token with one of the following 4 classes:
- O (other): tokens that are not part of an entity the model recognizes - Lainnya
- Person - Orang
- Organisation - Organisasi
- Place - Tempat/Lokasi
Usage
Using pipelines
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained('wuriyanto/ner-bert-indonesian-v1')
model = AutoModelForTokenClassification.from_pretrained('wuriyanto/ner-bert-indonesian-v1')
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "OpenAI adalah laboratorium penelitian kecerdasan buatan yang terdiri atas perusahaan waralaba OpenAI LP dan perusahaan induk nirlabanya, OpenAI Inc. Para pendirinya (sam altman) terdorong oleh ketakutan mereka akan kemungkinan bahwa kecerdasan buatan dapat mengancam keberadaan manusia, perusahaan ini ada di amerika serikat. PT. Indodana , salah satu perusahaan di Indonesia mulai mengadopsi teknologi ini."
ner_results = nlp(example)
for n in ner_results:
    print(n)
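The pipeline returns one dictionary per word piece, so a name such as "sam altman" can come back as several entries. A minimal sketch of grouping adjacent entries that share a label into spans follows. It assumes the usual token-classification output keys (`entity`, `word`, `start`, `end`); the `group_entities` helper and the `sample` list are hypothetical, written by hand for illustration, and are not actual output of this model.

```python
def group_entities(text, token_results, max_gap=1):
    """Merge adjacent tokens that share a label into entity spans.

    token_results: list of dicts with 'entity', 'start' and 'end' keys.
    max_gap: largest number of characters allowed between two tokens
             of the same span (1 allows a single space).
    """
    spans = []
    for tok in token_results:
        label = tok["entity"]
        if label == "O":
            continue
        last = spans[-1] if spans else None
        if last and last["entity"] == label and tok["start"] - last["end"] <= max_gap:
            # Same label and directly adjacent: extend the current span.
            last["end"] = tok["end"]
        else:
            spans.append({"entity": label, "start": tok["start"], "end": tok["end"]})
    # Recover the surface text of each span from the character offsets.
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


# Hand-written sample in the pipeline's output shape (an assumption, not real model output).
text = "sam altman ada di amerika serikat"
sample = [
    {"entity": "Person", "word": "sam", "start": 0, "end": 3},
    {"entity": "Person", "word": "alt", "start": 4, "end": 7},
    {"entity": "Person", "word": "##man", "start": 7, "end": 10},
    {"entity": "Place", "word": "amerika", "start": 18, "end": 25},
    {"entity": "Place", "word": "serikat", "start": 26, "end": 33},
]

for span in group_entities(text, sample):
    print(span["entity"], "->", span["text"])
```

The pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs a similar grouping internally. How well it fits this model's label scheme has not been verified here.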
Using a custom parser
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
id_to_label = {0: 'O', 1: 'Place', 2: 'Organisation', 3: 'Person'}
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('wuriyanto/ner-bert-indonesian-v1')
model = AutoModelForTokenClassification.from_pretrained('wuriyanto/ner-bert-indonesian-v1')

def tokenize_input(sentence):
    tokenized_input = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    return tokenized_input


def predict_ner(sentence):
    inputs = tokenize_input(sentence)
    with torch.no_grad():
        outputs = model(**inputs)

    # Pick the highest-scoring label id for every token
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=2)

    # Convert predictions and tokens back to readable format
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    predicted_labels = [id_to_label[p.item()] for p in predictions[0]]

    # Merge subwords and filter out special tokens
    merged_tokens, merged_labels = [], []
    current_token, current_label = "", None

    for token, label in zip(tokens, predicted_labels):
        # Debug output: raw word piece and its predicted label
        print(token, ' ', label)
        # Skip special tokens and punctuation (like [CLS], [SEP], commas, and periods)
        if token in ["[CLS]", "[SEP]"] or (label == "O" and token in [",", "."]):
            continue

        if token.startswith("##"):
            # Continuation piece: append it to the word being built
            current_token += token[2:]
            # If the word was labelled 'O' so far, take this piece's label instead
            if current_label == 'O':
                current_label = label
        else:
            # A new word starts: store the previous word, if any
            if current_token:
                merged_tokens.append(current_token)
                merged_labels.append(current_label)
            current_token = token
            current_label = label

    # Store the last word
    if current_token:
        merged_tokens.append(current_token)
        merged_labels.append(current_label)

    results = list(zip(merged_tokens, merged_labels))
    return results

sentence = "OpenAI adalah laboratorium penelitian kecerdasan buatan yang terdiri atas perusahaan waralaba OpenAI LP dan perusahaan induk nirlabanya, OpenAI Inc. Para pendirinya (sam altman) terdorong oleh ketakutan mereka akan kemungkinan bahwa kecerdasan buatan dapat mengancam keberadaan manusia, perusahaan ini ada di amerika serikat. PT. Indodana , salah satu perusahaan di Indonesia mulai mengadopsi teknologi ini."
results = predict_ner(sentence)
print(results)
for token, label in results:
    print(f"{token}: {label}")
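The custom parser returns one (word, label) pair per word, so a multi-word entity is still spread over several pairs. A minimal sketch of joining consecutive words that share a label into phrases follows. The `merge_entity_words` helper and the `sample_results` list are hypothetical and hand-written in the shape that `predict_ner` returns; they are not actual output of this model.

```python
from itertools import groupby


def merge_entity_words(pairs):
    """Join runs of consecutive words that share the same non-'O' label."""
    entities = []
    # groupby yields one group per run of equal consecutive labels.
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label == "O":
            continue
        phrase = " ".join(word for word, _ in run)
        entities.append((phrase, label))
    return entities


# Hand-written sample in the (word, label) shape (an assumption, not real model output).
sample_results = [
    ("openai", "Organisation"), ("lp", "Organisation"), ("dan", "O"),
    ("sam", "Person"), ("altman", "Person"), ("di", "O"),
    ("amerika", "Place"), ("serikat", "Place"),
]

print(merge_entity_words(sample_results))
```

Because the labels carry no begin/inside markers, two distinct entities of the same type that appear back to back would be merged into one phrase by this approach.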
Dataset and citation info
@article{DBLP:journals/corr/abs-1810-04805,
author = {Jacob Devlin and
Ming{-}Wei Chang and
Kenton Lee and
Kristina Toutanova},
title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
Understanding},
journal = {CoRR},
volume = {abs/1810.04805},
year = {2018},
url = {http://arxiv.org/abs/1810.04805},
archivePrefix = {arXiv},
eprint = {1810.04805},
timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
The DEE NER dataset: Ika Alfina, Ruli Manurung, and Mohamad Ivan Fanany, "DBpedia Entities Expansion in Automatically Building Dataset for Indonesian NER", in Proceedings of the 8th International Conference on Advanced Computer Science and Information Systems 2016 (ICACSIS 2016).
The MDEE and Singgalang NER dataset: Ika Alfina, Septiviana Savitri, and Mohamad Ivan Fanany, "Modified DBpedia Entities Expansion for Tagging Automatically NER Dataset", in Proceedings of the 9th International Conference on Advanced Computer Science and Information Systems 2017 (ICACSIS 2017).
The Gold Standard: Andry Luthfi, Bayu Distiawan, and Ruli Manurung, "Building an Indonesian named entity recognizer using Wikipedia and DBPedia", in the Proceedings of the 2014 International Conference on Asian Language Processing (IALP 2014).