metadata
license: mit
base_model:
  - google-bert/bert-base-multilingual-uncased
tags:
  - ner
  - indonesian
  - bert
language:
  - id
library_name: transformers

ner-bert-indonesian-v1

Model Description

ner-bert-indonesian-v1 is a fine-tuned version of google-bert/bert-base-multilingual-uncased for named-entity recognition (NER) in Indonesian. Version 1 assigns each token one of the following 4 labels:

  • O: others (entities not yet recognized by the model) - Lainnya
  • Person - Orang
  • Organisation - Organisasi
  • Place - Tempat/Lokasi

Usage

Using pipelines

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained('wuriyanto/ner-bert-indonesian-v1')
model = AutoModelForTokenClassification.from_pretrained('wuriyanto/ner-bert-indonesian-v1')

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "OpenAI adalah laboratorium penelitan kecerdasan buatan yang terdiri atas perusahaan waralaba OpenAI LP dan perusahaan induk nirlabanya, OpenAI Inc. Para pendirinya (sam altman) terdorong oleh ketakutan mereka akan kemungkinan bahwa kecerdasan buatan dapat mengancam keberadaan manusia, perusahaan ini ada di amerika serikat. PT. Indodana , salah satu perusahann di Indonesia mulai mengadopsi teknologi ini."

ner_results = nlp(example)
for n in ner_results:
    print(n)
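
The pipeline returns one dictionary per token piece, so a name that spans several pieces appears as several entries. The sketch below shows one way to merge adjacent pieces with the same label into spans by using the `start` and `end` character offsets, which also recovers the original casing that the uncased tokenizer drops. The helper name `merge_by_offsets`, its `max_gap` parameter, and the sample list are illustrative assumptions: the sample is hand-written in the shape of the pipeline output and is not real model output. It also assumes a fast tokenizer, since the offsets may be `None` otherwise.

```python
from typing import Dict, List


def merge_by_offsets(text: str, pieces: List[Dict], max_gap: int = 1) -> List[Dict]:
    """Merge adjacent pieces that share a label into spans (hypothetical helper).

    Two pieces are merged when they carry the same label and at most
    `max_gap` characters separate them in the original text.
    """
    spans: List[Dict] = []
    for piece in sorted(pieces, key=lambda p: p["start"]):
        last = spans[-1] if spans else None
        if (
            last is not None
            and piece["entity"] == last["entity"]
            and piece["start"] - last["end"] <= max_gap
        ):
            # Extend the current span and remember the piece's score
            last["end"] = piece["end"]
            last["scores"].append(piece["score"])
        else:
            spans.append({
                "entity": piece["entity"],
                "start": piece["start"],
                "end": piece["end"],
                "scores": [piece["score"]],
            })
    # Slice the original text so casing and spacing are preserved
    return [
        {
            "entity": s["entity"],
            "text": text[s["start"]:s["end"]],
            "score": sum(s["scores"]) / len(s["scores"]),
        }
        for s in spans
    ]


# Hand-written sample shaped like pipeline("ner") output; NOT real model output.
text = "Sam Altman memimpin OpenAI di Amerika Serikat."
sample = [
    {"entity": "Person", "score": 0.99, "word": "sam", "start": 0, "end": 3},
    {"entity": "Person", "score": 0.98, "word": "alt", "start": 4, "end": 7},
    {"entity": "Person", "score": 0.97, "word": "##man", "start": 7, "end": 10},
    {"entity": "Organisation", "score": 0.95, "word": "open", "start": 20, "end": 24},
    {"entity": "Organisation", "score": 0.94, "word": "##ai", "start": 24, "end": 26},
    {"entity": "Place", "score": 0.96, "word": "amerika", "start": 30, "end": 37},
    {"entity": "Place", "score": 0.93, "word": "serikat", "start": 38, "end": 45},
]

for span in merge_by_offsets(text, sample):
    print(span["entity"], "->", span["text"])
```

With real output, `merge_by_offsets(example, ner_results)` would be the corresponding call. The `pipeline` function also accepts an `aggregation_strategy` argument that performs a similar grouping; how well it fits this model's label scheme has not been checked here.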

Using a custom parser

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

id_to_label = {0: 'O', 1: 'Place', 2: 'Organisation', 3: 'Person'}

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('wuriyanto/ner-bert-indonesian-v1')
model = AutoModelForTokenClassification.from_pretrained('wuriyanto/ner-bert-indonesian-v1')

def tokenize_input(sentence):
    tokenized_input = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    return tokenized_input

def predict_ner(sentence):
    inputs = tokenize_input(sentence)

    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=2)

    # Convert predictions and tokens back to readable format
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    predicted_labels = [id_to_label[p.item()] for p in predictions[0]]

    # Merge subwords and filter out special tokens
    merged_tokens, merged_labels = [], []
    current_token, current_label = "", None
    for token, label in zip(tokens, predicted_labels):
        print(token, ' ', label)
        # Skip special tokens and punctuation (like [CLS], [SEP], commas, and periods)
        if token in ["[CLS]", "[SEP]"] or (label == "O" and token in [",", "."]):
            continue
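        if token.startswith("##"):
            # WordPiece continuation: append it to the word being built
            current_token += token[2:]
            # If the word is still unlabeled, adopt the label of this subword
            if current_label == 'O':
                current_label = label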
        else:
            if current_token:
                merged_tokens.append(current_token)
                merged_labels.append(current_label)
            current_token = token
            current_label = label
    if current_token:
        merged_tokens.append(current_token)
        merged_labels.append(current_label)

    results = list(zip(merged_tokens, merged_labels))
    return results

sentence = "OpenAI adalah laboratorium penelitan kecerdasan buatan yang terdiri atas perusahaan waralaba OpenAI LP dan perusahaan induk nirlabanya, OpenAI Inc. Para pendirinya (sam altman) terdorong oleh ketakutan mereka akan kemungkinan bahwa kecerdasan buatan dapat mengancam keberadaan manusia, perusahaan ini ada di amerika serikat. PT. Indodana , salah satu perusahann di Indonesia mulai mengadopsi teknologi ini."
results = predict_ner(sentence)
print(results)
for token, label in results:
    print(f"{token}: {label}")
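
The parser above returns one `(token, label)` pair per word, so a multi-word name such as a person's full name still comes back as separate pairs. A minimal sketch of grouping consecutive words that share a label into phrases follows. The helper name `group_entities` and the sample pairs are illustrative assumptions; the sample is hand-written in the shape of the `predict_ner` result and is not real model output.

```python
from itertools import groupby
from typing import List, Tuple


def group_entities(
    pairs: List[Tuple[str, str]], outside_label: str = "O"
) -> List[Tuple[str, str]]:
    """Join consecutive tokens with the same label into one phrase (hypothetical helper)."""
    entities = []
    # groupby yields one run per stretch of identical consecutive labels
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label == outside_label:
            continue  # drop tokens that are not part of any entity
        phrase = " ".join(token for token, _ in run)
        entities.append((phrase, label))
    return entities


# Hand-written sample shaped like the predict_ner() result; NOT real model output.
sample_pairs = [
    ("openai", "Organisation"),
    ("lp", "Organisation"),
    ("dan", "O"),
    ("sam", "Person"),
    ("altman", "Person"),
    ("di", "O"),
    ("amerika", "Place"),
    ("serikat", "Place"),
]

for phrase, label in group_entities(sample_pairs):
    print(f"{phrase}: {label}")
```

Because the labels carry no begin/inside markers, two different entities of the same type that sit directly next to each other would be joined into one phrase by this approach.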

Dataset and citation info

@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}