Model Card for Jawi Word Sense Disambiguation Model

Model Details

Overview

This model performs word sense disambiguation for Jawi-script text, addressing cases where a single written form can correspond to several different Malay words. It is fine-tuned from SEA-LION (Southeast Asian Languages In One Network), specifically the 7B instruction-tuned research variant.

Model Architecture

The base model, SEA-LION 7B (aisingapore/sea-lion-7b), is a decoder-only causal language model. The fine-tuned weights are published as mevsg/bntng-dis-v2; the loading code under How to Use suggests they were produced with parameter-efficient fine-tuning (PEFT).

Intended Use

  • Primary use: Disambiguating Jawi script words that have multiple possible interpretations in Malay
  • Intended users: Researchers, linguists, and developers working with Malay/Jawi text processing
  • Out-of-scope use cases: Should not be used for general Malay-language translation or text generation

Factors

Relevant Factors

  • Different vowelization patterns in Jawi script that map to the same consonantal skeleton
  • Context dependency of word meanings
  • Regional variations in Malay vocabulary usage

Evaluation Factors

  • Accuracy of disambiguation across different word types (an evaluation sketch follows this list)
  • Performance variation based on context length and quality
  • Handling of regional variations
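
A small accuracy harness over labelled sentences is one way to make these factors measurable. The sketch below is illustrative only: the examples list and the disambiguate helper are hypothetical, and it assumes model, tokenizer, and device have been set up as in the How to Use section.

from collections import defaultdict

def disambiguate(model, tokenizer, device, sentence, surface_form="bntng"):
    # Same instruction-style prompt as in the How to Use section.
    prompt = f"Apakah perkataan yang sesuai menggantikan '{surface_form}' dalam ayat berikut:"
    full_prompt = f"### USER:\n{prompt} {sentence}\n\n### RESPONSE:\n"
    tokens = tokenizer(full_prompt, return_tensors="pt").to(device)
    output = model.generate(
        tokens["input_ids"],
        attention_mask=tokens["attention_mask"],
        max_new_tokens=20,
        eos_token_id=tokenizer.eos_token_id,
    )
    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    # Decoder-only generation echoes the prompt; keep only the model's answer.
    return decoded.split("### RESPONSE:")[-1].strip()

# Hypothetical labelled examples: (sentence containing 'bntng', expected word).
examples = [
    ("Langit malam ini penuh dengan bntng yang bersinar terang.", "bintang"),
    ("Dia bntng tikar di atas lantai.", "bentang"),
]

correct, total = defaultdict(int), defaultdict(int)
for sentence, gold in examples:
    prediction = disambiguate(model, tokenizer, device, sentence)
    total[gold] += 1
    correct[gold] += int(prediction.lower().startswith(gold))

# Per-word accuracy surfaces performance variation across word types.
for word in sorted(total):
    print(f"{word}: {correct[word]}/{total[word]} correct")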

Training Data

Dataset Description

Example Cases

Example of an ambiguous form the model handles (a lookup-table sketch follows this list):

  • بنتڠ (bntng) can be disambiguated into:
    • banteng (wild ox)
    • banting (to throw down)
    • bentang (to spread out)
    • benteng (fortification)
    • bintang (star)
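
One way to keep track of the mapping from a Jawi surface form to its candidate readings is a small lookup table. The sketch below is hypothetical throughout (the CANDIDATES table and candidate_prompt helper are illustrative names), and the prompt variant that lists the candidates explicitly is an assumption, not necessarily the format the model was trained on.

# Hypothetical lookup table: Jawi surface form -> candidate Malay readings.
CANDIDATES = {
    "بنتڠ": ["banteng", "banting", "bentang", "benteng", "bintang"],
}

def candidate_prompt(jawi_form, sentence):
    # Build a Malay prompt ("Which word suitably replaces '...' in the
    # following sentence?") that spells out the candidate readings.
    options = ", ".join(CANDIDATES[jawi_form])
    question = (
        f"Apakah perkataan yang sesuai menggantikan '{jawi_form}' "
        f"dalam ayat berikut ({options}):"
    )
    return f"### USER:\n{question} {sentence}\n\n### RESPONSE:\n"

print(candidate_prompt("بنتڠ", "Langit malam ini penuh dengan bntng yang bersinar terang."))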

Ethical Considerations

Bias and Fairness

  • Potential biases in training data regarding regional variations of Malay
  • Representation of different Malay-speaking communities
  • Impact of disambiguation errors on downstream applications

Risks and Harms

  • Potential misinterpretation in sensitive contexts (legal, historical documents)
  • Impact of errors on cultural heritage preservation
  • Considerations for automated systems relying on this model

Model Performance Limitations

  • Known edge cases where disambiguation may fail
  • Performance limitations with insufficient context
  • Handling of out-of-vocabulary or rare words
  • Regional variation coverage limitations

Additional Information

Version

  • Model version: 1
  • Last updated: 18 October 2024

How to Use

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer from the base model
base_model_path = "aisingapore/sea-lion-7b"
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)

# Load the fine-tuned model (if this checkpoint is a PEFT adapter, the peft
# package must be installed for transformers to load it)
device = "cuda" if torch.cuda.is_available() else "cpu"
peft_model_path = "mevsg/bntng-dis-v2"
model = AutoModelForCausalLM.from_pretrained(peft_model_path, trust_remote_code=True)
model.to(device)

# Example usage. The Malay prompt asks: "Which word is suitable to replace
# 'bntng' in the following sentence:"; the sentence means "The night sky is
# full of brightly shining bntng."
prompt = "Apakah perkataan yang sesuai menggantikan 'bntng' dalam ayat berikut:"
input_text = "Langit malam ini penuh dengan bntng yang bersinar terang."
full_prompt = f"### USER:\n{prompt} {input_text}\n\n### RESPONSE:\n"

tokens = tokenizer(full_prompt, return_tensors="pt").to(device)
output = model.generate(
    tokens["input_ids"],
    attention_mask=tokens["attention_mask"],
    max_new_tokens=20,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
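
Because decoder-only generation returns the prompt together with the completion, the decoded string above contains the full prompt. Assuming the ### RESPONSE: marker survives decoding, a minimal way to keep only the model's answer is:

answer = tokenizer.decode(output[0], skip_special_tokens=True).split("### RESPONSE:")[-1].strip()
print(answer)  # for the example sentence, the intended answer is 'bintang'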