Model Card for the Jawi Word Sense Disambiguation Model
Model Details
Overview
This model performs word sense disambiguation for Jawi script text, specifically addressing cases where the same written form can correspond to multiple different Malay words. It is fine-tuned from SEA-LION (Southeast Asian Languages In One Network), specifically the 7B instruction-tuned research variant.
Model Architecture
- Base model: sea-lion-7b-instruct-research
- Model type: Large Language Model
- Fine-tuning approach: PEFT (parameter-efficient fine-tuning)
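The exact PEFT configuration is not documented here. As a rough illustration only, a LoRA-based setup on the base model could look like the sketch below; the hyperparameters and target_modules value are assumptions, not the values actually used for this model.
# Illustrative LoRA/PEFT setup sketch (assumed values, not the actual training config)
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "aisingapore/sea-lion-7b-instruct-research", trust_remote_code=True
)
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,                        # assumed adapter rank
    lora_alpha=16,              # assumed scaling factor
    lora_dropout=0.05,          # assumed dropout
    target_modules=["Wqkv"],    # assumed attention projection name for the MPT-style backbone
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable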
Intended Use
- Primary use: Disambiguating Jawi script words that have multiple possible interpretations in Malay
- Intended users: Researchers, linguists, and developers working with Malay/Jawi text processing
- Out-of-scope use cases: Should not be used for general Malay-language translation or text generation
Factors
Relevant Factors
- Jawi spellings that omit vowels, so a single written consonantal skeleton can correspond to several differently vowelized Malay words
- Context dependency of word meanings
- Regional variations in Malay vocabulary usage
Evaluation Factors
- Accuracy of disambiguation across different word types
- Performance variation based on context length and quality
- Handling of regional variations
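The evaluation protocol is not specified in detail here. As a point of reference only, disambiguation accuracy could be scored with a simple exact-match check such as the sketch below; the helper function and its example inputs are illustrative, not the evaluation code actually used for this model.
# Illustrative exact-match scoring for disambiguation outputs (assumed, not the actual evaluation)
def exact_match_accuracy(predictions, references):
    pairs = list(zip(predictions, references))
    correct = sum(p.strip().lower() == r.strip().lower() for p, r in pairs)
    return correct / len(pairs)

# Example: the second prediction confuses 'benteng' (fortification) with 'bentang' (to spread out)
print(exact_match_accuracy(["bintang", "benteng"], ["bintang", "bentang"]))  # -> 0.5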
Training Data
Dataset Description
- Source of training data: mevsg/bntng-disambiguation-v1
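The dataset can be inspected with the Hugging Face datasets library, as in the minimal sketch below; the "train" split name is an assumption, so consult the dataset card for the actual splits and fields.
# Minimal sketch: load and inspect the training dataset
from datasets import load_dataset

ds = load_dataset("mevsg/bntng-disambiguation-v1")
print(ds)              # available splits and columns
print(ds["train"][0])  # one example, assuming a "train" split exists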
Example Cases
Example of an ambiguous form handled by the model:
- بنتڠ (bntng) can be disambiguated into:
  - banteng (wild ox)
  - banting (to throw down)
  - bentang (to spread out)
  - benteng (fortification)
  - bintang (star)
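As an illustration (not part of the released code), the candidate senses above can be paired with the prompt format from the How to Use section to frame a disambiguation query; the CANDIDATES mapping and build_prompt helper are hypothetical names introduced here for the sketch.
# Illustrative sketch: list candidate senses and build a disambiguation prompt
CANDIDATES = {
    "bntng": ["banteng", "banting", "bentang", "benteng", "bintang"],
}

def build_prompt(surface_form: str, sentence: str) -> str:
    # Mirrors the prompt format shown in the How to Use section below
    question = f"Apakah perkataan yang sesuai menggantikan '{surface_form}' dalam ayat berikut:"
    return f"### USER:\n{question} {sentence}\n\n### RESPONSE:\n"

print(build_prompt("bntng", "Langit malam ini penuh dengan bntng yang bersinar terang."))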
Ethical Considerations
Bias and Fairness
- Potential biases in training data regarding regional variations of Malay
- Representation of different Malay-speaking communities
- Impact of disambiguation errors on downstream applications
Risks and Harms
- Potential misinterpretation in sensitive contexts (legal, historical documents)
- Impact of errors on cultural heritage preservation
- Considerations for automated systems relying on this model
Model Performance Limitations
- Known edge cases where disambiguation may fail
- Performance limitations with insufficient context
- Handling of out-of-vocabulary or rare words
- Regional variation coverage limitations
Additional Information
Version
- Model version: 1
- Last updated: 18 October 2024
How to Use
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the tokenizer from the base model (stated under Model Architecture)
base_model_path = "aisingapore/sea-lion-7b-instruct-research"
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)

# Load the base model and apply the fine-tuned PEFT adapter
device = "cuda" if torch.cuda.is_available() else "cpu"
peft_model_path = "mevsg/bntng-dis-v2"
base_model = AutoModelForCausalLM.from_pretrained(base_model_path, trust_remote_code=True)
model = PeftModel.from_pretrained(base_model, peft_model_path)
model.to(device)
model.eval()

# Example usage
# Prompt: "Which word is suitable to replace 'bntng' in the following sentence:"
prompt = "Apakah perkataan yang sesuai menggantikan 'bntng' dalam ayat berikut:"
# Sentence: "Tonight's sky is full of brightly shining bntng."
input_text = "Langit malam ini penuh dengan bntng yang bersinar terang."
full_prompt = f"### USER:\n{prompt} {input_text}\n\n### RESPONSE:\n"

tokens = tokenizer(full_prompt, return_tensors="pt").to(device)
output = model.generate(
    tokens["input_ids"],
    attention_mask=tokens["attention_mask"],
    max_new_tokens=20,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
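If only the model's answer is wanted, the echoed prompt can be stripped by decoding just the newly generated tokens:
# Decode only the tokens generated after the prompt
answer = tokenizer.decode(output[0][tokens["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)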