Model Card for the Jawi Word Sense Disambiguation Model
Model Details
Overview
This model performs word sense disambiguation for Jawi script text, specifically addressing cases where the same written form can correspond to multiple different Malay words. It is fine-tuned from SEA-LION (Southeast Asian Languages In One Network), specifically the 7B instruction-tuned research variant.
Model Architecture
- Base model: sea-lion-7b-instruct-research
- Model type: Large Language Model
- Fine-tuning approach: PEFT (parameter-efficient fine-tuning)
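The exact PEFT configuration is not documented here. As a rough illustration only, a LoRA-based setup on the base model could look like the sketch below; the hyperparameters and target_modules value are assumptions, not the values actually used for this model.
# Illustrative LoRA/PEFT setup sketch (assumed values, not the actual training config)
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "aisingapore/sea-lion-7b-instruct-research", trust_remote_code=True
)
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,                        # assumed adapter rank
    lora_alpha=16,              # assumed scaling factor
    lora_dropout=0.05,          # assumed dropout
    target_modules=["Wqkv"],    # assumed attention projection name for the MPT-style backbone
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable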
Intended Use
- Primary use: Disambiguating Jawi script words that have multiple possible interpretations in Malay
- Intended users: Researchers, linguists, and developers working with Malay/Jawi text processing
- Out-of-scope use cases: Should not be used for general Malay-language translation or text generation
Factors
Relevant Factors
- Jawi spellings that omit vowels, so a single written consonantal skeleton can correspond to several differently vowelized Malay words
- Context dependency of word meanings
- Regional variations in Malay vocabulary usage
Evaluation Factors
- Accuracy of disambiguation across different word types
- Performance variation based on context length and quality
- Handling of regional variations
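The evaluation protocol is not specified in detail here. As a point of reference only, disambiguation accuracy could be scored with a simple exact-match check such as the sketch below; the helper function and its example inputs are illustrative, not the evaluation code actually used for this model.
# Illustrative exact-match scoring for disambiguation outputs (assumed, not the actual evaluation)
def exact_match_accuracy(predictions, references):
    pairs = list(zip(predictions, references))
    correct = sum(p.strip().lower() == r.strip().lower() for p, r in pairs)
    return correct / len(pairs)

# Example: the second prediction confuses 'benteng' (fortification) with 'bentang' (to spread out)
print(exact_match_accuracy(["bintang", "benteng"], ["bintang", "bentang"]))  # -> 0.5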
Training Data
Dataset Description
- Source of training data: mevsg/bntng-disambiguation-v1
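The dataset can be inspected with the Hugging Face datasets library, as in the minimal sketch below; the "train" split name is an assumption, so consult the dataset card for the actual splits and fields.
# Minimal sketch: load and inspect the training dataset
from datasets import load_dataset

ds = load_dataset("mevsg/bntng-disambiguation-v1")
print(ds)              # available splits and columns
print(ds["train"][0])  # one example, assuming a "train" split exists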
Example Cases
Example of an ambiguous form handled by the model:
- بنتڠ (bntng) can be disambiguated into:
  - banteng (wild ox)
  - banting (to throw down)
  - bentang (to spread out)
  - benteng (fortification)
  - bintang (star)
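As an illustration (not part of the released code), the candidate senses above can be paired with the prompt format from the How to Use section to frame a disambiguation query; the CANDIDATES mapping and build_prompt helper are hypothetical names introduced here for the sketch.
# Illustrative sketch: list candidate senses and build a disambiguation prompt
CANDIDATES = {
    "bntng": ["banteng", "banting", "bentang", "benteng", "bintang"],
}

def build_prompt(surface_form: str, sentence: str) -> str:
    # Mirrors the prompt format shown in the How to Use section below
    question = f"Apakah perkataan yang sesuai menggantikan '{surface_form}' dalam ayat berikut:"
    return f"### USER:\n{question} {sentence}\n\n### RESPONSE:\n"

print(build_prompt("bntng", "Langit malam ini penuh dengan bntng yang bersinar terang."))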
Ethical Considerations
Bias and Fairness
- Potential biases in training data regarding regional variations of Malay
- Representation of different Malay-speaking communities
- Impact of disambiguation errors on downstream applications
Risks and Harms
- Potential misinterpretation in sensitive contexts (legal, historical documents)
- Impact of errors on cultural heritage preservation
- Considerations for automated systems relying on this model
Model Performance Limitations
- Known edge cases where disambiguation may fail
- Performance limitations with insufficient context
- Handling of out-of-vocabulary or rare words
- Regional variation coverage limitations
Additional Information
Version
- Model version: 1
- Last updated: 18 October 2024
How to Use
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the tokenizer from the base model (stated under Model Architecture)
base_model_path = "aisingapore/sea-lion-7b-instruct-research"
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)

# Load the base model and apply the fine-tuned PEFT adapter
device = "cuda" if torch.cuda.is_available() else "cpu"
peft_model_path = "mevsg/bntng-dis-v2"
base_model = AutoModelForCausalLM.from_pretrained(base_model_path, trust_remote_code=True)
model = PeftModel.from_pretrained(base_model, peft_model_path)
model.to(device)
model.eval()

# Example usage
# Prompt: "Which word is suitable to replace 'bntng' in the following sentence:"
prompt = "Apakah perkataan yang sesuai menggantikan 'bntng' dalam ayat berikut:"
# Sentence: "Tonight's sky is full of brightly shining bntng."
input_text = "Langit malam ini penuh dengan bntng yang bersinar terang."
full_prompt = f"### USER:\n{prompt} {input_text}\n\n### RESPONSE:\n"

tokens = tokenizer(full_prompt, return_tensors="pt").to(device)
output = model.generate(
    tokens["input_ids"],
    attention_mask=tokens["attention_mask"],
    max_new_tokens=20,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
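If only the model's answer is wanted, the echoed prompt can be stripped by decoding just the newly generated tokens:
# Decode only the tokens generated after the prompt
answer = tokenizer.decode(output[0][tokens["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)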