Malaysian Qwen2.5 1.5B-Instruct

Continue finetuning meta-llama/Llama-3.2-1B-Instruct on highly curated 1.2B tokens Malaysian instruction.

Improvement

128k context length.
Support respond in Mandarin, Tamil, Jawi, Manglish, Johor, Kedah, Kelantan, Pahang, Perak, Sabah, Sarawak, Selangor, Negeri Sembilan and Terengganu.
Able to code in Mandarin, Tamil, Jawi, Manglish, Johor, Kedah, Kelantan, Pahang, Perak, Sabah, Sarawak, Selangor, Negeri Sembilan and Terengganu.
Multi-turn Malaysian context such as related to Malaysian Legislation, politics, religions and languages.
Standard RAG.

MalayMMLU

                             Model   Accuracy   shot by_letter        category
0  Malaysian-Llama-3.2-1B-Instruct  39.705280  0shot      True            STEM
1  Malaysian-Llama-3.2-1B-Instruct  42.286896  0shot      True        Language
2  Malaysian-Llama-3.2-1B-Instruct  41.196878  0shot      True  Social science
3  Malaysian-Llama-3.2-1B-Instruct  44.615016  0shot      True          Others
4  Malaysian-Llama-3.2-1B-Instruct  42.616610  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : Malaysian-Llama-3.2-1B-Instruct
Metric : first
Shot : 0shot
average accuracy 42.17569074464131
accuracy for STEM 39.70528039295947
accuracy for Language 42.286895674300254
accuracy for Social science 41.1968777103209
accuracy for Others 44.61501559126889
accuracy for Humanities 42.61660978384528

Training session

Finetune on mesolitica/Malaysian-SFT to make the model understand Malaysian context.

How we train

LoRA on ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"].
256 Rank with alpha 512, or alpha of 2.0
Multipacking with proper SDPA causal masking to prevent document contamination and also make sure proper position ids.
Forked CCE loss for LoRA lm_head to reduce memory consumption.

Source code at https://github.com/malaysia-ai/cooking/tree/main/llama/sft

Example

Load the model,

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

tokenizer = AutoTokenizer.from_pretrained('malaysia-ai/Malaysian-Llama-3.2-1B-Instruct')
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    'malaysia-ai/Malaysian-Llama-3.2-1B-Instruct', torch_dtype = torch.bfloat16
).cuda()

All examples are using stochastic sampling method, might not able to reproduce the same results on different machines.
Some examples might been truncated, too long for this README.

malaysia-ai
/

Malaysian-Llama-3.2-1B-Instruct