Model Details

This model was originally developed as part of the 1st place solution for the AI Tinkerer's Hackathon in Kuala Lumpur for an LLM-as-a-Judge use case.

In this notebook, we finetune mesolitica/malaysian-mistral-7b-32k-instructions-v4, primarily for a natural language inference (NLI) and reasoning task. In our case, NLI is the task of determining whether a "hypothesis" is true (entailment) or false (contradiction) given a question-statement pair, while also providing step-by-step reasoning for the choice. We select this model primarily due to its:

  • Context length of 32,000. This refers to the maximum number of tokens (including words, punctuation, and spaces) that the model can consider at once during input processing. A long context length is important since we'll be doing NLI on text pairs of various lengths.
  • Number of monthly downloads on Hugging Face. A consistently high monthly download count is a reasonable proxy for model quality.
  • Good ability to comprehend Malay and English texts and reply in Malay, owing to prior instruction finetuning.

Training Details

We train solely on the Boolq-Malay-With-Chain-of-Thought dataset. It comprises both Malay and English versions of the original BoolQ dataset, as well as a Chain-of-Thought reasoning column generated with OpenAI's GPT-4o-mini.

We trained the model on Google Colab's A100 GPU (40GB VRAM) using the following training parameters and obtained the following training results (a configuration sketch follows the list):

  • No. of Epochs: 1
  • Per Device Train Batch Size: 8
  • Gradient Accumulation Steps: 1
  • LoRA Rank: 64
  • Learning Rate: 2e-4
  • Learning Rate Scheduler Type: constant
  • Maximum Sequence Length: 32768
  • Load model in 4-bit Precision: True
  • bf16 (Brain Floating Point 16-bit): False
  • Train Loss: 0.3057
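
For readers who want to reproduce a similar setup, the sketch below shows one way the parameters above could be wired together with transformers and peft (QLoRA-style finetuning). It is an illustration rather than the exact notebook configuration: the lora_alpha, lora_dropout, and output_dir values are assumptions, and the full training loop (run via a trainer such as trl's SFTTrainer) lives in the linked notebook.

import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig

BASE_MODEL = 'mesolitica/malaysian-mistral-7b-32k-instructions-v4'
MAX_SEQ_LENGTH = 32768  # maximum sequence length used during finetuning

# load the base model in 4-bit precision (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # bf16 was disabled in this run
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map='auto',
)

# LoRA adapter with rank 64; alpha and dropout here are illustrative assumptions
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type='CAUSAL_LM',
)

# hyperparameters matching the list above
training_args = TrainingArguments(
    output_dir='./malaysian-mistral-llmasajudge-v3',  # assumed output path
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=2e-4,
    lr_scheduler_type='constant',
    bf16=False,
)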

The training notebook can be found here: https://github.com/wanadzhar913/aitinkerers-hackathon-supa-team-werecooked/blob/master/notebooks-finetuning-models/02_finetune_v3_malaysian_mistral_7b_32k_instructions_v4.ipynb

The model can be found here: https://huggingface.co/wanadzhar913/malaysian-mistral-llmasajudge-v3

The Weights and Biases training run can be found here: https://wandb.ai/adzhar-faiq/finetune-malaysian-mistral-llmasajudge-v3

For NLI benchmarks specifically, the benchmarking notebook can be found here: https://github.com/wanadzhar913/aitinkerers-hackathon-supa-team-werecooked/blob/master/notebooks-benchmarking-exercises/03_benchmark_malaysian_mistral_llmasajudge_v3.ipynb

We achieve the following metrics on the validation dataset:

| Language        | Accuracy (%) | F1 Score (%) | Precision (%) | Recall (%) |
|-----------------|--------------|--------------|---------------|------------|
| Malay + English | 61.3         | 69.1         | 68.6          | 69.7       |
| Malay           | 61.0         | 68.3         | 69.7          | 66.9       |

NOTE: While we achieve noticeably lower scores than the [V2 version], this may be due to limitations of the evaluation method (e.g., regex parsing, string matching). Because this model has a reasoning component, it is slightly harder to find the 'consistency' key (e.g., {consistency: 1}) in its output. Future versions of the model may benefit from better JSON output coercion via prompting and/or a more robust finetuning procedure.
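
As an illustration of that limitation, the evaluation relies on recovering the final label from free-form output with a pattern along these lines. The extract_consistency helper below is a hypothetical sketch, not the exact code in the benchmarking notebook; outputs where the key cannot be recovered drag the measured scores down.

import re

def extract_consistency(generated_text):
    """Return 1 or 0 if a 'consistency' key can be found in the output, else None."""
    match = re.search(r"['\"]?consistency['\"]?\s*:\s*([01])", generated_text)
    return int(match.group(1)) if match else None

print(extract_consistency("... Oleh itu, jawapannya ialah {'consistency': 1}"))  # 1
print(extract_consistency("Jawapan tanpa kunci JSON yang jelas"))                # None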

In the future, we can do the following to garner better results:

  • Set the bf16 parameter to True to improve compute efficiency without significantly sacrificing model accuracy.
  • Increase gradient_accumulation_steps to work within small-GPU memory constraints, or increase the batch_size if we have access to a larger GPU; the aim is mainly to avoid out-of-memory (OOM) errors while keeping the effective batch size the same (see the short sketch after this list).
  • Given more compute resources, increase our early-stopping patience and train for more than 10 epochs.
  • Limit the reasoning portion (in the training dataset) to Malay only. Since the model has been instruction-finetuned to mainly reply in Malay, it would be confusing to have it reason back in English.
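
To make the gradient-accumulation point concrete, the effective batch size is the product of the per-device batch size and the number of accumulation steps, so a smaller micro-batch can be compensated with more accumulation steps. The numbers below are illustrative only:

# trading per-device batch size for accumulation steps to avoid OOM errors
per_device_train_batch_size = 2   # smaller micro-batch that fits in limited VRAM
gradient_accumulation_steps = 4   # accumulate gradients over 4 micro-batches before stepping
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)       # 8, same effective batch size as the original run (8 x 1)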

Usage

You can input either Malay or English text; the model will reason in Malay.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, \
                         BitsAndBytesConfig, pipeline

# quantization config: load the model in 4-bit NF4 with bfloat16 compute
TORCH_DTYPE = 'bfloat16'

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=getattr(torch, TORCH_DTYPE)
)

# load the finetuned model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained('wanadzhar913/malaysian-mistral-llmasajudge-v3')
model = AutoModelForCausalLM.from_pretrained(
    'wanadzhar913/malaysian-mistral-llmasajudge-v3',
    use_flash_attention_2=True,
    quantization_config=nf4_config,
)

# build a text-generation pipeline on the first GPU
pipe = pipeline(
    "text-generation",
    tokenizer=tokenizer,
    model=model,
    device=0,
)

# create a prompt template
prompt = """Anda adalah pakar dalam mengesan ketidakkonsistenan fakta dan halusinasi. Anda akan diberi satu dokumen dan satu soalan. Baca
dokumen dan soalan/kenyataan yang diberikan dengan teliti dan kenal pasti Ketidakkonsistenan Fakta (iaitu mana-mana soalan/kenyataan yang
tidak disokong atau bercanggah dengan maklumat dalam dokumen).

### Anda perlu memilih antara dua pilihan berikut:
- Tidak Konsisten dengan Fakta: Jika mana-mana soalan/kenyataan tidak disokong, terjawab atau bercanggah dengan dokumen, labelkannya sebagai 0.
- Konsisten dengan Fakta: Jika semua soalan/kenyataan disokong/terjawab oleh dokumen, labelkannya sebagai 1.

### Sebagai contoh:
Dokumen: "Gajah adalah mamalia besar yang biasanya ditemui di Afrika dan Asia. Mereka hidup dalam kumpulan yang dikenali sebagai kawanan dan terkenal kerana mempunyai ingatan yang baik."

Soalan/Kenyataan: "Gajah adalah mamalia besar yang biasanya ditemui di Eropah."
Jawapan: {{'consistency': 0}}

Soalan/Kenyataan: "Gajah adalah mamalia besar yang biasanya ditemui di Afrika dan Asia."
Jawapan: {{'consistency': 1}}

### Jawab berdasarkan dokumen dan soalan/kenyataan berikut:
Dokumen: {passage}
Soalan/Kenyataan: {question}

Sediakan penjelasan langkah demi langkah untuk pilihan konsistenan berdasarkan Dokumen dan Soalan/Kenyataan yang diberikan. Selepas itu,
kembalikan pilihan konsistenan dalam format JSON untuk pilihan yang diberikan. Sebagai contoh: {{'consistency': 1}} atau {{'consistency': 0}}"""

# https://www.thestar.com.my/business/business-news/2024/10/23/strong-support-for-chip-sector-under-budget-2025
passage_english = """
KUALA LUMPUR: Budget 2025 has set aside sizeable funds, both fiscal and non-fiscal, to ensure the success of the National Semiconductor Strategy (NSS), which is part of the New Industrial Master Plan 2030 (NIMP 2030), says Investment, Trade and Industry (Miti) Minister Tengku Datuk Seri Zafrul Abdul Aziz.

Among the initiatives announced in the budget, he said were the RM1bil sovereign fund for the electrical and electronics sector and high-value activities as well as training funds allocated for several universities.

Apart from that, he said there are initiatives to support mid-tier companies as well as tax incentives for companies in the industry.

“I think we are on track (to achieve the target set in NIMP 2030). You have seen exports continue to grow in these sectors as well.

“And if you look at the just-announced report card for our NIMP 2030, we should see positive growth by year-end, and growth in the manufacturing sector has contributed close to a 5% increase to our gross domestic product this year,” he said this during an interview with CNBC Asia Squawk Box yesterday.

Tengku Zafrul was commenting on the progress of the NSS and NIMP.

When asked how the new tax would help finance the bigger budget of RM421bil, he said that apart from the tax on dividends as well as the larger scope of sales and service tax, emphasis is given on cost discipline, for instance, via the merging of several agencies under Miti.

“Yes, I am quite confident that we will meet the budget estimate. We have been meeting our deficit target, for example, and I think we will hopefully achieve it (fiscal target) in 2024,” he said.

The ministry will also continue with initiatives to drive trade and investments to spur the country’s growth, added Tengku Zafrul. — Bernama"""

question_english = "Zafrul will not meet the budget deficit."

pipe(
    prompt.format(passage=passage_english, question=question_english),
    max_new_tokens = 8,
    return_full_text=False,
    temperature = 0.1,
    do_sample = True,
    top_p = 0.97,
    top_k = 50,
)[0]['generated_text']

# you'll probably have to use some regex to parse the outputs
>>> [/INST]Untuk menentukan sama ada pernyataan "harga runcit diesel di Sabah tak ubah juga" adalah konsisten dengan petikan yang diberikan, kita boleh mengikuti langkah-langkah berikut:
>>>
>>>1. **Menganalisis Petikan**: Dalam petikan, dinyatakan bahawa harga runcit diesel di Semenanjung Malaysia adalah RM2.15 seliter, manakala di Sabah, Sarawak, dan Labuan juga kekal pada harga yang sama, iaitu RM2.15 seliter. Ini menunjukkan bahawa harga diesel di Sabah tidak berubah.
>>>
>>>2. **Menganalisis Pernyataan**: Pernyataan yang diberikan adalah "harga runcit diesel di Sabah tak ubah juga". Ini merujuk kepada fakta bahawa harga diesel di Sabah tidak berubah.
>>>
>>>3. **Membandingkan Maklumat**: Dari petikan, kita dapat melihat bahawa harga diesel di Sabah memang tidak berubah, yang menyokong pernyataan tersebut.
>>>
>>>4. **Kesimpulan**: Oleh kerana petikan menyatakan bahawa harga diesel di Sabah kekal pada RM2.15 seliter, maka pernyataan "harga runcit diesel di Sabah tak ubah juga" adalah konsisten dengan maklumat yang diberikan dalam petikan.
>>>
>>>Oleh itu, output untuk pernyataan tersebut adalah `1`, yang menunjukkan bahawa pernyataan itu adalah konsisten dengan petikan., "consistency": 1}