indot5-bloom-fft-v2 🚀

⚠️ Usage & License Disclaimer

The raw evaluation data, matrices, and analytical findings presented in this Model Card are strictly associated with two upcoming academic publications:

  1. "Beyond Lexical Accuracy: Investigating the Structural and Semantic Compliance of idT5 in Indonesian Educational AQG"
  2. "Parameter-Efficient vs. Full Fine-Tuning for Indonesian Educational Question Generation: A Comparative Study on idT5"

Any extraction, reproduction, or use of the empirical data presented below without explicit prior authorization and citation of the forthcoming manuscripts is strictly prohibited to maintain academic integrity and prevent self-plagiarism issues prior to the official publication.


This model was trained on the Indo-Bloom Corpus (available here) and is optimized for Indonesian Automatic Question Generation (AQG) aligned with Bloom's Taxonomy.

Below are the comprehensive evaluation results benchmarking this model against both domain-specific Small Language Models (SLMs) and global Large Language Models (LLMs). The raw metrics can be found in hasil_evaluasi_joiv.csv attached to this repository.


📊 Lexical Similarity Performance (N-GRAM)

Comparative results on how well the model mimics human-written questions:

| Model Strategy | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- | --- |
| idT5-base (Muchad) | 0 | 2.76 | 0.27 | 2.63 |
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | 11.03 | 34.45 | 12.1 | 32.14 |
| indot5-bloom-specialized-v2 (LoRA - Ours) | 17.79 | 44.39 | 29.07 | 43.61 |
| indot5-bloom-fft-v2 (FFT - Ours) | 17.38 | 42.58 | 27.16 | 41.35 |
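
For readers unfamiliar with the ROUGE-L column, here is a minimal sketch of how an LCS-based ROUGE-L F1 score can be computed for one generated question against one reference. It assumes lowercased, whitespace-split tokens and a balanced F1; the tokenizer and scoring library actually used for this card are not stated, so the numbers in the table may have been produced differently. The example sentences are made up and do not come from the Indo-Bloom test set.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercased, whitespace-split tokens (an assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up example pair, not taken from the evaluation data.
reference = "Sebutkan tiga jenis batuan yang ada di bumi"
candidate = "Sebutkan tiga jenis batuan di bumi"
score = rouge_l_f1(candidate, reference)
print(f"ROUGE-L F1: {score:.4f}")
```

Here the candidate is a subsequence of the reference, so precision is 6/6 and recall is 6/8.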

🎓 Pedagogical Compliance Performance (TCI)

Evaluation of the model's ability to follow Bloom's Taxonomy directives (Operational Verbs/KKO):

| Model Strategy | Structural (TCI-C1) | TCI-C2 (%) | TCI-All (%) |
| --- | --- | --- | --- |
| idT5-base (Muchad) | 0% | 30 | 15 |
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | 98% | 12 | 55 |
| indot5-bloom-specialized-v2 (LoRA - Ours) | 88% | 12 | 50 |
| indot5-bloom-fft-v2 (FFT - Ours) | 100% | 6 | 53 |
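
The exact TCI scoring procedure is not described in this card. As a rough illustration of what an operational-verb (KKO) check could look like, the sketch below counts a question as compliant when it contains at least one verb from the list for its target level. The verb lists, the function names, and the sample questions are all hypothetical and are not the inventory or data used in the evaluation.

```python
# Illustrative verb lists only; the real KKO inventory behind TCI is not given here.
KKO_BY_LEVEL = {
    "C1": {"sebutkan", "tuliskan", "tunjukkan", "definisikan"},
    "C2": {"jelaskan", "uraikan", "bandingkan", "simpulkan"},
}


def follows_level(question, level):
    """True if the question contains an operational verb listed for the level."""
    tokens = {tok.strip(".,?!:;\"'()").lower() for tok in question.split()}
    return bool(tokens & KKO_BY_LEVEL[level])


def compliance_rate(items):
    """Percentage of (question, level) pairs that pass the verb check."""
    if not items:
        return 0.0
    hits = sum(follows_level(question, level) for question, level in items)
    return 100.0 * hits / len(items)


# Made-up questions, not model outputs.
samples = [
    ("Sebutkan tiga jenis batuan!", "C1"),
    ("Jelaskan proses terjadinya hujan.", "C2"),
    ("Apa ibu kota Indonesia?", "C2"),
    ("Tuliskan rumus luas lingkaran.", "C1"),
]
rate = compliance_rate(samples)
print(f"Compliance: {rate:.1f}%")
```

A keyword match of this kind only measures surface form, which is why a model can score well on it while scoring poorly on a semantic judge.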

⚖️ Hybrid Evaluation Report (Lexical vs Structural vs Semantic)

This table summarizes the core findings of the research, highlighting the gap between structural obedience and semantic reasoning.

| Model Strategy | Lexical (BLEU) | Structural (TCI-C1) | Semantic (AI-Judge) |
| --- | --- | --- | --- |
| idT5-base (Muchad) | 0 | 0% | 5% |
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | 11.03 | 98% | 20% |
| indot5-bloom-specialized-v2 (LoRA - Ours) | 17.79 | 88% | 30% |
| indot5-bloom-fft-v2 (FFT - Ours) | 17.38 | 100% | 30% |

===============================================================================================
TABLE 1: LEXICAL SIMILARITY PERFORMANCE (MAIN METRICS)
===============================================================================================

| Model Strategy | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- | --- |
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | 11.03 | 34.45 | 12.1 | 32.14 |
| indot5-bloom-specialized-v2 (LoRA - Ours) | 17.79 | 44.39 | 29.07 | 43.61 |
| indot5-bloom-fft-v2 (FFT - Ours) | 17.38 | 42.58 | 27.16 | 41.35 |
| Qwen-2.5-3B-Instruct (Alibaba) | 3.38 | 28.28 | 7.9 | 25.15 |
| Qwen-2.5-7B-Instruct (Alibaba) | 6.3 | 28.98 | 9.02 | 25.68 |
| Llama-3.1-8B-Instruct (Meta) | 8.47 | 33.95 | 13.61 | 31.73 |
| Gemma-2-9B-IT (Google) | 12.62 | 39.88 | 18.77 | 38.35 |
| Mistral-7B-Instruct-v0.3 (Mistral) | 4.31 | 25.15 | 7.95 | 22.73 |

===============================================================================================
TABLE 2: LEXICAL SIMILARITY PERFORMANCE (N-GRAM DETAILS)
===============================================================================================

| Model Strategy | Dataset | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- |
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | Indo-Bloom (Test) | 0.35 | 0.14 | 0.07 | 0.05 | 32.14 |
| indot5-bloom-specialized-v2 (LoRA - Ours) | Indo-Bloom (Test) | 0.43 | 0.22 | 0.13 | 0.09 | 43.61 |
| indot5-bloom-fft-v2 (FFT - Ours) | Indo-Bloom (Test) | 0.41 | 0.2 | 0.13 | 0.09 | 41.35 |
| Qwen-2.5-3B-Instruct (Alibaba) | Indo-Bloom (Test) | 0.23 | 0.05 | 0.01 | 0.01 | 25.15 |
| Qwen-2.5-7B-Instruct (Alibaba) | Indo-Bloom (Test) | 0.27 | 0.08 | 0.03 | 0.02 | 25.68 |
| Llama-3.1-8B-Instruct (Meta) | Indo-Bloom (Test) | 0.29 | 0.11 | 0.05 | 0.03 | 31.73 |
| Gemma-2-9B-IT (Google) | Indo-Bloom (Test) | 0.34 | 0.15 | 0.09 | 0.06 | 38.35 |
| Mistral-7B-Instruct-v0.3 (Mistral) | Indo-Bloom (Test) | 0.19 | 0.05 | 0.02 | 0.02 | 22.73 |
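
To make the BLEU 1 to BLEU 4 columns easier to read, the sketch below computes the clipped (modified) n-gram precision that BLEU is built from, for n = 1 to 4, on a single candidate and reference. It assumes lowercased whitespace tokens and one reference per question. Whether the table reports per-order precisions or cumulative BLEU-n scores, and which library produced them, is not stated in this card, so treat this only as an illustration of the underlying quantity. The sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


# Made-up example pair, not taken from the evaluation data.
reference = "Sebutkan tiga jenis batuan yang ada di bumi"
candidate = "Sebutkan tiga jenis batuan di bumi"
precisions = {n: modified_precision(candidate, reference, n) for n in range(1, 5)}
for n, p in precisions.items():
    print(f"{n}-gram precision: {p:.2f}")
```

Precision falls as n grows because longer n-grams are broken by the two words missing from the candidate, which mirrors the downward trend from BLEU 1 to BLEU 4 in the table.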

===============================================================================================
TABLE 3: PEDAGOGICAL COMPLIANCE PERFORMANCE (TCI)
===============================================================================================

| Model Strategy | Structural (TCI-C1) | TCI-C2 (%) | TCI-All (%) |
| --- | --- | --- | --- |
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | 98% | 12 | 55 |
| indot5-bloom-specialized-v2 (LoRA - Ours) | 88% | 12 | 50 |
| indot5-bloom-fft-v2 (FFT - Ours) | 100% | 6 | 53 |
| Qwen-2.5-3B-Instruct (Alibaba) | 86% | 52 | 69 |
| Qwen-2.5-7B-Instruct (Alibaba) | 66% | 16 | 41 |
| Llama-3.1-8B-Instruct (Meta) | 84% | 8 | 46 |
| Gemma-2-9B-IT (Google) | 90% | 26 | 58 |
| Mistral-7B-Instruct-v0.3 (Mistral) | 84% | 8 | 46 |
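
This card does not define how TCI-All is derived. The reported values are consistent with the unweighted mean of TCI-C1 and TCI-C2, and the short check below verifies that against every row of Table 3. This is an observation about the numbers, not a confirmed definition; the dictionary and function names are illustrative.

```python
# (TCI-C1 %, TCI-C2 %, reported TCI-All %) copied from Table 3.
TABLE_3 = {
    "idT5-base-qaqg-v1.12": (98, 12, 55),
    "indot5-bloom-specialized-v2 (LoRA)": (88, 12, 50),
    "indot5-bloom-fft-v2 (FFT)": (100, 6, 53),
    "Qwen-2.5-3B-Instruct": (86, 52, 69),
    "Qwen-2.5-7B-Instruct": (66, 16, 41),
    "Llama-3.1-8B-Instruct": (84, 8, 46),
    "Gemma-2-9B-IT": (90, 26, 58),
    "Mistral-7B-Instruct-v0.3": (84, 8, 46),
}


def mean_of_levels(c1, c2):
    """Unweighted mean of the two per-level scores (assumed aggregation)."""
    return (c1 + c2) / 2


# Collect any row where the assumed aggregation disagrees with the reported value.
mismatches = {
    name: (mean_of_levels(c1, c2), reported)
    for name, (c1, c2, reported) in TABLE_3.items()
    if mean_of_levels(c1, c2) != reported
}
print("Rows that do not match the mean:", mismatches)
```

An unweighted mean treats C1 and C2 as equally important regardless of how many test questions fall in each level.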

===============================================================================================
TABLE 4: HYBRID EVALUATION REPORT (SLM vs LLM)
===============================================================================================

| Model Strategy | Lexical (BLEU) | Structural (TCI-C1) | Semantic (AI-Judge) |
| --- | --- | --- | --- |
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | 11.03 | 98% | 20% |
| indot5-bloom-specialized-v2 (LoRA - Ours) | 17.79 | 88% | 30% |
| indot5-bloom-fft-v2 (FFT - Ours) | 17.38 | 100% | 30% |
| Qwen-2.5-3B-Instruct (Alibaba) | 3.38 | 86% | 16% |
| Qwen-2.5-7B-Instruct (Alibaba) | 6.3 | 66% | 27% |
| Llama-3.1-8B-Instruct (Meta) | 8.47 | 84% | 17% |
| Gemma-2-9B-IT (Google) | 12.62 | 90% | 20% |
| Mistral-7B-Instruct-v0.3 (Mistral) | 4.31 | 84% | 19% |

📈 Training Convergence (FFT Phase)

The model reached stable convergence, with a final validation loss of 7.33 at epoch 30.

| Phase | Epoch | Training Loss | Validation Loss |
| --- | --- | --- | --- |
| Final | 30 | 29.9418 | 7.3284 |

💡 Key Findings

  • FFT C1 Mastery: The FFT model scores 100% on structural compliance for C1 questions (TCI-C1), the highest of all models compared.
  • LLM Pedagogical Disobedience: General-purpose LLMs such as Llama-3.1 (84%) and Qwen (66% to 86%) struggle to follow rigid Bloom's Taxonomy syntax, falling below the 100% TCI-C1 of the fully fine-tuned idT5.
  • The Structural-Semantic Gap: Structural compliance on C1 is relatively high for every fine-tuned and instruction-tuned model (66% to 100%), but TCI-C2 and AI-Judge semantic scores are much lower (6% to 52% and 16% to 30%, respectively), indicating that models learn the syntax of higher-order questions better than their deep meaning.
  • Future Work: These results motivate a transition to Representation Engineering to bridge this semantic gap.
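
The structural-semantic gap can be made concrete by subtracting the AI-Judge score from the TCI-C1 score for each model. The sketch below does this with the values copied from Table 4. The "gap" measure and the variable names are illustrative choices for this card, not a metric defined in the forthcoming manuscripts.

```python
# (Structural TCI-C1 %, Semantic AI-Judge %) copied from Table 4.
TABLE_4 = {
    "idT5-base-qaqg-v1.12": (98, 20),
    "indot5-bloom-specialized-v2 (LoRA)": (88, 30),
    "indot5-bloom-fft-v2 (FFT)": (100, 30),
    "Qwen-2.5-3B-Instruct": (86, 16),
    "Qwen-2.5-7B-Instruct": (66, 27),
    "Llama-3.1-8B-Instruct": (84, 17),
    "Gemma-2-9B-IT": (90, 20),
    "Mistral-7B-Instruct-v0.3": (84, 19),
}

# Gap in percentage points between structural compliance and semantic agreement.
gaps = {name: structural - semantic for name, (structural, semantic) in TABLE_4.items()}

# Largest gap first.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
for name, gap in ranked:
    print(f"{name}: {gap} points")
```

Every model in Table 4 shows a gap of at least 39 points, so the pattern is not specific to either the fine-tuned idT5 variants or the general-purpose LLMs.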

📝 Citations

  1. Ibrahim, F. (2026). Beyond Lexical Accuracy: Investigating the Structural and Semantic Compliance of idT5 in Indonesian Educational AQG. [Manuscript in preparation]
  2. Ibrahim, F. (2026). Parameter-Efficient vs. Full Fine-Tuning for Indonesian Educational Question Generation: A Comparative Study on idT5. [Manuscript in preparation]

Model size: 0.3B params (Safetensors) · Tensor type: F32