indot5-bloom-fft-v2
Usage & License Disclaimer

The raw evaluation data, matrices, and analytical findings presented in this Model Card are strictly associated with two upcoming academic publications:
- "Beyond Lexical Accuracy: Investigating the Structural and Semantic Compliance of idT5 in Indonesian Educational AQG"
- "Parameter-Efficient vs. Full Fine-Tuning for Indonesian Educational Question Generation: A Comparative Study on idT5"
Any extraction, reproduction, or use of the empirical data presented below without explicit prior authorization and citation of the forthcoming manuscripts is strictly prohibited to maintain academic integrity and prevent self-plagiarism issues prior to the official publication.
This model was trained on the Indo-Bloom Corpus and optimized for Indonesian Automatic Question Generation (AQG) aligned with Bloom's Taxonomy.
Below are the comprehensive evaluation results benchmarking this model against both domain-specific Small Language Models (SLMs) and global Large Language Models (LLMs). The raw metrics can be found in hasil_evaluasi_joiv.csv attached to this repository.
Lexical Similarity Performance (N-GRAM)

Comparative results on how well the model mimics human-written questions:

| Model Strategy | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|
| idT5-base (Muchad) | 0 | 2.76 | 0.27 | 2.63 |
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | 11.03 | 34.45 | 12.1 | 32.14 |
| indot5-bloom-specialized-v2 (LoRA - Ours) | 17.79 | 44.39 | 29.07 | 43.61 |
| indot5-bloom-fft-v2 (FFT - Ours) | 17.38 | 42.58 | 27.16 | 41.35 |
Pedagogical Compliance Performance (TCI)

Evaluation of the model's ability to follow Bloom's Taxonomy directives (Operational Verbs/KKO):

| Model Strategy | Structural (TCI-C1) | TCI-C2 (%) | TCI-All (%) |
|---|---|---|---|
| idT5-base (Muchad) | 0% | 30 | 15 |
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | 98% | 12 | 55 |
| indot5-bloom-specialized-v2 (LoRA - Ours) | 88% | 12 | 50 |
| indot5-bloom-fft-v2 (FFT - Ours) | 100% | 6 | 53 |
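
To make the idea of operational-verb (KKO) compliance concrete, here is a minimal sketch of a keyword-based check. It is an illustration only: the verb lists, the `follows_level` and `compliance_rate` helpers, and the matching rule are assumptions, not the procedure used to compute the TCI scores in this card.

```python
import re

# Hypothetical KKO (operational verb) lists per Bloom level.
# Illustrative only; not the lists behind the TCI scores reported here.
KKO_BY_LEVEL = {
    "C1": {"sebutkan", "tuliskan", "identifikasi"},
    "C2": {"jelaskan", "uraikan", "bandingkan"},
}


def follows_level(question: str, level: str) -> bool:
    """Return True if the question contains a verb assumed for `level`."""
    tokens = set(re.findall(r"\w+", question.lower()))
    return bool(tokens & KKO_BY_LEVEL[level])


def compliance_rate(questions, level: str) -> float:
    """Percentage of questions that pass the keyword check for `level`."""
    if not questions:
        return 0.0
    hits = sum(follows_level(q, level) for q in questions)
    return 100.0 * hits / len(questions)
```

A check of this kind only looks at surface wording, so a question can pass it without actually demanding the targeted cognitive level.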
Hybrid Evaluation Report (Lexical vs Structural vs Semantic)

This table summarizes the core findings of the research, highlighting the gap between structural obedience and semantic reasoning.

| Model Strategy | Lexical (BLEU) | Structural (TCI-C1) | Semantic (AI-Judge) |
|---|---|---|---|
| idT5-base (Muchad) | 0 | 0% | 5% |
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | 11.03 | 98% | 20% |
| indot5-bloom-specialized-v2 (LoRA - Ours) | 17.79 | 88% | 30% |
| indot5-bloom-fft-v2 (FFT - Ours) | 17.38 | 100% | 30% |
TABLE 1: LEXICAL SIMILARITY PERFORMANCE (MAIN METRICS)

| Model Strategy | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | 11.03 | 34.45 | 12.1 | 32.14 |
| indot5-bloom-specialized-v2 (LoRA - Ours) | 17.79 | 44.39 | 29.07 | 43.61 |
| indot5-bloom-fft-v2 (FFT - Ours) | 17.38 | 42.58 | 27.16 | 41.35 |
| Qwen-2.5-3B-Instruct (Alibaba) | 3.38 | 28.28 | 7.9 | 25.15 |
| Qwen-2.5-7B-Instruct (Alibaba) | 6.3 | 28.98 | 9.02 | 25.68 |
| Llama-3.1-8B-Instruct (Meta) | 8.47 | 33.95 | 13.61 | 31.73 |
| Gemma-2-9B-IT (Google) | 12.62 | 39.88 | 18.77 | 38.35 |
| Mistral-7B-Instruct-v0.3 (Mistral) | 4.31 | 25.15 | 7.95 | 22.73 |
TABLE 2: LEXICAL SIMILARITY PERFORMANCE (N-GRAM DETAILS)

| Model Strategy | Dataset | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L |
|---|---|---|---|---|---|---|
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | Indo-Bloom (Test) | 0.35 | 0.14 | 0.07 | 0.05 | 32.14 |
| indot5-bloom-specialized-v2 (LoRA - Ours) | Indo-Bloom (Test) | 0.43 | 0.22 | 0.13 | 0.09 | 43.61 |
| indot5-bloom-fft-v2 (FFT - Ours) | Indo-Bloom (Test) | 0.41 | 0.2 | 0.13 | 0.09 | 41.35 |
| Qwen-2.5-3B-Instruct (Alibaba) | Indo-Bloom (Test) | 0.23 | 0.05 | 0.01 | 0.01 | 25.15 |
| Qwen-2.5-7B-Instruct (Alibaba) | Indo-Bloom (Test) | 0.27 | 0.08 | 0.03 | 0.02 | 25.68 |
| Llama-3.1-8B-Instruct (Meta) | Indo-Bloom (Test) | 0.29 | 0.11 | 0.05 | 0.03 | 31.73 |
| Gemma-2-9B-IT (Google) | Indo-Bloom (Test) | 0.34 | 0.15 | 0.09 | 0.06 | 38.35 |
| Mistral-7B-Instruct-v0.3 (Mistral) | Indo-Bloom (Test) | 0.19 | 0.05 | 0.02 | 0.02 | 22.73 |
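
The BLEU-n columns are built on n-gram overlap between a generated question and its human-written reference. The sketch below shows the clipped n-gram precision at the heart of BLEU, assuming simple whitespace tokenization. It omits the brevity penalty and any smoothing, and this card does not state whether the reported BLEU-n values are individual or cumulative, so the function is not expected to reproduce the table exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference (the "clipping" step of BLEU).
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Example: three of the four candidate unigrams appear in the reference.
p1 = clipped_precision("apa fungsi utama daun", "apa fungsi daun", 1)
p2 = clipped_precision("apa fungsi utama daun", "apa fungsi daun", 2)
```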
TABLE 3: PEDAGOGICAL COMPLIANCE PERFORMANCE (TCI)

| Model Strategy | Structural (TCI-C1) | TCI-C2 (%) | TCI-All (%) |
|---|---|---|---|
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | 98% | 12 | 55 |
| indot5-bloom-specialized-v2 (LoRA - Ours) | 88% | 12 | 50 |
| indot5-bloom-fft-v2 (FFT - Ours) | 100% | 6 | 53 |
| Qwen-2.5-3B-Instruct (Alibaba) | 86% | 52 | 69 |
| Qwen-2.5-7B-Instruct (Alibaba) | 66% | 16 | 41 |
| Llama-3.1-8B-Instruct (Meta) | 84% | 8 | 46 |
| Gemma-2-9B-IT (Google) | 90% | 26 | 58 |
| Mistral-7B-Instruct-v0.3 (Mistral) | 84% | 8 | 46 |
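
The TCI-All values above are numerically consistent with the unweighted mean of TCI-C1 and TCI-C2. That relationship is inferred from the numbers in the table rather than stated by the authors, and the snippet below simply checks it against the reported rows (`mean_tci` and `check_table` are illustrative helper names).

```python
# (TCI-C1, TCI-C2, TCI-All) as reported in Table 3.
TABLE_3 = {
    "idT5-base-qaqg-v1.12": (98, 12, 55),
    "indot5-bloom-specialized-v2": (88, 12, 50),
    "indot5-bloom-fft-v2": (100, 6, 53),
    "Qwen-2.5-3B-Instruct": (86, 52, 69),
    "Qwen-2.5-7B-Instruct": (66, 16, 41),
    "Llama-3.1-8B-Instruct": (84, 8, 46),
    "Gemma-2-9B-IT": (90, 26, 58),
    "Mistral-7B-Instruct-v0.3": (84, 8, 46),
}


def mean_tci(c1: float, c2: float) -> float:
    """Unweighted mean of the two per-level scores (assumed aggregation)."""
    return (c1 + c2) / 2


def check_table(rows, tolerance: float = 0.5):
    """Names of models whose reported TCI-All deviates from the mean."""
    return [
        name
        for name, (c1, c2, reported) in rows.items()
        if abs(mean_tci(c1, c2) - reported) > tolerance
    ]


mismatches = check_table(TABLE_3)
```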
TABLE 4: HYBRID EVALUATION REPORT (SLM vs LLM)

| Model Strategy | Lexical (BLEU) | Structural (TCI-C1) | Semantic (AI-Judge) |
|---|---|---|---|
| idT5-base-qaqg-v1.12 (Awalurahman [1]) | 11.03 | 98% | 20% |
| indot5-bloom-specialized-v2 (LoRA - Ours) | 17.79 | 88% | 30% |
| indot5-bloom-fft-v2 (FFT - Ours) | 17.38 | 100% | 30% |
| Qwen-2.5-3B-Instruct (Alibaba) | 3.38 | 86% | 16% |
| Qwen-2.5-7B-Instruct (Alibaba) | 6.3 | 66% | 27% |
| Llama-3.1-8B-Instruct (Meta) | 8.47 | 84% | 17% |
| Gemma-2-9B-IT (Google) | 12.62 | 90% | 20% |
| Mistral-7B-Instruct-v0.3 (Mistral) | 4.31 | 84% | 19% |
Training Convergence (FFT Phase)

The model reached stable convergence with a final Validation Loss of 7.33 at epoch 30.

| Phase | Epoch | Training Loss | Validation Loss |
|---|---|---|---|
| Final | 30 | 29.9418 | 7.3284 |
Key Findings

- The FFT-C1 Mastery: The FFT model reaches 100% structural compliance on C1 questions (TCI-C1), the highest in the comparison, although its TCI-C2 score of 6 is the lowest.
- LLM Pedagogical Disobedience: General-purpose LLMs such as Llama-3.1-8B (84%) and Qwen-2.5-7B (66%) follow rigid Bloom's Taxonomy syntax less reliably than the fully fine-tuned idT5 (100%). Gemma-2-9B (90%) is the strongest of the LLMs and scores slightly above the LoRA variant (88%).
- The Structural-Semantic Gap: Structural compliance on C1 is high for every fine-tuned and instruction-tuned model (66% to 100%), but compliance drops sharply on C2 questions (TCI-C2 of 6 to 52) and semantic agreement from the AI judge never exceeds 30%. This indicates that models learn the syntax of questions better than their deep meaning.
- Future Work: These results motivate a transition to Representation Engineering as a way to bridge this semantic gap.
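
One simple way to quantify the structural-semantic gap is to subtract the AI-Judge score from TCI-C1 for each model in Table 4. The snippet below does this with the reported values. The difference in percentage points is an illustrative measure chosen here, not a metric defined by the authors.

```python
# (Structural TCI-C1 %, Semantic AI-Judge %) as reported in Table 4.
TABLE_4 = {
    "idT5-base-qaqg-v1.12": (98, 20),
    "indot5-bloom-specialized-v2": (88, 30),
    "indot5-bloom-fft-v2": (100, 30),
    "Qwen-2.5-3B-Instruct": (86, 16),
    "Qwen-2.5-7B-Instruct": (66, 27),
    "Llama-3.1-8B-Instruct": (84, 17),
    "Gemma-2-9B-IT": (90, 20),
    "Mistral-7B-Instruct-v0.3": (84, 19),
}


def structural_semantic_gap(rows):
    """Gap in percentage points between structural and semantic scores."""
    return {name: c1 - judge for name, (c1, judge) in rows.items()}


gaps = structural_semantic_gap(TABLE_4)

# Print models from the largest gap to the smallest.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gap} points")
```

Every model shows a gap of at least 39 points, which is the pattern the third finding describes.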
Citations

1. Ibrahim, F. (2026). Beyond Lexical Accuracy: Investigating the Structural and Semantic Compliance of idT5 in Indonesian Educational AQG. [Manuscript in preparation]
2. Ibrahim, F. (2026). Parameter-Efficient vs. Full Fine-Tuning for Indonesian Educational Question Generation: A Comparative Study on idT5. [Manuscript in preparation]