indot5-bloom-fft-v2 🚀

⚠️ Usage & License Disclaimer

The raw evaluation data, matrices, and analytical findings presented in this Model Card are strictly associated with two upcoming academic publications:

"Beyond Lexical Accuracy: Investigating the Structural and Semantic Compliance of idT5 in Indonesian Educational AQG"

"Parameter-Efficient vs. Full Fine-Tuning for Indonesian Educational Question Generation: A Comparative Study on idT5"

Any extraction, reproduction, or use of the empirical data presented below without explicit prior authorization and citation of the forthcoming manuscripts is strictly prohibited to maintain academic integrity and prevent self-plagiarism issues prior to the official publication.

This model was trained on the Indo-Bloom Corpus (available here) and is optimized for Indonesian Automatic Question Generation (AQG) aligned with Bloom's Taxonomy.

Below are the comprehensive evaluation results benchmarking this model against both domain-specific Small Language Models (SLMs) and global Large Language Models (LLMs). The raw metrics can be found in hasil_evaluasi_joiv.csv attached to this repository.

📊 Lexical Similarity Performance (N-GRAM)

Comparative results on how well the model mimics human-written questions:

Model Strategy	BLEU	ROUGE-1	ROUGE-2	ROUGE-L
idT5-base (Muchad)	0	2.76	0.27	2.63
idT5-base-qaqg-v1.12 (Awalurahman [1])	11.03	34.45	12.1	32.14
indot5-bloom-specialized-v2 (LoRA - Ours)	17.79	44.39	29.07	43.61
indot5-bloom-fft-v2 (FFT - Ours)	17.38	42.58	27.16	41.35

🎓 Pedagogical Compliance Performance (TCI)

Evaluation of the model's ability to follow Bloom's Taxonomy directives (Operational Verbs/KKO):

Model Strategy	Structural (TCI-C1)	TCI-C2 (%)	TCI-All (%)
idT5-base (Muchad)	0%	30	15
idT5-base-qaqg-v1.12 (Awalurahman [1])	98%	12	55
indot5-bloom-specialized-v2 (LoRA - Ours)	88%	12	50
indot5-bloom-fft-v2 (FFT - Ours)	100%	6	53

⚖️ Hybrid Evaluation Report (Lexical vs Structural vs Semantic)

This table summarizes the core findings of the research, highlighting the gap between structural obedience and semantic reasoning.

Model Strategy	Lexical (BLEU)	Structural (TCI-C1)	Semantic (AI-Judge)
idT5-base (Muchad)	0	0%	5%
idT5-base-qaqg-v1.12 (Awalurahman [1])	11.03	98%	20%
indot5-bloom-specialized-v2 (LoRA - Ours)	17.79	88%	30%
indot5-bloom-fft-v2 (FFT - Ours)	17.38	100%	30%

===============================================================================================
TABLE 1: LEXICAL SIMILARITY PERFORMANCE (MAIN METRICS)
===============================================================================================

Model Strategy	BLEU	ROUGE-1	ROUGE-2	ROUGE-L
idT5-base-qaqg-v1.12 (Awalurahman [1])	11.03	34.45	12.1	32.14
indot5-bloom-specialized-v2 (LoRA - Ours)	17.79	44.39	29.07	43.61
indot5-bloom-fft-v2 (FFT - Ours)	17.38	42.58	27.16	41.35
Qwen-2.5-3B-Instruct (Alibaba)	3.38	28.28	7.9	25.15
Qwen-2.5-7B-Instruct (Alibaba)	6.3	28.98	9.02	25.68
Llama-3.1-8B-Instruct (Meta)	8.47	33.95	13.61	31.73
Gemma-2-9B-IT (Google)	12.62	39.88	18.77	38.35
Mistral-7B-Instruct-v0.3 (Mistral)	4.31	25.15	7.95	22.73

===============================================================================================
TABLE 2: LEXICAL SIMILARITY PERFORMANCE (N-GRAM DETAILS)
===============================================================================================

Model Strategy	Dataset	BLEU 1	BLEU 2	BLEU 3	BLEU 4	ROUGE-L
idT5-base-qaqg-v1.12 (Awalurahman [1])	Indo-Bloom (Test)	0.35	0.14	0.07	0.05	32.14
indot5-bloom-specialized-v2 (LoRA - Ours)	Indo-Bloom (Test)	0.43	0.22	0.13	0.09	43.61
indot5-bloom-fft-v2 (FFT - Ours)	Indo-Bloom (Test)	0.41	0.2	0.13	0.09	41.35
Qwen-2.5-3B-Instruct (Alibaba)	Indo-Bloom (Test)	0.23	0.05	0.01	0.01	25.15
Qwen-2.5-7B-Instruct (Alibaba)	Indo-Bloom (Test)	0.27	0.08	0.03	0.02	25.68
Llama-3.1-8B-Instruct (Meta)	Indo-Bloom (Test)	0.29	0.11	0.05	0.03	31.73
Gemma-2-9B-IT (Google)	Indo-Bloom (Test)	0.34	0.15	0.09	0.06	38.35
Mistral-7B-Instruct-v0.3 (Mistral)	Indo-Bloom (Test)	0.19	0.05	0.02	0.02	22.73

===============================================================================================
TABLE 3: PEDAGOGICAL COMPLIANCE PERFORMANCE (TCI)
===============================================================================================

Model Strategy	Structural (TCI-C1)	TCI-C2 (%)	TCI-All (%)
idT5-base-qaqg-v1.12 (Awalurahman [1])	98%	12	55
indot5-bloom-specialized-v2 (LoRA - Ours)	88%	12	50
indot5-bloom-fft-v2 (FFT - Ours)	100%	6	53
Qwen-2.5-3B-Instruct (Alibaba)	86%	52	69
Qwen-2.5-7B-Instruct (Alibaba)	66%	16	41
Llama-3.1-8B-Instruct (Meta)	84%	8	46
Gemma-2-9B-IT (Google)	90%	26	58
Mistral-7B-Instruct-v0.3 (Mistral)	84%	8	46
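
The TCI-All column is not defined in this card. As a reading aid, the sketch below checks one plausible interpretation: that TCI-All is the unweighted mean of TCI-C1 and TCI-C2. This is an assumption inferred from the numbers in Table 3, not a definition confirmed by the authors, and the names `TABLE_3`, `mean_tci`, and `check_table` are illustrative only.

```python
# Values copied from Table 3: (TCI-C1 %, TCI-C2 %, reported TCI-All %).
TABLE_3 = {
    "idT5-base-qaqg-v1.12": (98, 12, 55),
    "indot5-bloom-specialized-v2 (LoRA)": (88, 12, 50),
    "indot5-bloom-fft-v2 (FFT)": (100, 6, 53),
    "Qwen-2.5-3B-Instruct": (86, 52, 69),
    "Qwen-2.5-7B-Instruct": (66, 16, 41),
    "Llama-3.1-8B-Instruct": (84, 8, 46),
    "Gemma-2-9B-IT": (90, 26, 58),
    "Mistral-7B-Instruct-v0.3": (84, 8, 46),
}


def mean_tci(c1, c2):
    """Assumed aggregation: unweighted mean of the two per-level scores."""
    return (c1 + c2) / 2


def check_table(rows):
    """Map each model to (computed mean, reported TCI-All, whether they agree)."""
    result = {}
    for model, (c1, c2, reported) in rows.items():
        computed = mean_tci(c1, c2)
        result[model] = (computed, reported, computed == reported)
    return result


if __name__ == "__main__":
    for model, (computed, reported, ok) in check_table(TABLE_3).items():
        status = "matches" if ok else "differs"
        print(f"{model}: mean={computed:g}, reported={reported} ({status})")
```

If the assumption holds, TCI-All weights both Bloom levels equally, so a model with very high C1 compliance and very low C2 compliance can still land in the middle of the ranking.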

===============================================================================================
TABLE 4: HYBRID EVALUATION REPORT (SLM vs LLM)
===============================================================================================

Model Strategy	Lexical (BLEU)	Structural (TCI-C1)	Semantic (AI-Judge)
idT5-base-qaqg-v1.12 (Awalurahman [1])	11.03	98%	20%
indot5-bloom-specialized-v2 (LoRA - Ours)	17.79	88%	30%
indot5-bloom-fft-v2 (FFT - Ours)	17.38	100%	30%
Qwen-2.5-3B-Instruct (Alibaba)	3.38	86%	16%
Qwen-2.5-7B-Instruct (Alibaba)	6.3	66%	27%
Llama-3.1-8B-Instruct (Meta)	8.47	84%	17%
Gemma-2-9B-IT (Google)	12.62	90%	20%
Mistral-7B-Instruct-v0.3 (Mistral)	4.31	84%	19%
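
The gap between structural obedience and semantic reasoning can be made explicit by subtracting the AI-Judge score from the TCI-C1 score for each row of Table 4. The sketch below does only that arithmetic. Treating the difference in percentage points as a single "gap" figure is an illustrative simplification, since the two metrics measure different things, and the names `TABLE_4` and `structural_semantic_gap` are hypothetical.

```python
# Values copied from Table 4: (Structural TCI-C1 %, Semantic AI-Judge %).
TABLE_4 = {
    "idT5-base-qaqg-v1.12": (98, 20),
    "indot5-bloom-specialized-v2 (LoRA)": (88, 30),
    "indot5-bloom-fft-v2 (FFT)": (100, 30),
    "Qwen-2.5-3B-Instruct": (86, 16),
    "Qwen-2.5-7B-Instruct": (66, 27),
    "Llama-3.1-8B-Instruct": (84, 17),
    "Gemma-2-9B-IT": (90, 20),
    "Mistral-7B-Instruct-v0.3": (84, 19),
}


def structural_semantic_gap(rows):
    """Return {model: structural minus semantic}, in percentage points."""
    return {model: structural - semantic
            for model, (structural, semantic) in rows.items()}


if __name__ == "__main__":
    gaps = structural_semantic_gap(TABLE_4)
    # Largest gap first: high structural compliance, low semantic agreement.
    for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{model}: {gap} points")
```

Read this way, every model in Table 4 scores far higher on structure than on semantics, and the model with the smallest gap gets there through lower structural compliance rather than stronger semantic agreement.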

📈 Training Convergence (FFT Phase)

The model reached stable convergence with a final Validation Loss of 7.33 at epoch 30.

Phase	Epoch	Training Loss	Validation Loss
Final	30	29.9418	7.3284

💡 Key Findings

The FFT-C1 Mastery: The FFT model scores 100% on TCI-C1, the highest structural compliance of any model evaluated.
LLM Pedagogical Disobedience: General-purpose LLMs such as Llama-3.1 (84%) and Qwen-2.5 (66% to 86%) follow the rigid Bloom's Taxonomy syntax less reliably than the fully fine-tuned idT5 (100%).
The Structural-Semantic Gap: Structural compliance on C1 is high for every model in Table 3 (66% to 100%), but TCI-C2 scores are much lower (6% to 52%) and AI-Judge semantic agreement never exceeds 30%, indicating that models learn the syntax of higher-order questions better than their deep meaning.
Future Work: These results motivate a transition to Representation Engineering to bridge this semantic gap.

📝 Citations

1. Ibrahim, F. (2026). Beyond Lexical Accuracy: Investigating the Structural and Semantic Compliance of idT5 in Indonesian Educational AQG. [Manuscript in preparation]
2. Ibrahim, F. (2026). Parameter-Efficient vs. Full Fine-Tuning for Indonesian Educational Question Generation: A Comparative Study on idT5. [Manuscript in preparation]

Model size: 0.3B params (Safetensors, tensor type F32)