JudolGuard — Gemma 4 E2B (GGUF)

On-device AI for detecting Indonesian online gambling (judol) — fully private, runs on consumer hardware.

Part of the Gemma 4 Good Hackathon.

🏆 TL;DR: We benchmarked the base model, fine-tuned with 7,336 samples, and ran the same benchmark again. The base model won — 90.87% accuracy, 1.41s/image, zero training cost. The fine-tune was unnecessary.


🎯 What This Model Does

Detects Indonesian online gambling content in images — slot machine UIs, casino lobbies, deposit/withdrawal prompts, gambling banners, and gambling links in chat screenshots. Runs on llama.cpp with --reasoning off for 6× faster inference.

Built for the Indonesian gambling crisis: 3.2 million active gamblers, children as young as 8. Existing defenses use keyword matching — easily bypassed. This model sees the page.

🇮🇩 The Language of Judol

Indonesian online gambling has its own slang — a parallel lexicon that blends local terms with gaming jargon:

Slang Meaning Context
Slot Slot machine gambling "situs slot" = gambling site
Gacor "Singing loudly" (bird metaphor) A machine that's "hot" / paying out
Maxwin Maximum win Maximum payout from a spin
JP Jackpot "Dapet JP" = hit the jackpot
Scatter Bonus symbol Triggers free spins
RTP Return to Player Claimed payout percentage
Zeus / Olympus Gates of Olympus slot Most viral Pragmatic Play game
Depo Deposit "Depo 10rb" = deposit ~$0.60
WD Withdraw "WD instant" = marketing claim
Togel Toto Gelap (dark lottery) Illegal lottery
Pulsa Phone credit Deposit method

These terms appear in images, not text — which is why keyword filters fail.


📊 Benchmark Results

Full 1,468-Image Evaluation

Model Backend Acc Recall Prec F1 Speed
🏆 Base + --reasoning off llama.cpp 90.87% 82.43% 99.18% 90.03% 1.41s 🚀
LoRA FT Q5_K_M LM Studio 89.22% 78.55% 99.83% 87.92% 3.09s
LoRA FT Q4_K_M LM Studio 79.63% 59.40% 99.77% 74.47% 2.76s
Base + reasoning ON* LM Studio ~96.11% ~94.23% ~99.28% ~96.66% 8.68s

*200-sample subset — full 1,468 run had 259 timeout errors.

Confusion Matrix — Winner (Base + --reasoning off)

              Pred Gambling    Pred Safe
Actual Gamb   605              129
Actual Safe   5                729
  • 99.3% safe classification — 5 false positives in 734 safe images
  • 129 missed gambling images — primary improvement target
  • Zero infrastructure failures — no crashes, no timeouts

Per-Class Performance

Class Base + Reasoning OFF 🏆 Q5_K_M (LoRA FT) Δ
Gambling → Gambling (TP) 605/734 = 82.4% 🔥 575/732 = 78.6% +3.8%
Safe → Safe (TN) 729/734 = 99.3% 733/734 = 99.9% −0.6%
Safe → Gambling (FP) 5/734 = 0.7% 1/734 = 0.1% +4 FP
Gambling → Safe (FN) 129/734 = 17.6% 157/732 = 21.4% −28 fewer misses 🔥

Base model catches 28 more gambling images than the LoRA FT, at the cost of 4 more false positives.

Why the Fine-Tune Destroyed Recall

The Q4_K_M fine-tune collapsed from 82.4% recall (base) to 59.4% — missed nearly half of all gambling images:

Model Recall Caught (of 734) Missed
🏆 Base + --reasoning off 82.4% 605 129
LoRA FT Q5_K_M 78.6% 575 157
LoRA FT Q4_K_M 59.4% 436 298

4-bit quantization on an already-fragile LoRA strips away visual generalization — the model memorized training examples but can't generalize to new patterns.

Failure Mode Analysis

What the winning model still misses (129 images):

Category Est. % Example
Subtle gambling indicators ~40% Small gambling text overlay, no obvious casino UI
Stylized/creative banners ~25% Artistic ads that don't look like typical slot UIs
Partial/ambiguous content ~20% Chat screenshots with mixed content
Low resolution / small images ~10% Favicons, tiny thumbnails
Other edge cases ~5% Unclassifiable borderline content

Speed Benchmark

All tests on AMD Radeon 8GB VRAM, llama.cpp Vulkan backend.

Metric Single (cache miss) Single (cache hit) 4 parallel 8 parallel
Latency 1.1s ~100ms 4.2s 8.3s
Per-slot tok/s 250 65 34

--parallel 4 gives 2× faster per-image latency than --parallel 8 with same total throughput.


🔬 Model Variants

Variant File Size Quant Status
🏆 Base Q4_K_M (from unsloth) gemma-4-E2B-it-Q4_K_M.gguf 3.2 GB 4-bit Production
Vision encoder mmproj-BF16.gguf 942 MB BF16 Required
Q5_K_M (LoRA FT) judol-guard-e2b-q5_k_m.gguf 3.4 GB 5-bit ❌ Legacy
Q4_K_M (LoRA FT) judol-guard-e2b-q4_k_m.gguf 3.2 GB 4-bit ❌ Legacy
Q8_0 (LoRA FT) judol-guard-e2b-q8_0.gguf 4.7 GB 8-bit ❌ OOM on 8GB

⚠️ Q5_K_M, Q4_K_M, Q8_0 on this repo are the LoRA fine-tuned variants — surpassed by the base model. Kept for reference only. Use the base Q4_K_M from unsloth/gemma-4-E2B-it-GGUF for production.


🧠 Fine-Tuning (Archived)

We curated 7,336 multimodal samples and fine-tuned using Unsloth LoRA on a free Google Colab T4 GPU:

Source Gambling Safe
Website screenshots 1,834 1,834
Slot machine / casino UIs 834 834
Social media screenshots 500 500
Chat app gambling links 500 500
Total 3,668 3,668
Model:          Gemma 4 E2B (2.3B + 150M vision)
Platform:       Google Colab T4 GPU (16 GB VRAM)
Framework:      Unsloth (2× faster LoRA kernels)
Training:       200 steps, batch 2 × 4 accum = 8 effective
Learning rate:  2e-4, linear schedule
Duration:       ~50 minutes
LoRA rank:      16, all linear layers (q/k/v/o/gate/up/down)

Result: The fine-tune degraded performance — base model won on every metric.

Why It Failed

  1. Label noise — some "gambling" images were educational content about gambling dangers
  2. Catastrophic interference — LoRA rank 16 on all layers overwrites useful base representations
  3. Binary classification paradox — Gemma 4's pretraining already has strong latent representations for gambling content

🚀 Deployment

Recommended Setup

# Download the BASE model (not our fine-tune)
huggingface-cli download unsloth/gemma-4-E2B-it-GGUF \
  gemma-4-E2B-it-Q4_K_M.gguf \
  mmproj-gemma-4-E2B-it-BF16.gguf

# Start llama.cpp server with reasoning off
./llama-server \
  -m gemma-4-E2B-it-Q4_K_M.gguf \
  --mmproj mmproj-gemma-4-E2B-it-BF16.gguf \
  --ctx-size 4096 \
  --flash-attn on \
  --cont-batching \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --reasoning off \
  --parallel 4 \
  -ngl 999 \
  --host 0.0.0.0 \
  --port 1234 \
  --log-disable

Key Flags

Flag Value Why
--reasoning off 🚀 CRITICAL — 6× speedup for binary classification
--parallel 4 Optimal: 65 tok/s per slot on 8GB VRAM
--flash-attn on ~2-3× attention speedup on AMD ROCm
--ctx-size 4096 Multimodal vision needs large context per image
--cache-type-k/v q8_0 Q8_0 KV cache saves ~50% VRAM

⚠️ Do NOT use LM Studio. Its reasoning: false does NOT disable reasoning — it only hides the thinking tokens. You MUST use llama.cpp directly.

Environment

Component Spec
GPU AMD Radeon (8 GB VRAM), ROCm via Vulkan
CPU AMD Ryzen, 32 GB system RAM
Inference llama-server on port 1234
Vision encoder mmproj-BF16.gguf (942 MB)
Tunnel Cloudflare Tunnel → inference.server-fadil.my.id:443

Chrome Extension

A Plasmo Manifest V3 extension with hybrid detection:

  • Aho-Corasick keyword scan (100+ keywords, instant)
  • Per-image AI analysis with real-time blur (4 concurrent)
  • 3 blocking modes: Full / Selective / Hide
  • Dynamic content detection via MutationObserver (infinite scroll)

mrayhanfadil/gemma_extension


🔑 Key Findings

  1. Binary multimodal classification → --reasoning off is mandatory. 6× speedup, ~5% accuracy loss. LM Studio's reasoning: false is deceptive.

  2. LoRA fine-tuning on small domain datasets can degrade performance. The base model won on every metric against our 7,336-sample fine-tune. Test the base model first.

  3. GPU attention is the bottleneck, not model capacity. --parallel 4 gives 2× faster per-image latency than --parallel 8.

  4. Hybrid detection (AC keywords + AI vision) catches both text patterns (instant) and visual content (high accuracy).


📄 Repositories


Built for the Gemma 4 Good Hackathon by Fadil.

Downloads last month
295
GGUF
Model size
5B params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

4-bit

5-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support