Instructions to use fadiil/judol-guard-gemma4-e2b-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use fadiil/judol-guard-gemma4-e2b-gguf with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="fadiil/judol-guard-gemma4-e2b-gguf", filename="judol-guard-e2b-q4_k_m.gguf", )
llm.create_chat_completion( messages = "\"cats.jpg\"" )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use fadiil/judol-guard-gemma4-e2b-gguf with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M # Run inference directly in the terminal: llama-cli -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M # Run inference directly in the terminal: llama-cli -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
Use Docker
docker model run hf.co/fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use fadiil/judol-guard-gemma4-e2b-gguf with Ollama:
ollama run hf.co/fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
- Unsloth Studio
How to use fadiil/judol-guard-gemma4-e2b-gguf with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for fadiil/judol-guard-gemma4-e2b-gguf to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for fadiil/judol-guard-gemma4-e2b-gguf to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for fadiil/judol-guard-gemma4-e2b-gguf to start chatting
- Pi
How to use fadiil/judol-guard-gemma4-e2b-gguf with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use fadiil/judol-guard-gemma4-e2b-gguf with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use fadiil/judol-guard-gemma4-e2b-gguf with Docker Model Runner:
docker model run hf.co/fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
- Lemonade
How to use fadiil/judol-guard-gemma4-e2b-gguf with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
Run and chat with the model
lemonade run user.judol-guard-gemma4-e2b-gguf-Q4_K_M
List all available models
lemonade list
JudolGuard — Gemma 4 E2B (GGUF)
On-device AI for detecting Indonesian online gambling (judol) — fully private, runs on consumer hardware.
Part of the Gemma 4 Good Hackathon.
🏆 TL;DR: We benchmarked the base model, fine-tuned with 7,336 samples, and ran the same benchmark again. The base model won — 90.87% accuracy, 1.41s/image, zero training cost. The fine-tune was unnecessary.
🎯 What This Model Does
Detects Indonesian online gambling content in images — slot machine UIs, casino lobbies, deposit/withdrawal prompts, gambling banners, and gambling links in chat screenshots. Runs on llama.cpp with --reasoning off for 6× faster inference.
Built for the Indonesian gambling crisis: 3.2 million active gamblers, children as young as 8. Existing defenses use keyword matching — easily bypassed. This model sees the page.
🇮🇩 The Language of Judol
Indonesian online gambling has its own slang — a parallel lexicon that blends local terms with gaming jargon:
| Slang | Meaning | Context |
|---|---|---|
| Slot | Slot machine gambling | "situs slot" = gambling site |
| Gacor | "Singing loudly" (bird metaphor) | A machine that's "hot" / paying out |
| Maxwin | Maximum win | Maximum payout from a spin |
| JP | Jackpot | "Dapet JP" = hit the jackpot |
| Scatter | Bonus symbol | Triggers free spins |
| RTP | Return to Player | Claimed payout percentage |
| Zeus / Olympus | Gates of Olympus slot | Most viral Pragmatic Play game |
| Depo | Deposit | "Depo 10rb" = deposit ~$0.60 |
| WD | Withdraw | "WD instant" = marketing claim |
| Togel | Toto Gelap (dark lottery) | Illegal lottery |
| Pulsa | Phone credit | Deposit method |
These terms appear in images, not text — which is why keyword filters fail.
📊 Benchmark Results
Full 1,468-Image Evaluation
| Model | Backend | Acc | Recall | Prec | F1 | Speed |
|---|---|---|---|---|---|---|
🏆 Base + --reasoning off |
llama.cpp | 90.87% | 82.43% | 99.18% | 90.03% | 1.41s 🚀 |
| LoRA FT Q5_K_M | LM Studio | 89.22% | 78.55% | 99.83% | 87.92% | 3.09s |
| LoRA FT Q4_K_M | LM Studio | 79.63% | 59.40% | 99.77% | 74.47% | 2.76s |
| Base + reasoning ON* | LM Studio | ~96.11% | ~94.23% | ~99.28% | ~96.66% | 8.68s |
*200-sample subset — full 1,468 run had 259 timeout errors.
Confusion Matrix — Winner (Base + --reasoning off)
Pred Gambling Pred Safe
Actual Gamb 605 129
Actual Safe 5 729
- 99.3% safe classification — 5 false positives in 734 safe images
- 129 missed gambling images — primary improvement target
- Zero infrastructure failures — no crashes, no timeouts
Per-Class Performance
| Class | Base + Reasoning OFF 🏆 | Q5_K_M (LoRA FT) | Δ |
|---|---|---|---|
| Gambling → Gambling (TP) | 605/734 = 82.4% 🔥 | 575/732 = 78.6% | +3.8% |
| Safe → Safe (TN) | 729/734 = 99.3% | 733/734 = 99.9% | −0.6% |
| Safe → Gambling (FP) | 5/734 = 0.7% | 1/734 = 0.1% | +4 FP |
| Gambling → Safe (FN) | 129/734 = 17.6% | 157/732 = 21.4% | −28 fewer misses 🔥 |
Base model catches 28 more gambling images than the LoRA FT, at the cost of 4 more false positives.
Why the Fine-Tune Destroyed Recall
The Q4_K_M fine-tune collapsed from 82.4% recall (base) to 59.4% — missed nearly half of all gambling images:
| Model | Recall | Caught (of 734) | Missed |
|---|---|---|---|
🏆 Base + --reasoning off |
82.4% | 605 | 129 |
| LoRA FT Q5_K_M | 78.6% | 575 | 157 |
| LoRA FT Q4_K_M | 59.4% ❌ | 436 | 298 |
4-bit quantization on an already-fragile LoRA strips away visual generalization — the model memorized training examples but can't generalize to new patterns.
Failure Mode Analysis
What the winning model still misses (129 images):
| Category | Est. % | Example |
|---|---|---|
| Subtle gambling indicators | ~40% | Small gambling text overlay, no obvious casino UI |
| Stylized/creative banners | ~25% | Artistic ads that don't look like typical slot UIs |
| Partial/ambiguous content | ~20% | Chat screenshots with mixed content |
| Low resolution / small images | ~10% | Favicons, tiny thumbnails |
| Other edge cases | ~5% | Unclassifiable borderline content |
Speed Benchmark
All tests on AMD Radeon 8GB VRAM, llama.cpp Vulkan backend.
| Metric | Single (cache miss) | Single (cache hit) | 4 parallel | 8 parallel |
|---|---|---|---|---|
| Latency | 1.1s | ~100ms | 4.2s | 8.3s |
| Per-slot tok/s | 250 | — | 65 | 34 |
--parallel 4gives 2× faster per-image latency than--parallel 8with same total throughput.
🔬 Model Variants
| Variant | File | Size | Quant | Status |
|---|---|---|---|---|
| 🏆 Base Q4_K_M (from unsloth) | gemma-4-E2B-it-Q4_K_M.gguf |
3.2 GB | 4-bit | Production |
| Vision encoder | mmproj-BF16.gguf |
942 MB | BF16 | Required |
| Q5_K_M (LoRA FT) | judol-guard-e2b-q5_k_m.gguf |
3.4 GB | 5-bit | ❌ Legacy |
| Q4_K_M (LoRA FT) | judol-guard-e2b-q4_k_m.gguf |
3.2 GB | 4-bit | ❌ Legacy |
| Q8_0 (LoRA FT) | judol-guard-e2b-q8_0.gguf |
4.7 GB | 8-bit | ❌ OOM on 8GB |
⚠️ Q5_K_M, Q4_K_M, Q8_0 on this repo are the LoRA fine-tuned variants — surpassed by the base model. Kept for reference only. Use the base Q4_K_M from unsloth/gemma-4-E2B-it-GGUF for production.
🧠 Fine-Tuning (Archived)
We curated 7,336 multimodal samples and fine-tuned using Unsloth LoRA on a free Google Colab T4 GPU:
| Source | Gambling | Safe |
|---|---|---|
| Website screenshots | 1,834 | 1,834 |
| Slot machine / casino UIs | 834 | 834 |
| Social media screenshots | 500 | 500 |
| Chat app gambling links | 500 | 500 |
| Total | 3,668 | 3,668 |
Model: Gemma 4 E2B (2.3B + 150M vision)
Platform: Google Colab T4 GPU (16 GB VRAM)
Framework: Unsloth (2× faster LoRA kernels)
Training: 200 steps, batch 2 × 4 accum = 8 effective
Learning rate: 2e-4, linear schedule
Duration: ~50 minutes
LoRA rank: 16, all linear layers (q/k/v/o/gate/up/down)
Result: The fine-tune degraded performance — base model won on every metric.
Why It Failed
- Label noise — some "gambling" images were educational content about gambling dangers
- Catastrophic interference — LoRA rank 16 on all layers overwrites useful base representations
- Binary classification paradox — Gemma 4's pretraining already has strong latent representations for gambling content
🚀 Deployment
Recommended Setup
# Download the BASE model (not our fine-tune)
huggingface-cli download unsloth/gemma-4-E2B-it-GGUF \
gemma-4-E2B-it-Q4_K_M.gguf \
mmproj-gemma-4-E2B-it-BF16.gguf
# Start llama.cpp server with reasoning off
./llama-server \
-m gemma-4-E2B-it-Q4_K_M.gguf \
--mmproj mmproj-gemma-4-E2B-it-BF16.gguf \
--ctx-size 4096 \
--flash-attn on \
--cont-batching \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--reasoning off \
--parallel 4 \
-ngl 999 \
--host 0.0.0.0 \
--port 1234 \
--log-disable
Key Flags
| Flag | Value | Why |
|---|---|---|
--reasoning off |
— | 🚀 CRITICAL — 6× speedup for binary classification |
--parallel |
4 | Optimal: 65 tok/s per slot on 8GB VRAM |
--flash-attn |
on |
~2-3× attention speedup on AMD ROCm |
--ctx-size |
4096 | Multimodal vision needs large context per image |
--cache-type-k/v |
q8_0 |
Q8_0 KV cache saves ~50% VRAM |
⚠️ Do NOT use LM Studio. Its
reasoning: falsedoes NOT disable reasoning — it only hides the thinking tokens. You MUST use llama.cpp directly.
Environment
| Component | Spec |
|---|---|
| GPU | AMD Radeon (8 GB VRAM), ROCm via Vulkan |
| CPU | AMD Ryzen, 32 GB system RAM |
| Inference | llama-server on port 1234 |
| Vision encoder | mmproj-BF16.gguf (942 MB) |
| Tunnel | Cloudflare Tunnel → inference.server-fadil.my.id:443 |
Chrome Extension
A Plasmo Manifest V3 extension with hybrid detection:
- Aho-Corasick keyword scan (100+ keywords, instant)
- Per-image AI analysis with real-time blur (4 concurrent)
- 3 blocking modes: Full / Selective / Hide
- Dynamic content detection via MutationObserver (infinite scroll)
→ mrayhanfadil/gemma_extension
🔑 Key Findings
Binary multimodal classification →
--reasoning offis mandatory. 6× speedup, ~5% accuracy loss. LM Studio'sreasoning: falseis deceptive.LoRA fine-tuning on small domain datasets can degrade performance. The base model won on every metric against our 7,336-sample fine-tune. Test the base model first.
GPU attention is the bottleneck, not model capacity.
--parallel 4gives 2× faster per-image latency than--parallel 8.Hybrid detection (AC keywords + AI vision) catches both text patterns (instant) and visual content (high accuracy).
📄 Repositories
- Model (here): fadiil/judol-guard-gemma4-e2b-gguf
- Full project: mrayhanfadil/gemma — benchmarks, fine-tuning, Android app scaffold
- Extension: mrayhanfadil/gemma_extension — Chrome extension with full WRITEUP
- Full writeup: WRITEUP.md
Built for the Gemma 4 Good Hackathon by Fadil.
- Downloads last month
- 295
4-bit
5-bit
8-bit