Instructions to use fadiil/judol-guard-gemma4-e2b-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use fadiil/judol-guard-gemma4-e2b-gguf with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="fadiil/judol-guard-gemma4-e2b-gguf",
	filename="judol-guard-e2b-q4_k_m.gguf",
)

llm.create_chat_completion(
	messages = "\"cats.jpg\""
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use fadiil/judol-guard-gemma4-e2b-gguf with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M

Use Docker

docker model run hf.co/fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M

LM Studio
Jan
Ollama
How to use fadiil/judol-guard-gemma4-e2b-gguf with Ollama:
```
ollama run hf.co/fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
```

Unsloth Studio

How to use fadiil/judol-guard-gemma4-e2b-gguf with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for fadiil/judol-guard-gemma4-e2b-gguf to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for fadiil/judol-guard-gemma4-e2b-gguf to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for fadiil/judol-guard-gemma4-e2b-gguf to start chatting

How to use fadiil/judol-guard-gemma4-e2b-gguf with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use fadiil/judol-guard-gemma4-e2b-gguf with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use fadiil/judol-guard-gemma4-e2b-gguf with Docker Model Runner:
```
docker model run hf.co/fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M
```

Lemonade

How to use fadiil/judol-guard-gemma4-e2b-gguf with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull fadiil/judol-guard-gemma4-e2b-gguf:Q4_K_M

Run and chat with the model

lemonade run user.judol-guard-gemma4-e2b-gguf-Q4_K_M

List all available models

lemonade list

JudolGuard — Gemma 4 E2B (GGUF)

On-device AI for detecting Indonesian online gambling (judol) — fully private, runs on consumer hardware.

Part of the Gemma 4 Good Hackathon.

🏆 TL;DR: We benchmarked the base model, fine-tuned with 7,336 samples, and ran the same benchmark again. The base model won — 90.87% accuracy, 1.41s/image, zero training cost. The fine-tune was unnecessary.

🎯 What This Model Does

Detects Indonesian online gambling content in images — slot machine UIs, casino lobbies, deposit/withdrawal prompts, gambling banners, and gambling links in chat screenshots. Runs on llama.cpp with --reasoning off for 6× faster inference.

Built for the Indonesian gambling crisis: 3.2 million active gamblers, children as young as 8. Existing defenses use keyword matching — easily bypassed. This model sees the page.

🇮🇩 The Language of Judol

Indonesian online gambling has its own slang — a parallel lexicon that blends local terms with gaming jargon:

Slang	Meaning	Context
Slot	Slot machine gambling	"situs slot" = gambling site
Gacor	"Singing loudly" (bird metaphor)	A machine that's "hot" / paying out
Maxwin	Maximum win	Maximum payout from a spin
JP	Jackpot	"Dapet JP" = hit the jackpot
Scatter	Bonus symbol	Triggers free spins
RTP	Return to Player	Claimed payout percentage
Zeus / Olympus	Gates of Olympus slot	Most viral Pragmatic Play game
Depo	Deposit	"Depo 10rb" = deposit ~$0.60
WD	Withdraw	"WD instant" = marketing claim
Togel	Toto Gelap (dark lottery)	Illegal lottery
Pulsa	Phone credit	Deposit method

These terms appear in images, not text — which is why keyword filters fail.

📊 Benchmark Results

Full 1,468-Image Evaluation

Model	Backend	Acc	Recall	Prec	F1	Speed
🏆 Base + `--reasoning off`	llama.cpp	90.87%	82.43%	99.18%	90.03%	1.41s 🚀
LoRA FT Q5_K_M	LM Studio	89.22%	78.55%	99.83%	87.92%	3.09s
LoRA FT Q4_K_M	LM Studio	79.63%	59.40%	99.77%	74.47%	2.76s
Base + reasoning ON*	LM Studio	~96.11%	~94.23%	~99.28%	~96.66%	8.68s

*200-sample subset — full 1,468 run had 259 timeout errors.

Confusion Matrix — Winner (Base + `--reasoning off`)

              Pred Gambling    Pred Safe
Actual Gamb   605              129
Actual Safe   5                729

99.3% safe classification — 5 false positives in 734 safe images
129 missed gambling images — primary improvement target
Zero infrastructure failures — no crashes, no timeouts

Per-Class Performance

Class	Base + Reasoning OFF 🏆	Q5_K_M (LoRA FT)	Δ
Gambling → Gambling (TP)	605/734 = 82.4% 🔥	575/732 = 78.6%	+3.8%
Safe → Safe (TN)	729/734 = 99.3%	733/734 = 99.9%	−0.6%
Safe → Gambling (FP)	5/734 = 0.7%	1/734 = 0.1%	+4 FP
Gambling → Safe (FN)	129/734 = 17.6%	157/732 = 21.4%	−28 fewer misses 🔥

Base model catches 28 more gambling images than the LoRA FT, at the cost of 4 more false positives.

Why the Fine-Tune Destroyed Recall

The Q4_K_M fine-tune collapsed from 82.4% recall (base) to 59.4% — missed nearly half of all gambling images:

Model	Recall	Caught (of 734)	Missed
🏆 Base + `--reasoning off`	82.4%	605	129
LoRA FT Q5_K_M	78.6%	575	157
LoRA FT Q4_K_M	59.4% ❌	436	298

4-bit quantization on an already-fragile LoRA strips away visual generalization — the model memorized training examples but can't generalize to new patterns.

Failure Mode Analysis

What the winning model still misses (129 images):

Category	Est. %	Example
Subtle gambling indicators	~40%	Small gambling text overlay, no obvious casino UI
Stylized/creative banners	~25%	Artistic ads that don't look like typical slot UIs
Partial/ambiguous content	~20%	Chat screenshots with mixed content
Low resolution / small images	~10%	Favicons, tiny thumbnails
Other edge cases	~5%	Unclassifiable borderline content

Speed Benchmark

All tests on AMD Radeon 8GB VRAM, llama.cpp Vulkan backend.

Metric	Single (cache miss)	Single (cache hit)	4 parallel	8 parallel
Latency	1.1s	~100ms	4.2s	8.3s
Per-slot tok/s	250	—	65	34

--parallel 4 gives 2× faster per-image latency than --parallel 8 with same total throughput.

🔬 Model Variants

Variant	File	Size	Quant	Status
🏆 Base Q4_K_M (from unsloth)	`gemma-4-E2B-it-Q4_K_M.gguf`	3.2 GB	4-bit	Production
Vision encoder	`mmproj-BF16.gguf`	942 MB	BF16	Required
Q5_K_M (LoRA FT)	`judol-guard-e2b-q5_k_m.gguf`	3.4 GB	5-bit	❌ Legacy
Q4_K_M (LoRA FT)	`judol-guard-e2b-q4_k_m.gguf`	3.2 GB	4-bit	❌ Legacy
Q8_0 (LoRA FT)	`judol-guard-e2b-q8_0.gguf`	4.7 GB	8-bit	❌ OOM on 8GB

⚠️ Q5_K_M, Q4_K_M, Q8_0 on this repo are the LoRA fine-tuned variants — surpassed by the base model. Kept for reference only. Use the base Q4_K_M from unsloth/gemma-4-E2B-it-GGUF for production.

🧠 Fine-Tuning (Archived)

We curated 7,336 multimodal samples and fine-tuned using Unsloth LoRA on a free Google Colab T4 GPU:

Source	Gambling	Safe
Website screenshots	1,834	1,834
Slot machine / casino UIs	834	834
Social media screenshots	500	500
Chat app gambling links	500	500
Total	3,668	3,668

Model:          Gemma 4 E2B (2.3B + 150M vision)
Platform:       Google Colab T4 GPU (16 GB VRAM)
Framework:      Unsloth (2× faster LoRA kernels)
Training:       200 steps, batch 2 × 4 accum = 8 effective
Learning rate:  2e-4, linear schedule
Duration:       ~50 minutes
LoRA rank:      16, all linear layers (q/k/v/o/gate/up/down)

Result: The fine-tune degraded performance — base model won on every metric.

Why It Failed

Label noise — some "gambling" images were educational content about gambling dangers
Catastrophic interference — LoRA rank 16 on all layers overwrites useful base representations
Binary classification paradox — Gemma 4's pretraining already has strong latent representations for gambling content

🚀 Deployment

Recommended Setup

# Download the BASE model (not our fine-tune)
huggingface-cli download unsloth/gemma-4-E2B-it-GGUF \
  gemma-4-E2B-it-Q4_K_M.gguf \
  mmproj-gemma-4-E2B-it-BF16.gguf

# Start llama.cpp server with reasoning off
./llama-server \
  -m gemma-4-E2B-it-Q4_K_M.gguf \
  --mmproj mmproj-gemma-4-E2B-it-BF16.gguf \
  --ctx-size 4096 \
  --flash-attn on \
  --cont-batching \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --reasoning off \
  --parallel 4 \
  -ngl 999 \
  --host 0.0.0.0 \
  --port 1234 \
  --log-disable

Key Flags

Flag	Value	Why
`--reasoning off`	—	🚀 CRITICAL — 6× speedup for binary classification
`--parallel`	4	Optimal: 65 tok/s per slot on 8GB VRAM
`--flash-attn`	`on`	~2-3× attention speedup on AMD ROCm
`--ctx-size`	4096	Multimodal vision needs large context per image
`--cache-type-k/v`	`q8_0`	Q8_0 KV cache saves ~50% VRAM

⚠️ Do NOT use LM Studio. Its reasoning: false does NOT disable reasoning — it only hides the thinking tokens. You MUST use llama.cpp directly.

Environment

Component	Spec
GPU	AMD Radeon (8 GB VRAM), ROCm via Vulkan
CPU	AMD Ryzen, 32 GB system RAM
Inference	llama-server on port 1234
Vision encoder	mmproj-BF16.gguf (942 MB)
Tunnel	Cloudflare Tunnel → `inference.server-fadil.my.id:443`

Chrome Extension

A Plasmo Manifest V3 extension with hybrid detection:

Aho-Corasick keyword scan (100+ keywords, instant)
Per-image AI analysis with real-time blur (4 concurrent)
3 blocking modes: Full / Selective / Hide
Dynamic content detection via MutationObserver (infinite scroll)

→ mrayhanfadil/gemma_extension

🔑 Key Findings

Binary multimodal classification → --reasoning off is mandatory. 6× speedup, ~5% accuracy loss. LM Studio's reasoning: false is deceptive.
LoRA fine-tuning on small domain datasets can degrade performance. The base model won on every metric against our 7,336-sample fine-tune. Test the base model first.
GPU attention is the bottleneck, not model capacity. --parallel 4 gives 2× faster per-image latency than --parallel 8.
Hybrid detection (AC keywords + AI vision) catches both text patterns (instant) and visual content (high accuracy).

📄 Repositories

Model (here): fadiil/judol-guard-gemma4-e2b-gguf
Full project: mrayhanfadil/gemma — benchmarks, fine-tuning, Android app scaffold
Extension: mrayhanfadil/gemma_extension — Chrome extension with full WRITEUP
Full writeup: WRITEUP.md

Built for the Gemma 4 Good Hackathon by Fadil.

Downloads last month: 295

GGUF

Model size

5B params

Architecture

gemma4

Hardware compatibility

4-bit

5-bit

8-bit