Instructions to use spy5er/Gemma4NPC-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use spy5er/Gemma4NPC-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="spy5er/Gemma4NPC-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("spy5er/Gemma4NPC-it")
model = AutoModelForMultimodalLM.from_pretrained("spy5er/Gemma4NPC-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use spy5er/Gemma4NPC-it with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="spy5er/Gemma4NPC-it",
	filename="merged_float16-F16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

llama.cpp

How to use spy5er/Gemma4NPC-it with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf spy5er/Gemma4NPC-it:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf spy5er/Gemma4NPC-it:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf spy5er/Gemma4NPC-it:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf spy5er/Gemma4NPC-it:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf spy5er/Gemma4NPC-it:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf spy5er/Gemma4NPC-it:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf spy5er/Gemma4NPC-it:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf spy5er/Gemma4NPC-it:Q4_K_M

Use Docker

docker model run hf.co/spy5er/Gemma4NPC-it:Q4_K_M

vLLM

How to use spy5er/Gemma4NPC-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "spy5er/Gemma4NPC-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "spy5er/Gemma4NPC-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/spy5er/Gemma4NPC-it:Q4_K_M

SGLang

How to use spy5er/Gemma4NPC-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "spy5er/Gemma4NPC-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "spy5er/Gemma4NPC-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "spy5er/Gemma4NPC-it" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "spy5er/Gemma4NPC-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use spy5er/Gemma4NPC-it with Ollama:
```
ollama run hf.co/spy5er/Gemma4NPC-it:Q4_K_M
```

Unsloth Studio

How to use spy5er/Gemma4NPC-it with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for spy5er/Gemma4NPC-it to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for spy5er/Gemma4NPC-it to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for spy5er/Gemma4NPC-it to start chatting

Docker Model Runner
How to use spy5er/Gemma4NPC-it with Docker Model Runner:
```
docker model run hf.co/spy5er/Gemma4NPC-it:Q4_K_M
```

Lemonade

How to use spy5er/Gemma4NPC-it with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull spy5er/Gemma4NPC-it:Q4_K_M

Run and chat with the model

lemonade run user.Gemma4NPC-it-Q4_K_M

List all available models

lemonade list

Model Card for Gemma4NPC-12B-it

Model Details

Model Description

Gemma4NPC-12B-it is a 12 billion parameter language model specifically fine-tuned and aligned to serve as the backend for Non-Playable Characters (NPCs) in video games. Built on top of Google's Gemma 4 architecture, this model is designed to solve common issues faced when integrating Large Language Models into game engines: character consistency and structured data output.

Traditional language models often break immersion by referencing their nature as AI assistants, or fail to output data in a format that a game engine can parse. Gemma4NPC addresses this by combining Supervised Fine-Tuning (SFT) for strict roleplay adherence with Direct Preference Optimization (DPO) that rewards valid, machine-readable JSON output. Game developers can then parse the NPC's dialogue alongside numeric game-state updates (such as quest flags, inventory trades, or mood variables) directly in engines like Unity, Unreal Engine, or Godot.
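
As a rough illustration of the parsing step described above, here is a minimal sketch of how an engine-side script might consume one JSON reply. The `GameState` class and the `item` and `quest_flags` keys are hypothetical and chosen only for this example; only `dialogue` and `agreed_price` appear elsewhere in this card.

```python
import json
from dataclasses import dataclass, field


@dataclass
class GameState:
    """Hypothetical game state; the field names are illustrative only."""
    gold: int = 100
    inventory: list = field(default_factory=list)
    quest_flags: dict = field(default_factory=dict)


def apply_npc_reply(raw_reply: str, state: GameState) -> str:
    """Parse one JSON reply, apply its state updates, and return the dialogue."""
    reply = json.loads(raw_reply)  # raises json.JSONDecodeError on malformed output
    price = reply.get("agreed_price")
    item = reply.get("item")  # hypothetical key, not part of the card's schema
    # Only complete the trade if the player can afford it.
    if isinstance(price, int) and item and state.gold >= price:
        state.gold -= price
        state.inventory.append(item)
    state.quest_flags.update(reply.get("quest_flags", {}))  # hypothetical key
    return reply["dialogue"]


state = GameState()
line = apply_npc_reply(
    '{"dialogue": "Fine, 60 gold and it is yours.", "agreed_price": 60, '
    '"item": "amulet", "quest_flags": {"met_merchant": true}}',
    state,
)
print(line)
print(state)
```

The same pattern carries over to C# in Unity or GDScript in Godot: deserialize the reply once, then route each field to the system that owns it.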

Developed by: spy5er
Model type: Causal Language Model
Language(s): English
License: Gemma License
Finetuned from model: google/gemma-4-12b

Intended Uses & Limitations

Intended Use Cases:

Real-time NPC dialogue generation in video games.
Structured inference requiring strict JSON formatting alongside natural language.
Interactive storytelling and text-based roleplaying environments.

Limitations:

The model is heavily optimized for short-context, turn-based dialogue and may struggle with long-form essay generation or generalized assistant tasks.
It is fine-tuned to maintain character immersion; therefore, it will actively resist breaking character even if explicitly prompted to do so.
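
Because the model targets short-context, turn-based dialogue, one possible way to keep prompts short is to send only the most recent exchanges. The sketch below is an assumption about how a caller might do this, not part of the model or its training setup; `trim_history` and `max_turns` are hypothetical names.

```python
def trim_history(messages, max_turns=4):
    """Keep all system messages plus the last `max_turns` user/assistant exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    # Each exchange is one user message followed by one assistant message.
    recent = dialogue[-2 * max_turns:] if max_turns > 0 else []
    return system + recent


history = [{"role": "system", "content": "You are a goblin merchant."}]
for i in range(1, 4):
    history.append({"role": "user", "content": f"u{i}"})
    history.append({"role": "assistant", "content": f"a{i}"})

trimmed = trim_history(history, max_turns=2)
print([m["content"] for m in trimmed])
```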

Training Details

Training Data

The model was fine-tuned on a heavily sanitized and augmented subset of the PIPPA (Personal Interaction Pairs between People and AI) dataset. The data was structured using the ChatML format. For the alignment phase, preference pairs were synthetically generated to penalize out-of-character behavior and reward strict JSON formatting. For more details, refer to the accompanying Dataset Card.
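
To make the idea of a preference pair concrete, the following sketch builds one record with `prompt`, `chosen`, and `rejected` fields, a layout commonly used by preference-training libraries. The actual record layout and generation process used for this model are not described here, so the helper names and sample strings are purely illustrative assumptions.

```python
import json

# Keys taken from the example schema shown later in this card.
REQUIRED_KEYS = {"dialogue", "agreed_price"}


def is_valid_npc_json(text: str) -> bool:
    """Return True if `text` is a JSON object containing the required keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)


def make_pair(prompt: str, chosen: str, rejected: str) -> dict:
    """Build one preference record; the chosen side must be valid NPC JSON."""
    if not is_valid_npc_json(chosen):
        raise ValueError("chosen response must be valid NPC JSON")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


pair = make_pair(
    prompt="I will give you 50 gold for that Amulet!",
    chosen='{"dialogue": "Bah! 500 or nothing.", "agreed_price": 500}',
    rejected="As an AI language model, I cannot haggle.",
)
print(pair)
```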

Training Procedure

The training pipeline was executed in two distinct phases:

Supervised Fine-Tuning (SFT): The base Gemma-4-12B model was fine-tuned on the sanitized roleplay dataset to learn the basic grammar of acting as an NPC. The model was trained to output its responses wrapped inside structured JSON blocks.
Direct Preference Optimization (DPO): To discourage hallucinations and character breaks, the model then underwent DPO. Each training example paired a chosen response (in character, with valid JSON) with a rejected response (out of character or improperly formatted). DPO raises the likelihood of the chosen response relative to the rejected one, measured against a reference model, which discourages AI-style apologies and rewards adherence to game logic.
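
For readers unfamiliar with DPO, the per-pair objective can be sketched in a few lines of plain Python. This is the standard DPO loss, not the training code used for this model; the `beta` value and the log-probabilities below are made-up numbers for illustration.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed log-probabilities of each response.

    `beta` is an assumed value; the card does not state the one used in training.
    """
    # How much more the policy favors "chosen" over "rejected" than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)            # policy identical to reference
improved = dpo_loss(-10.0, -20.0, -15.0, -15.0)   # policy now prefers "chosen"
print(neutral, improved)
```

The loss falls as the policy assigns relatively more probability to the chosen response, which is the mechanism that pushes the model toward in-character, well-formed replies.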

Technical Specifications

Architecture

Gemma4NPC-12B-it retains the core architecture of Gemma 4. The model weights provided in this repository are available in both unquantized (Float16) and quantized (GGUF) formats.

Quantization (GGUF)

To support local inference on consumer hardware, including Apple Silicon (M-series Macs), the model has been quantized to the Q4_K_M GGUF format. This shrinks the roughly 24 GB Float16 model (12 billion parameters at 2 bytes each) to approximately 7.5 GB, while reportedly retaining over 95% of its reasoning quality. At that size the model can fit entirely within VRAM or unified memory, with reported speeds of 30 to 45 tokens per second.
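
The size figures above follow from simple arithmetic, sketched below. The 5 bits per weight used for Q4_K_M is an assumed round number chosen to match the approximate 7.5 GB figure; Q4_K_M mixes quantization types, so the effective bits per weight of a real file will differ somewhat.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


fp16_gb = model_size_gb(12e9, 16)  # Float16: 2 bytes per parameter
q4_gb = model_size_gb(12e9, 5)     # assumed ~5 effective bits per weight
print(f"Float16: {fp16_gb:.1f} GB, Q4_K_M (assumed): {q4_gb:.1f} GB")
```

Actual memory use at inference time is higher than the file size because the KV cache and runtime buffers also need space.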

Comparison with Other Models

Gemma4NPC-12B-it vs. chimbiwide/Gemma4NPC-E4B

There is another excellent community project, chimbiwide/Gemma4NPC-E4B, which targets similar NPC roleplay use cases. Here is a brief architectural and performance comparison to help you choose the right model for your project:

Model Size & Reasoning: Gemma4NPC-E4B is built on the smaller Gemma-4-E4B (4 billion parameter) foundation, which makes it highly efficient. Gemma4NPC-12B-it uses the 12 billion parameter base, which is intended to provide deeper logic and context tracking for complex, multi-layered player negotiations.
Engine Integration: Gemma4NPC-E4B outputs plain conversational text, so game events must be extracted with regular expressions or similar parsing. Gemma4NPC-12B-it is trained to output JSON that follows a fixed schema (e.g., {"dialogue": "...", "agreed_price": 500}), so the engine can update UI, inventories, and quests directly from the parsed fields.
Alignment: Gemma4NPC-E4B is trained via Supervised Fine-Tuning (LoRA). Gemma4NPC-12B-it uses a two-step SFT + Direct Preference Optimization (DPO) pipeline to penalize character breaks and item hallucinations.
Latency: As a 4B model, Gemma4NPC-E4B is naturally faster. Gemma4NPC-12B-it narrows the gap through Q4_K_M GGUF quantization, reaching 30 to 45 tokens per second on consumer hardware while keeping the 12 billion parameter base.

How to Get Started with the Model

You can run this model locally using llama.cpp or llama-cpp-python.

1. Download the Quantized Model:

huggingface-cli download spy5er/Gemma4NPC-it merged_float16.Q4_K_M.gguf --local-dir models/

2. Python Inference (llama-cpp-python Example):

from llama_cpp import Llama

# Load the model directly into your local GPU (Metal for Mac)
llm = Llama(
    model_path="models/merged_float16.Q4_K_M.gguf",
    n_gpu_layers=-1,  # Offload all layers to the GPU for maximum speed
    chat_format="gemma"
)

# Request a structured JSON output
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are Gringo the Greedy, a goblin merchant..."},
        {"role": "user", "content": "I will give you 50 gold for that Amulet!"}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "type": "object",
                "properties": {
                    "dialogue": {"type": "string"},
                    "agreed_price": {"type": "integer"}
                },
                "required": ["dialogue", "agreed_price"],
                "additionalProperties": False
            }
        }
    }
)

print(response["choices"][0]["message"]["content"])
# Example output (illustrative; actual responses will vary):
# {"dialogue": "50 gold?! Are you mad? Make it 500!", "agreed_price": 500}
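
For multi-turn conversations, the single call above can be wrapped in a small helper. The sketch below is one possible shape, not a prescribed integration: `npc_turn` and `fake_complete` are hypothetical names, and the stub stands in for the model so the example runs without downloading weights.

```python
import json


def npc_turn(history, player_text, complete):
    """Run one dialogue turn.

    `complete` is any callable that takes a message list and returns a
    chat-completion dict shaped like the llama-cpp-python response above.
    """
    history.append({"role": "user", "content": player_text})
    response = complete(history)
    content = response["choices"][0]["message"]["content"]
    try:
        reply = json.loads(content)
    except json.JSONDecodeError:
        # Fallback: treat unparseable output as plain dialogue with no trade.
        reply = {"dialogue": content, "agreed_price": None}
    history.append({"role": "assistant", "content": content})
    return reply


def fake_complete(messages):
    """Stub standing in for the model; returns a fixed, made-up reply."""
    content = '{"dialogue": "No deal.", "agreed_price": 0}'
    return {"choices": [{"message": {"content": content}}]}


history = [{"role": "system", "content": "You are Gringo the Greedy, a goblin merchant..."}]
reply = npc_turn(history, "I will give you 50 gold for that Amulet!", fake_complete)
print(reply["dialogue"], len(history))
```

With the real model, `complete` could be a small wrapper around `llm.create_chat_completion` that passes the same `response_format` shown in the example above.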

Downloads last month: 329

Safetensors

Model size: 12B params
Tensor type: BF16
