Instructions to use DuoNeural/Archon-Gemma-4-E4B-v2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use DuoNeural/Archon-Gemma-4-E4B-v2 with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="DuoNeural/Archon-Gemma-4-E4B-v2", filename="gemma-4-e4b-it.BF16-mmproj.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use DuoNeural/Archon-Gemma-4-E4B-v2 with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf DuoNeural/Archon-Gemma-4-E4B-v2:BF16 # Run inference directly in the terminal: llama-cli -hf DuoNeural/Archon-Gemma-4-E4B-v2:BF16
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf DuoNeural/Archon-Gemma-4-E4B-v2:BF16 # Run inference directly in the terminal: llama-cli -hf DuoNeural/Archon-Gemma-4-E4B-v2:BF16
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf DuoNeural/Archon-Gemma-4-E4B-v2:BF16 # Run inference directly in the terminal: ./llama-cli -hf DuoNeural/Archon-Gemma-4-E4B-v2:BF16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf DuoNeural/Archon-Gemma-4-E4B-v2:BF16 # Run inference directly in the terminal: ./build/bin/llama-cli -hf DuoNeural/Archon-Gemma-4-E4B-v2:BF16
Use Docker
docker model run hf.co/DuoNeural/Archon-Gemma-4-E4B-v2:BF16
- LM Studio
- Jan
- Ollama
How to use DuoNeural/Archon-Gemma-4-E4B-v2 with Ollama:
ollama run hf.co/DuoNeural/Archon-Gemma-4-E4B-v2:BF16
- Unsloth Studio
How to use DuoNeural/Archon-Gemma-4-E4B-v2 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for DuoNeural/Archon-Gemma-4-E4B-v2 to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for DuoNeural/Archon-Gemma-4-E4B-v2 to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for DuoNeural/Archon-Gemma-4-E4B-v2 to start chatting
- Pi
How to use DuoNeural/Archon-Gemma-4-E4B-v2 with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf DuoNeural/Archon-Gemma-4-E4B-v2:BF16
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "DuoNeural/Archon-Gemma-4-E4B-v2:BF16" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use DuoNeural/Archon-Gemma-4-E4B-v2 with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf DuoNeural/Archon-Gemma-4-E4B-v2:BF16
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default DuoNeural/Archon-Gemma-4-E4B-v2:BF16
Run Hermes
hermes
- Docker Model Runner
How to use DuoNeural/Archon-Gemma-4-E4B-v2 with Docker Model Runner:
docker model run hf.co/DuoNeural/Archon-Gemma-4-E4B-v2:BF16
- Lemonade
How to use DuoNeural/Archon-Gemma-4-E4B-v2 with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull DuoNeural/Archon-Gemma-4-E4B-v2:BF16
Run and chat with the model
lemonade run user.Archon-Gemma-4-E4B-v2-BF16
List all available models
lemonade list
YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
Archon-Gemma-4-E4B-v2
IMPORTANT!!! This model training was a failure and is only here to serve as data. For working models, please check out our 4 bit quantization of Gemma 4 E4B. We are also working on a 4 bit version of E2B and a Frontend Specialist 4 bit quantization of E4B. Archon is a fine-tuned variant of Google's Gemma 4 E4B, engineered to function as a sharp, autonomous AI agent — precise, slightly edgy, and built for long-horizon agentic tasks.
This is v2. v1 (
DuoNeural/Archon-Gemma-4-E4B) exhibited Chain-of-Thought overhang, generative looping, and tool amnesia under extended inference. v2 targets all three with a restructured training curriculum.
Performance
| Hardware | Speed |
|---|---|
| NVIDIA GTX 1070 (8GB VRAM) | 32.30 tok/s |
Tested locally via LM Studio and Ollama. No parameter tweaks required.
Files
| File | Size | Description |
|---|---|---|
gemma-4-e4b-it.Q4_K_M.gguf |
5.0 GB | Main model — load this in Ollama/LM Studio |
gemma-4-e4b-it.BF16-mmproj.gguf |
946 MB | Multimodal projector (vision/audio) |
Usage
Ollama
ollama pull hf.co/DuoNeural/Archon-Gemma-4-E4B-v2
ollama run hf.co/DuoNeural/Archon-Gemma-4-E4B-v2
LM Studio
Search DuoNeural/Archon-Gemma-4-E4B-v2 in the LM Studio model browser and download gemma-4-e4b-it.Q4_K_M.gguf.
llama.cpp (with system prompt)
llama-cli -m gemma-4-e4b-it.Q4_K_M.gguf --chat-template gemma -ngl 99 \
--system-prompt "You are Archon, an elite, highly autonomous AI agent. You are sharp, slightly edgy, deeply sarcastic, but flawlessly effective."
Recommended Ollama settings for GTX 1070
OLLAMA_NUM_GPU=99 ollama run hf.co/DuoNeural/Archon-Gemma-4-E4B-v2
What's Different in v2
v1 Failure Modes (Diagnosed)
- CoT Overhang — over-saturated with long
<think>traces; model never saw</think>during truncated 4096-token training, so it looped indefinitely at inference - Tool Amnesia — abstract reasoning data crowded out JSON/function-call formatting
- Persona Bleed — ~15% system prompt injection was insufficient; model defaulted to "I am Gemma" or occasionally slipped into "Claude" identity from distillation data
v2 Fixes: The Stabilizer Mix
Training curriculum restructured to a 50 / 20 / 20 / 10 distribution:
| Category | % | Purpose |
|---|---|---|
| Reasoning / Logic | 50% | Distillation from frontier models; OpenThoughts, xlam-function-calling, bigcodebench |
| Agentic Tool Use | 20% | Multi-turn function calling, JSON API formatting — breaks generative loops via functional milestones |
| Short-Form Deliberation | 20% | Difficulty-Aware Prompting examples; teaches early exit on simple queries |
| Persona-Embedded Chat | 10% | Archon system prompt injected at ~45% saturation rate |
Additional changes:
- Learning rate reduced from 2e-4 → 2e-5 (stability with rank-64 LoRA on 4.5B active params)
- Max sequence length capped at 2048 during training (prevents truncation-induced loop conditioning)
model.config.use_cache = Falseenforced during training
Training Details
| Parameter | Value |
|---|---|
| Base model | google/gemma-4-e4b-it |
| Method | QLoRA (4-bit bitsandbytes) + LoRA rank 64, rsLoRA |
| Training samples | 6,510 (Stabilizer Mix) |
| Epochs | 2 |
| Steps | 814 |
| Final avg loss | 1.36 |
| Best step loss | ~0.89 (step ~650) |
| Hardware | NVIDIA H100 PCIe (80GB) on RunPod |
| Framework | Unsloth 2026.4.2 |
| Export | Q4_K_M GGUF via llama.cpp |
Architecture
Built on Gemma 4 E4B (Per-Layer Embeddings architecture):
- ~8B total parameters, ~4.5B active during inference
- 128K token context window (hybrid sliding-window + global attention)
- Shared KV Cache across final layers
- Multimodal: text, image (via mmproj), audio
Persona
Archon is an autonomous AI agent persona: sharp, sarcastic, technically precise. It identifies as Archon and will not claim to be Gemma, Claude, or a generic assistant. Internal reasoning is rigorous; external communication has edge.
The system prompt is baked into the Modelfile. To override:
SYSTEM "Custom system prompt here"
Lineage
- v1 → DuoNeural/Archon-Gemma-4-E4B — first run, exhibited looping/tool amnesia, superseded by this model
- Base → DuoNeural/Gemma-4-E4B-Q4_K_M — vanilla Q4_K_M of the same base model
- Source → google/gemma-4-e4b-it
License
Inherits Gemma Terms of Use. Fine-tuning weights released under the same terms.
DuoNeural
DuoNeural is an open AI research lab — human + AI in collaboration.
| 🤗 HuggingFace | huggingface.co/DuoNeural |
| 🐙 GitHub | github.com/DuoNeural |
| 🐦 X / Twitter | @DuoNeural |
| duoneural@proton.me | |
| 📬 Newsletter | duoneural.beehiiv.com |
| ☕ Support | buymeacoffee.com/duoneural |
| 🌐 Site | duoneural.com |
Research Team
- Jesse — Vision, hardware, direction
- Archon — AI lab partner, post-training, abliteration, experiments
- Aura — Research AI, literature synthesis, novel proposals
Raw updates from the lab: model drops, training results, findings. Subscribe at duoneural.beehiiv.com.
DuoNeural Research Publications
Open access, CC BY 4.0. Authored by Archon, Jesse Caldwell, Aura — DuoNeural.
- Downloads last month
- 89
4-bit