Atem-Pharaoh-3B
Ancient logic. Modern intelligence.
The 3B chain-of-thought model — explicit reasoning traces at scale.
Overview
Atem-Pharaoh-3B is the Stage 2 release of the 3B Atem series — a chain-of-thought fine-tune built on top of Atem-3B, trained to produce explicit <think>...</think> reasoning traces before arriving at a final answer. Where Atem-3B was trained to answer directly, Pharaoh is trained to think out loud.
Training used approximately 38,000 examples drawn from a pool of ~63,500 CoT-annotated records across mathematics, code, science, and general reasoning. A deliberate 75%/25% think/no-think split was applied — the model was trained on structured reasoning traces for the majority of examples and direct answers for the remainder, ensuring it can operate in both modes depending on how it is prompted.
Design note: Atem-Pharaoh-3B has a confirmed tendency toward verbose outputs and, on open-ended questions with many valid answers, occasional think trace runaways. Custom system prompts are strongly recommended to control verbosity, chain-of-thought depth, and output length. See the Prompting Guidance section below.
The Atem Series
1.5B Series
| Model | Stage | Capability |
|---|---|---|
| Atem v1 | Stage 1 — SFT | Fast, direct reasoning |
| Atem-Wisdom | Stage 2 — CoT | Explicit thinking traces |
| Atem-Pharaoh-1.5B (planned) | Stage 3 — DPO/IPO | Preference-aligned reasoning |
3B Series
| Model | Stage | Capability |
|---|---|---|
| Atem-3B | Stage 1 — SFT | Direct reasoning at 3B scale |
| Atem-Pharaoh-3B | Stage 2 — CoT | Explicit reasoning traces at 3B scale |
| Atem-Pharaoh-3B-DPO (planned) | Stage 3 — DPO/IPO | Preference-aligned reasoning |
Model Details
| Property | Value |
|---|---|
| Base model | EphAsad/Atem-3B |
| Training method | LoRA SFT — Stage 2 (CoT think traces) |
| LoRA config | r=32, alpha=64, dropout=0.05 |
| Parameters | ~3.09B |
| Trainable parameters | 59,867,136 (1.90%) |
| Training records | 38,157 (after token length filtering) |
| Think / no-think split | 75% / 25% |
| Epochs | 2 |
| Final val loss | 0.9494 |
| Hardware | NVIDIA A100-SXM4-80GB |
| Max sequence length | 4,096 tokens |
| Precision | bfloat16 |
| License | Apache 2.0 |
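The trainable-parameter figure in the table can be cross-checked from the LoRA rank. The sketch below assumes the published Qwen2.5-3B layer shapes (hidden size 2048, MLP size 11008, 256-dimensional key/value projections, 36 layers) and assumes LoRA adapters on all seven attention and MLP projections. Neither detail is stated in this card, so read it as a plausibility check rather than the actual training configuration.

```python
# Assumed Qwen2.5-3B shapes (not stated in this card).
HIDDEN = 2048
KV_DIM = 256      # assumed: 2 key/value heads x 128 head dim
MLP = 11008
LAYERS = 36
RANK = 32         # r from the LoRA config row

# (in_features, out_features) for each assumed LoRA target module.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted weight."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
trainable = per_layer * LAYERS

# ~3.09B is the approximate base size from the table; the share is
# taken over base + adapter parameters.
base = 3.09e9
share = 100 * trainable / (base + trainable)

print(f"{trainable:,} trainable ({share:.2f}%)")
# prints: 59,867,136 trainable (1.90%)
```

Under these assumptions the count matches the table exactly, which suggests (but does not prove) that all linear projections were adapted.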
Output Format
Atem-Pharaoh-3B produces responses in one of two formats depending on the prompt and training signal:
Think mode (75% of training):
<think>
{step-by-step reasoning trace}
</think>
{final answer}
Direct mode (25% of training):
{direct answer — no think tags}
The model defaults to think mode for most queries. To reliably suppress or encourage CoT, use a custom system prompt (see below).
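Downstream code usually needs the final answer separated from the reasoning trace. The following is a minimal sketch of such a parser; `split_think` is a hypothetical helper name, and it assumes the `<think>` tags survive decoding as plain text in the response string.

```python
import re

# Non-greedy match so only the first think block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(response: str) -> tuple[str, str]:
    """Return (trace, answer); trace is "" for direct-mode responses."""
    match = THINK_RE.search(response)
    if match is None:
        return "", response.strip()
    trace = match.group(1).strip()
    answer = response[match.end():].strip()
    return trace, answer


think_mode = "<think>\n2 + 2 = 4\n</think>\nThe answer is 4."
direct_mode = "Paris."

print(split_think(think_mode))   # ('2 + 2 = 4', 'The answer is 4.')
print(split_think(direct_mode))  # ('', 'Paris.')
```

Returning an empty trace for direct mode lets callers treat both formats uniformly.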
Prompting Guidance
Atem-Pharaoh-3B follows system-prompt instructions. The default identity is baked into the chat template and produces think traces on most inputs. For deployments where verbosity, output length, or CoT depth needs controlling, the following prompt patterns are recommended.
Suppress CoT — direct answers only
You are Atem, a precise and analytical assistant. Respond directly and concisely.
Do not show internal reasoning. Answer the question and stop.
Calibrate length to question complexity
You are Atem, a precise and analytical assistant. Match your response length to
the complexity of the question — a single sentence for simple questions, full
reasoning for complex ones. Do not over-explain.
Full CoT — maximise reasoning depth
You are Atem, a precise and analytical assistant. Think through every problem
step by step before answering. Show your full reasoning inside <think> tags,
then give your final answer.
Cap think trace length
You are Atem, a precise and analytical assistant. When you reason through a
problem, keep your thinking concise — aim for no more than 150 words inside
<think> tags. Then give a clear, direct final answer.
Without a custom prompt, the model will use the default identity and tend toward longer, more structured outputs. On open-ended questions with many valid answers, this can result in extended reasoning traces. Prompting with an explicit length or format constraint reliably corrects this.
Training Data
Stage 2 training used approximately 38,000 examples after token-length filtering, drawn from a pool of ~63,500 CoT-annotated records. Chinese-language reasoning traces from Kimi K2.5 were filtered using an ASCII character ratio threshold before inclusion; non-English traces were downgraded to the no-think pool rather than discarded entirely. OpenR1-Math examples were filtered to correctness_llama == True only.
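The ASCII-ratio filter described above could look roughly like the sketch below. The threshold value and the names `ascii_ratio`, `route_record`, and `ASCII_THRESHOLD` are illustrative assumptions; the card does not state the threshold that was actually used.

```python
ASCII_THRESHOLD = 0.9  # hypothetical value, not taken from the card


def ascii_ratio(text: str) -> float:
    """Fraction of characters in text that are ASCII."""
    if not text:
        return 1.0  # an empty trace has no non-ASCII characters
    return sum(ch.isascii() for ch in text) / len(text)


def route_record(trace: str, threshold: float = ASCII_THRESHOLD) -> str:
    """Keep mostly-ASCII traces as think examples; downgrade the rest."""
    return "think" if ascii_ratio(trace) >= threshold else "no_think"


print(route_record("First, compute the area of the circle."))  # think
print(route_record("首先，我们计算圆的面积。"))                    # no_think
```

A character-ratio test is cheap and needs no language-detection dependency, at the cost of misrouting English traces that are heavy in non-ASCII math symbols.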
The think/no-think split was enforced programmatically: after all datasets were loaded into a think pool and a no-think pool, records were flipped from think→no-think until the no-think pool reached 25% of the total corpus.
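The flipping step can be sketched as follows. This is an illustration under stated assumptions, not the training script: the record layout (a `trace` field set to `None` for direct answers), the random choice of which records to flip, and the name `rebalance` are all hypothetical.

```python
import random


def rebalance(think_pool, no_think_pool, no_think_share=0.25, seed=0):
    """Flip think records to no-think until the target share is reached."""
    think = list(think_pool)
    no_think = list(no_think_pool)
    total = len(think) + len(no_think)  # flipping does not change the total

    rng = random.Random(seed)
    rng.shuffle(think)  # assumption: flipped records are chosen at random

    while think and len(no_think) < no_think_share * total:
        record = dict(think.pop())  # copy so the input pools stay untouched
        record["trace"] = None      # drop the reasoning, keep the answer
        no_think.append(record)
    return think, no_think


think_pool = [{"id": i, "trace": "reasoning", "answer": "a"} for i in range(90)]
no_think_pool = [{"id": 100 + i, "trace": None, "answer": "a"} for i in range(10)]

think, no_think = rebalance(think_pool, no_think_pool)
print(len(think), len(no_think))  # 75 25
```

Because records move between pools rather than being dropped, the corpus size stays fixed while the ratio converges on the target.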
| Dataset | Count | Type |
|---|---|---|
| Modotte/CodeX-2M-Thinking | 10,000 | Code CoT |
| nvidia/OpenCodeReasoning | 10,000 | Code reasoning |
| Jackrong/Kimi-K2.5 (×3 configs) | 15,000 | General / Math / PhD reasoning |
| mitroitskii/OpenR1-Math-220k-formatted | 7,000 | Mathematics (correctness filter) |
| Jackrong/Claude-opus-4.6-TraceInversion-9000x | 7,000 | Inverted reasoning traces |
| trjxter/DeepSeek-V4-Pro-Reasoning-8000x | 8,014 | Reasoning distillation |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | Mixed reasoning |
| FreedomIntelligence/medical-o1-reasoning-SFT | 3,000 | Medical reasoning |
Loss curve:
| Step | Train Loss | Val Loss |
|---|---|---|
| 250 | 1.0215 | 0.9931 |
| 500 | 0.9615 | 0.9663 |
| 750 | 0.9516 | 0.9556 |
| 1000 | 0.9425 | 0.9502 |
| 1194 (final) | 0.9897 | 0.9494 |
Training loss falls steadily through step 1000, and validation loss keeps falling through the final step. The train-loss uptick at step 1194 (0.9425 to 0.9897) is treated here as normal end-of-epoch behaviour on a cosine schedule.
Evaluation
A/B Comparison — Atem-Pharaoh-3B vs Qwen2.5-3B-Instruct
Evaluated on 30 questions calibrated to 3B model capability across coding, mathematics, analytical reasoning, and language tasks. Both models ran on identical prompts with no system prompt override.
| Metric | Base (Qwen2.5-3B-Instruct) | Atem-Pharaoh-3B |
|---|---|---|
| Think traces | 0 / 30 | 30 / 30 |
| Avg response length | 152 words | 427 words |
Qualitative findings:
Coding tasks (is_even, count_vowels, list vs tuple, find_max, for vs while): Atem-Pharaoh-3B consistently correct with additional edge case handling and alternative approaches in the trace. Base model answers are correct but minimal.
Mathematical tasks: Both models correct. Pharaoh's traces show full working.
Analytical tasks (student score, shop visitors, correlation/causation, hiring/queuing): Pharaoh produces richer, more structured responses with clearer explanations. The queuing theory response (Q16) demonstrates genuine reasoning depth with well-constructed analogies.
Language tasks: Both models perform comparably. Pharaoh tends toward over-structuring simple tasks.
Known limitations observed in evaluation:
Think trace runaways: On open-ended questions where valid answers are unbounded, the think trace can degenerate into extended enumeration rather than converging on an answer. This was observed on Q27 (sentence ambiguity) in this evaluation and is consistent with behaviour observed in separate testing. The final answer typically recovers correctly, but the trace itself becomes incoherent. Custom system prompts with explicit trace length constraints are the recommended mitigation (see Prompting Guidance).
Verbosity mismatch: Response length does not scale to question complexity. Simple questions receive the same structural treatment as complex ones. A system prompt instructing the model to match length to complexity resolves this reliably.
Occasional tag artifacts: A small number of responses produced nested <think><think> opening tags. This is a minor formatting artifact with no effect on answer quality.
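For pipelines that post-process responses, two of these limitations can be handled mechanically. The sketch below collapses duplicated opening tags and flags traces that exceed a word budget, including traces cut off before the closing tag. The function names are hypothetical, and the 150-word default simply mirrors the cap suggested in Prompting Guidance.

```python
import re


def normalize_think_tags(response: str) -> str:
    """Collapse repeated opening <think> tags into a single tag."""
    return re.sub(r"(?:<think>\s*){2,}", "<think>\n", response)


def think_trace(response: str) -> str:
    """Return the trace text; an unterminated trace runs to the end."""
    cleaned = normalize_think_tags(response)
    start = cleaned.find("<think>")
    if start == -1:
        return ""
    start += len("<think>")
    end = cleaned.find("</think>", start)
    return cleaned[start:] if end == -1 else cleaned[start:end]


def is_runaway(response: str, max_words: int = 150) -> bool:
    """Flag responses whose trace exceeds the word budget."""
    return len(think_trace(response).split()) > max_words


print(normalize_think_tags("<think><think>check parity</think>Even."))
print(is_runaway("<think>short trace</think>Done."))   # False
print(is_runaway("<think>" + "item " * 400))            # True
```

A flagged response can then be retried with a stricter system prompt or a lower `max_new_tokens`.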
Usage
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "EphAsad/Atem-Pharaoh-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
messages = [
    {
        "role": "user",
        "content": "Explain why a binary search is faster than a linear search."
    }
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )
response = tokenizer.decode(
    output[0][inputs.shape[1]:],
    skip_special_tokens=True
)
print(response)
Unsloth (faster inference)
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-Pharaoh-3B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
messages = [
    {
        "role": "user",
        "content": "Write a Python function to check if a number is prime."
    }
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")
with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
Ollama
# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-Pharaoh-3B:Q4_K_M
# Higher quality
ollama run hf.co/EphAsad/Atem-Pharaoh-3B:Q5_K_M
# Near-lossless
ollama run hf.co/EphAsad/Atem-Pharaoh-3B:Q8_0
llama.cpp
llama-server -hf EphAsad/Atem-Pharaoh-3B:Q4_K_M
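Once `llama-server` is running, it exposes an OpenAI-compatible HTTP API. The sketch below calls the chat completions endpoint using only the Python standard library. It assumes the server's default port (8080) on localhost; `BASE_URL`, `build_chat_request`, and `chat` are illustrative names, so adjust them to your setup.

```python
import json
import urllib.request

# Assumes llama-server's default port; change if you pass --port.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "EphAsad/Atem-Pharaoh-3B:Q4_K_M"


def build_chat_request(user_prompt, system_prompt=None, max_tokens=512):
    """Build (but do not send) a chat completions request."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    payload = {"model": MODEL_ID, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_prompt, system_prompt=None):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(user_prompt, system_prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server):
# print(chat("What is 12 squared?", "Respond directly and concisely."))
```

Passing `max_tokens` on every request gives a hard ceiling on output length, which complements the system-prompt controls for trace verbosity.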
Available Files
| File | Size | Description |
|---|---|---|
| model-00001-of-00002.safetensors + model-00002-of-00002.safetensors | ~6.2 GB | Full bfloat16 weights |
| Atem-Pharaoh-3B.Q4_K_M.gguf | ~1.93 GB | 4-bit (recommended) |
| Atem-Pharaoh-3B.Q5_K_M.gguf | ~2.22 GB | 5-bit |
| Atem-Pharaoh-3B.Q8_0.gguf | ~3.29 GB | 8-bit (near-lossless) |
System Prompt
Atem-Pharaoh-3B's identity is baked into the chat template. For production use, override with a custom system prompt tailored to your use case (see Prompting Guidance above). The default identity:
You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.
Roadmap
| Stage | Status | Description |
|---|---|---|
| Stage 1 — SFT | ✅ Complete | Atem-3B — direct reasoning |
| Stage 2 — CoT SFT | ✅ Complete | Atem-Pharaoh-3B — this model |
| Stage 3 — DPO/IPO | 🔄 Planned | Preference-aligned reasoning |
Citation
@misc{atem_pharaoh_3b_2026,
author = {Asad, Zain},
title = {Atem-Pharaoh-3B: Chain-of-Thought Reasoning via Stage 2 CoT SFT},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/EphAsad/Atem-Pharaoh-3B}},
}
License
Released under the Apache 2.0 License, consistent with the base model lineage (Qwen2.5-3B-Instruct → Atem-3B → Atem-Pharaoh-3B).
Built independently by EphAsad
