How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

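llm = Llama.from_pretrained(
	repo_id="EphAsad/Atem-1.7B",
	# Glob matched against the GGUF files in the repo; Q4_K_M is the build
	# listed as recommended under Available Files.
	filename="*Q4_K_M.gguf",
)
response = llm.create_chat_completion(
	messages=[
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
# The reply text sits in the first choice of the OpenAI-style response dict.
print(response["choices"][0]["message"]["content"])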


Atem-1.7B

Ancient logic. Modern intelligence.

A 1.7B reasoning model trained via a single CoT-preserving SFT pass directly on Qwen3-1.7B, distilling multi-domain reasoning capability from frontier teacher models while keeping the base model's native thinking capability intact.



Overview

Atem-1.7B is a 1.7B-parameter reasoning model built with a single supervised fine-tuning pass on the unmodified Qwen3-1.7B base, using the same CoT-preserving single-pass design as Atem-4B and Atem-8B. It is the most compute-efficient model in the Atem series: training completed in under 2.5 hours on an A100-SXM4 80GB. Its LoRA adapters account for 2.95% of the model's parameters, close to the series-wide 3% target.

This model includes GSM8K-format training examples (5K no-think records) to partially restore the #### answer convention that the reasoning corpus otherwise overwrites — an improvement over Atem-4B and Atem-8B, which did not include these.


Model Details

Property | Value
Base model | Qwen/Qwen3-1.7B
Training method | Single-pass CoT-Preserving LoRA SFT
LoRA config | r=48, alpha=96, dropout=0.05
Target modules | q, k, v, o, gate, up, down projections
Parameters | ~1.77B
Trainable (LoRA) params | 52,297,728 (2.95% of base)
Training records | 62,301 (after token-length filtering)
Think / No-think split | 85% / 15%
Epochs | 2 (ceiling; early stopping patience=3, never triggered)
Effective batch size | 64 (batch 16 × grad accum 4)
Learning rate | 1e-4, cosine schedule, 5% warmup
Max sequence length | 6,144 tokens
Precision | bfloat16 (full 16-bit LoRA, not QLoRA)
Hardware | NVIDIA A100-SXM4 80GB
Runtime | 2h28m
License | Apache 2.0

Design Notes

Single combined pass. The same single CoT-preserving pass design used across Atem-4B and Atem-8B — no erase-then-rebuild pipeline. Reasoning capability is built directly on the base model's intact native foundation.

r=48 for proportional capacity. At r=32, the LoRA adapters on a 1.7B model would amount to only 2.05% of the model's parameters, the same shrinking-fraction problem observed across the series as model size grows. Raising the rank to r=48 brings this to 2.95%, close to the series-wide ~3% target.
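The trainable-parameter count can be reproduced from the adapter shapes. The following is a minimal sketch, assuming the publicly listed Qwen3-1.7B dimensions (hidden size 2048, 28 layers, 16 query heads and 8 key/value heads of dimension 128, MLP width 6144). These dimensions are not stated in this card, and the function name is illustrative.

```python
# Assumed Qwen3-1.7B dimensions (not stated in this card).
HIDDEN = 2048
LAYERS = 28
Q_DIM = 16 * 128    # query heads x head dimension
KV_DIM = 8 * 128    # key/value heads x head dimension
MLP = 6144

# (in_features, out_features) of each adapted projection in one layer.
PROJECTIONS = {
    "q": (HIDDEN, Q_DIM),
    "k": (HIDDEN, KV_DIM),
    "v": (HIDDEN, KV_DIM),
    "o": (Q_DIM, HIDDEN),
    "gate": (HIDDEN, MLP),
    "up": (HIDDEN, MLP),
    "down": (MLP, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for every adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * LAYERS


for rank in (32, 48):
    print(f"r={rank}: {lora_param_count(rank):,} trainable parameters")
```

Under these assumptions r=48 yields 52,297,728 parameters, matching the figure in Model Details. The count scales linearly with the rank, so r=32 would yield 34,865,152.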

GSM8K format restoration. The standard Atem training corpus uses \boxed{} notation throughout. Atem-4B and Atem-8B both showed a systematic GSM8K strict-match regression as a result of this format shift. Atem-1.7B is the first in the series to include 5,000 GSM8K-format training examples (from openai/gsm8k) in the no-think pool, partially re-establishing the #### answer convention alongside \boxed{}.

Full 16-bit LoRA. At 1.7B the model weights occupy only ~3.4GB, leaving over 75GB of A100 headroom. Full 16-bit LoRA is used throughout — faster and marginally more accurate than QLoRA without any VRAM constraint.


Intended Use

Atem-1.7B is suited for reasoning tasks on resource-constrained hardware — edge devices, local deployment, and applications where a 4B+ model is impractical:

  • Multi-step mathematical reasoning
  • Code explanation, implementation, and debugging
  • Analytical reasoning across diverse domains
  • Commonsense reasoning and physical intuition
  • Logic and argument evaluation

For higher capability at the cost of resource requirements, Atem-4B and Atem-8B provide progressively stronger results on the same reasoning tasks.


Training Data

Atem-1.7B was trained on the same eight-source reasoning corpus as Atem-4B and Atem-8B, with the addition of 5,000 GSM8K-format records to partially restore the #### answer convention. All sources include explicit chain-of-thought reasoning traces; 85% of training records were formatted with full think traces and 15% as direct answers.

Dataset | Records | Source / Teacher
mitroitskii/OpenR1-Math-220k-formatted | ~10,938 | DeepSeek-R1: Mathematics (correctness-filtered)
Jackrong/Claude-opus-4.6-TraceInversion-9000x | 7,000 | Claude Opus 4.6: Trace Inversion
Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Math) | 8,000 | Kimi K2.5: Mathematical Reasoning
Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Distillation) | 8,000 | Kimi K2.5: General Reasoning
Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (PHD-Science) | 8,000 | Kimi K2.5: Scientific Reasoning
WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | MiniMax M2.7
FreedomIntelligence/medical-o1-reasoning-SFT | 7,500 | Medical reasoning (English config)
Modotte/CodeX-2M-Thinking | 15,000 | Mixed: Coding with CoT
trjxter/DeepSeek-V4-Pro-Reasoning-8000x | ~8,014 | DeepSeek-V4-Pro
nvidia/OpenCodeReasoning | 15,000 | Mixed: Competitive coding
openai/gsm8k (no-think) | 5,000 | GSM8K #### answer format restoration
Total (pre-filter pool) | 96,017 |
Total (post-filter, trained on) | 62,301 |

Non-English reasoning traces (primarily CJK) were filtered at the trace level using an ASCII-ratio threshold and retained as no-think records. The 34.3% filter rate is consistent with Atem-4B (32.7%) and Atem-8B (34.3%) at the same 6,144-token ceiling.
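The trace-level filter described above can be sketched as follows. This is an illustration only: the threshold value, the record layout, and the function names are assumptions, since the card does not give the cut-off used in training.

```python
from typing import Optional


def ascii_ratio(text: str) -> float:
    """Fraction of characters in `text` that are ASCII."""
    if not text:
        return 1.0
    return sum(ch.isascii() for ch in text) / len(text)


# Hypothetical cut-off; the value used for Atem-1.7B is not stated.
ASCII_THRESHOLD = 0.9


def format_record(question: str, trace: str, answer: str) -> dict:
    """Keep the trace if it is mostly ASCII, else emit a no-think record."""
    think: Optional[str] = trace if ascii_ratio(trace) >= ASCII_THRESHOLD else None
    return {"question": question, "think": think, "answer": answer}


english = format_record("2x = 4?", "Divide both sides by 2, so x = 2.", "2")
cjk = format_record("2x = 4?", "两边同时除以二，所以 x 等于二。", "2")
print(english["think"] is not None, cjk["think"] is None)
```

The record is never dropped by this step. Only the reasoning trace is removed, which is why filtered examples still contribute to the no-think pool.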


Training Configuration

# Key hyperparameters
lora_r             = 48
lora_alpha         = 96
lora_dropout       = 0.05
max_seq_length     = 6144
learning_rate      = 1e-4
lr_scheduler       = 'cosine'
warmup_ratio       = 0.05
batch_size         = 16
grad_accumulation  = 4           # effective batch size: 64
num_epochs         = 2           # ceiling — early stopping patience=3
eval_steps         = 150
early_stopping_patience   = 3
early_stopping_threshold  = 0.001
nothink_ratio      = 0.15
load_in_4bit       = False       # full 16-bit LoRA
dtype              = 'bfloat16'
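The step counts implied by these hyperparameters can be checked with a few lines of arithmetic. The sketch below assumes that all 62,301 post-filter records are in the training split and that the final partial batch of each epoch counts as one optimizer step. The rounding of the warmup length is also an assumption.

```python
import math

train_records = 62_301
batch_size = 16
grad_accumulation = 4
num_epochs = 2
warmup_ratio = 0.05
eval_steps = 150

effective_batch = batch_size * grad_accumulation
# Assumption: a trailing partial batch still produces one optimizer step.
steps_per_epoch = math.ceil(train_records / effective_batch)
total_steps = steps_per_epoch * num_epochs
# Assumption: the warmup length is rounded up to a whole step.
warmup_steps = math.ceil(total_steps * warmup_ratio)
# Periodic evaluations, excluding the final end-of-training evaluation.
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

print(effective_batch, steps_per_epoch, total_steps, warmup_steps, len(eval_points))
```

Under these assumptions the run has 1,948 optimizer steps and 12 periodic evaluations (steps 150 to 1,800), which agrees with the final step and the checkpoint count in the loss table.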

Loss Curve

Step | Train Loss | Val Loss
150 | 1.0706 | 1.0833
300 | 1.0385 | 1.0520
450 | 1.0566 | 1.0372
600 | 0.9990 | 1.0255
750 | 1.0082 | 1.0158
900 | 0.9887 | 1.0091
1050 | 0.9294 | 1.0051
1200 | 0.8906 | 1.0020
1350 | 0.9331 | 0.9993
1500 | 0.9780 | 0.9973
1650 | 0.9467 | 0.9963
1800 | 0.9341 | 0.9957
Final (1948) | 0.9902 (avg) | 0.9956

Train loss is noisier than in larger Atem models — characteristic of smaller models with a diverse multi-domain corpus. Validation loss improved monotonically across all 13 checkpoints without exception. Early stopping was configured but never triggered.
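The monotonicity claim can be checked directly against the validation losses in the table. A short sketch, with the values copied from the loss table and the variable names chosen for illustration:

```python
# Validation loss per evaluation step, copied from the loss table.
val_loss = {
    150: 1.0833, 300: 1.0520, 450: 1.0372, 600: 1.0255, 750: 1.0158,
    900: 1.0091, 1050: 1.0051, 1200: 1.0020, 1350: 0.9993, 1500: 0.9973,
    1650: 0.9963, 1800: 0.9957, 1948: 0.9956,
}

steps = sorted(val_loss)
# Change in validation loss between consecutive checkpoints.
deltas = [val_loss[b] - val_loss[a] for a, b in zip(steps, steps[1:])]
strictly_decreasing = all(d < 0 for d in deltas)

print(len(steps), strictly_decreasing)
```

Every consecutive pair of checkpoints shows a lower validation loss, although the improvements shrink steadily toward the end of the run.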


Evaluation

Benchmark Results

Evaluated against base Qwen3-1.7B (Qwen/Qwen3-1.7B) using lm-evaluation-harness. Both models were loaded in 4-bit for evaluation. Statistical significance (σ) is provided as context for interpreting each result — at 1.7B scale, several deltas that appear directionally positive are within sampling noise due to test set size.

Task | Base (Qwen3-1.7B) | Atem-1.7B | Delta | σ
ARC-Challenge (0-shot, acc_norm) | 40.7% | 42.2% | +1.5pp ✓ | 0.7σ
GSM8K strict (5-shot, exact_match) | 62.0% | 58.7% | −3.3pp ⚠ | 1.7σ
HellaSwag (0-shot, acc_norm) | 59.4% | 61.3% | +1.9pp | 2.8σ
MMLU (0-shot, acc) | 55.4% | 56.2% | +0.8pp ✓ | 1.3σ
Winogrande (0-shot, acc) | 61.8% | 61.1% | −0.7pp ⚠ | 0.4σ
PIQA (0-shot, acc) | 71.4% | 71.4% | +0.0pp — | 0.0σ
OpenBookQA (0-shot, acc_norm) | 36.0% | 39.0% | +3.0pp ✓ | 1.0σ
BoolQ (0-shot, acc) | 76.5% | 76.0% | −0.5pp — | 0.5σ

HellaSwag (+1.9pp, 2.8σ) is the only clearly statistically significant positive result. It uses normalised log-likelihood scoring over multiple-choice options — format-independent and not influenced by generation style. This is also the most consistent signal across the full Atem series (1.7B: +1.9pp, 4B: +2.9pp, 8B: +1.7pp), confirming genuine commonsense reasoning transfer from the CoT training corpus.

OpenBookQA (+3.0pp) is directionally strong but the test set is only 500 questions, giving 1.0σ — treat this as encouraging rather than conclusive.

Winogrande (−0.7pp, ⚠) is flagged, but at 0.4σ the change is statistically indistinguishable from noise and is not a meaningful regression.

MMLU (+0.8pp, 1.3σ) is borderline. This is consistent with the series pattern: neither the base nor the fine-tuned model has a knowledge-breadth advantage after CoT training.

Results at 1.7B are generally less pronounced than at 4B and 8B, as expected: smaller models with proportionally larger parameter changes per training step exhibit noisier benchmark behaviour, and the absolute capability headroom above random baselines is narrower.
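The card does not state how the σ values were computed. One standard approach that appears to reproduce them is a two-sample z-score using the binomial standard error of each accuracy. The sketch below makes that assumption. The HellaSwag and GSM8K test-set sizes are also assumptions (the usual evaluation splits), while the OpenBookQA size of 500 comes from the text above.

```python
import math


def delta_sigma(base_acc: float, tuned_acc: float, n: int) -> float:
    """Accuracy difference in units of its standard error (independent runs)."""
    variance = base_acc * (1 - base_acc) / n + tuned_acc * (1 - tuned_acc) / n
    return abs(tuned_acc - base_acc) / math.sqrt(variance)


# (base accuracy, Atem-1.7B accuracy, number of test questions)
results = {
    "HellaSwag": (0.594, 0.613, 10_042),   # n is an assumption
    "GSM8K strict": (0.620, 0.587, 1_319), # n is an assumption
    "OpenBookQA": (0.360, 0.390, 500),     # n as stated in the card
}

for task, (base, tuned, n) in results.items():
    print(f"{task}: {delta_sigma(base, tuned, n):.1f} sigma")
```

The same 3.0pp gain that gives about 1σ on 500 questions would be several σ on a test set the size of HellaSwag, which is why the smaller benchmarks are treated as directional only.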

GSM8K — Formatting Shift

The strict-match regression (−3.3pp) follows the same pattern established at 4B and 8B: the training corpus uses \boxed{} notation, systematically shifting away from the #### format that lm_eval's strict-match extraction expects. At 1.7B the base model scores 62.0% — above the threshold where formatting effects dominate over raw capability gains (the 0.6B base at 26.7% was below this threshold and actually improved on strict-match).

Atem-1.7B is the first model in the series to include GSM8K-format (#### answer) training examples. At 5,000 records out of 62,301 total (8%), this partially offsets the shift but does not eliminate it — larger proportions would be needed for full recovery. Based on the flexible-extraction recovery rate confirmed at 8B (68% of regression recovered), the estimated true capability gap is approximately −1.1pp rather than −3.3pp.
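The two figures in this paragraph follow from simple arithmetic. The sketch below assumes, as the text does, that the 68% flexible-extraction recovery rate measured at 8B carries over to 1.7B.

```python
gsm8k_records = 5_000
total_records = 62_301
strict_delta_pp = -3.3   # strict-match change versus the base model
recovery_rate = 0.68     # measured at 8B; assumed to transfer to 1.7B

# Share of the training set that uses the #### answer format.
share = gsm8k_records / total_records
# Part of the regression that is not explained by answer formatting.
estimated_gap_pp = strict_delta_pp * (1 - recovery_rate)

print(f"{share:.1%} of records, estimated gap {estimated_gap_pp:.1f}pp")
```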


Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-1.7B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Explain why the harmonic mean is used for average speeds rather than the arithmetic mean."
    }
]

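# Build the prompt with the chat template. return_dict=True also returns the
# attention mask, so generate() does not have to infer it.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
        repetition_penalty=1.1,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(
    output[0][prompt_length:],
    skip_special_tokens=True
)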
print(response)

Unsloth (faster inference)

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-1.7B",
    max_seq_length=6144,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "What is the time complexity of merge sort and why?"
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
    )

print(tokenizer.decode(
    output[0][inputs.shape[1]:],
    skip_special_tokens=True
))

Ollama

# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-1.7B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-1.7B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-1.7B:Q8_0

llama.cpp

llama-server -hf EphAsad/Atem-1.7B:Q4_K_M

Sampling Parameters

Use temperature=0.6, top_p=0.95, top_k=20 — Qwen3's published recommendation for thinking mode. Do not use greedy decoding with thinking mode enabled.

System Prompt

Atem-1.7B's identity is baked into the chat template and activates automatically without an explicit system message. For manual override:

You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.
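A manual override amounts to placing a system message ahead of the user turn. A minimal sketch follows, in which the helper name is hypothetical and the prompt string is shortened to its first sentence; paste the full text above for real use.

```python
from typing import Dict, List, Optional

# Shortened for brevity; use the full system prompt text from this section.
SYSTEM_PROMPT = "You are Atem, a precise and analytical reasoning assistant."


def build_messages(
    user_prompt: str, system_prompt: Optional[str] = SYSTEM_PROMPT
) -> List[Dict[str, str]]:
    """Return a chat message list, with an optional explicit system message."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


with_override = build_messages("What is the time complexity of merge sort?")
default_identity = build_messages("What is the time complexity of merge sort?", None)
print(len(with_override), len(default_identity))
```

Passing `None` leaves out the system message, so the identity baked into the chat template applies. The resulting list can be passed to `tokenizer.apply_chat_template` or `create_chat_completion` as in the usage examples.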

Available Files

File | Size | Description
model.safetensors | 3.44 GB | Full bfloat16 merged weights (single shard)
Atem-1.7b.Q4_K_M.gguf | 1.11 GB | 4-bit quantised, recommended
Atem-1.7b.Q5_K_M.gguf | 1.26 GB | 5-bit quantised
Atem-1.7b.Q8_0.gguf | 1.83 GB | 8-bit quantised, near-lossless

Known Limitations

GSM8K formatting shift. As documented in the evaluation section, the training corpus uses \boxed{} for mathematical answers. Despite the inclusion of 5,000 GSM8K-format examples, the strict-match regression persists at −3.3pp. The estimated true capability gap under flexible extraction is approximately −1.1pp. Future runs with a higher proportion of GSM8K-format examples would reduce this further.

Statistical modesty at 1.7B. Most benchmark deltas at this scale are within sampling noise — HellaSwag is the exception (2.8σ). This is expected: 1.7B models have narrower performance headroom and proportionally larger variance per benchmark question. The reasoning improvements are real but harder to detect reliably at smaller scale.

6,144 token sequence ceiling. The longest reasoning traces (advanced mathematics, competitive programming) were dropped during formatting. The model has not been trained on very long chain-of-thought traces.

No RLHF or DPO. Atem-1.7B has not undergone preference optimisation.


Roadmap

  • Atem-14B: Single CoT-preserving pass on Qwen3-14B, r=128 (3.10% proportional capacity), with expanded GSM8K-format and camel-ai/chemistry additions to the corpus

Citation

@misc{atem_1b7_2026,
  author       = {Asad, Zain},
  title        = {Atem-1.7B: A 1.7B CoT-Preserving Reasoning Model via
                  Single-Pass SFT on Qwen3},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Atem-1.7B}},
}

License

Released under the Apache 2.0 License, consistent with the base model Qwen/Qwen3-1.7B.


Built independently by Zain Asad — EphAsad
