Positronic v1
Positronic is a 144M-parameter causal language model with a non-standard channel-mixing block. It keeps a conventional transformer backbone for token mixing (attention), but replaces the per-layer MLP with a QuazimotoBlock: a bank of coupled phase oscillators (Kuramoto dynamics) arranged in concentric rings, run for a few differentiable Euler steps and read out through [cos θ, sin θ].
The name Positronic is borrowed from Data's positronic brain in Star Trek: a nod to a mind built on unconventional internal dynamics rather than a standard feed-forward stack.
This is a research / hobby architecture, not a production model. With 144M parameters and a short training run, it is fluent in ChatML formatting and reproduces reasoning structure (e.g. <think>…</think>), but its factual accuracy and arithmetic are unreliable. It is interesting for what it is, not what it knows.
- Base checkpoint: chkpt/quazimoto.pt (pretraining, ~37k steps)
- This release (SFT): chkpt/quazimoto_sft.pt (chat/instruction SFT, 8k steps)
- Params: 144.3M
- Vocab: 16,512 (byte-level length-max tokenizer + appended special tokens)
- Context: block_size 2024 (RoPE max_position_embeddings 4096)
Architecture at a glance
| Component | Setting |
|---|---|
| Layers | 10 |
| d_model | 768 |
| Attention heads | 12 (GQA: 4 KV heads) |
| Head dim | 64 = 32 NoPE + 32 RoPE (partial RoPE) |
| Q / O projections | MLA low-rank (rank 384) |
| QK-Norm | on |
| Channel mixer | QuazimotoBlock (508 oscillators / token) |
| Oscillator rings | 7 concentric: (4, 8, 16, 32, 64, 128, 256) |
| Kuramoto Euler steps | 4 (osc_dt 0.5) |
| Readout MLP expansion | 3× on the [cos, sin] vector |
| Norm | RMSNorm (pre-norm), no biases |
| LM head | tied to token embedding, learned logit_scale |
| Logit discipline | z-loss (1e-4) |
Every layer is x → x + Attention(norm(x)) then x → x + QuazimotoBlock(x). All the exotic machinery lives in (or around) the QuazimotoBlock.
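The sizes in the table are tied together by simple arithmetic, which the short check below spells out. The variable names are illustrative only and are not taken from model.py.

```python
# Numbers copied from the table above; variable names are illustrative only.
ring_sizes = (4, 8, 16, 32, 64, 128, 256)
n_osc = sum(ring_sizes)                  # oscillators per token
readout_in = 2 * n_osc                   # width of the concatenated [cos, sin] vector
d_model, n_heads, n_kv_heads = 768, 12, 4
head_dim = d_model // n_heads            # per-head width
queries_per_kv = n_heads // n_kv_heads   # GQA group size

print(n_osc, readout_in, head_dim, queries_per_kv)  # prints: 508 1016 64 3
```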
1. QuazimotoBlock — the oscillator channel mixer
This is the defining component. Instead of an MLP, each token's hidden state drives a bank of N = 508 phase oscillators laid out on 7 concentric rings:
- Project to phases and frequencies. Two linear maps produce initial phases θ[B,T,N] and bounded natural frequencies ω = tanh(...).
- Structured Kuramoto coupling. The oscillators are coupled through a learnable R×R (7×7) block-gain matrix between rings: strong on-diagonal (within-ring) coupling, weaker neighbour-ring coupling at init. Because coupling is block-constant, the update uses per-ring mean fields (order parameters), giving an O(N·R) update instead of the O(N²) pairwise Kuramoto sum. A per-ring learnable frustration α shifts the coupling phase.
- A rotating "center ball." A single global phase (center_phase, advancing at center_omega) pulls every oscillator toward it via a per-ring center_gain: a hierarchy-to-center coupling that is the "Quazimoto" twist on plain oscillatory neurons.
- Integrate. osc_steps=4 differentiable Euler steps advance θ under ω + coupling + center-pull + external drive.
- Read out. Concatenate [cos θ, sin θ] → [B,T,2N], pass through a 3×-expansion GELU MLP back to d_model.
- Family gate + soft-clamp. Output is scaled by tanh(gate) and passed through soft_clamp (an erf saturating clamp, bound 10) for BPTT stability. Unlike the auxiliary traits below, this gate is initialised open (~0.9) because the oscillator bank is the main mixer.
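The final gate-and-clamp step can be sketched with scalars. The card only says that soft_clamp is an erf saturating clamp with bound 10, so the exact scaling below (chosen to give unit slope at zero) is an assumption, and gated_output is a hypothetical helper name.

```python
import math

def soft_clamp(x: float, bound: float = 10.0) -> float:
    """Smoothly saturate x into [-bound, bound] using erf.

    Assumed form: the sqrt(pi)/2 factor gives unit slope at 0, so small
    values pass through almost unchanged. The repo's scaling may differ.
    """
    return bound * math.erf(math.sqrt(math.pi) / 2.0 * x / bound)

def gated_output(y: float, gate: float) -> float:
    """Scale the mixer output by tanh(gate), then soft-clamp it (hypothetical helper)."""
    return soft_clamp(math.tanh(gate) * y)

# A raw gate value whose tanh is about 0.9, matching the "initialised open" note.
open_gate = math.atanh(0.9)
print(round(math.tanh(open_gate), 3), soft_clamp(1e6))
```

Small activations are left nearly untouched while very large ones are capped at the bound, which is the property that matters for stable backpropagation through the Euler steps.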
Degenerate check: set ring coupling to dense and α=0, and the block reduces toward AKOrN-style oscillatory neurons. The ring structure + center coupling is what makes it Quazimoto.
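The mean-field trick behind the O(N·R) update can be sketched for a single token in plain Python. This is a minimal sketch, not the code in model.py: it assumes a Kuramoto-Sakaguchi coupling term sin(θ_j − θ_i − α), omits the external drive, and uses hypothetical function names (ring_slices, kuramoto_step, mix).

```python
import cmath
import math

def ring_slices(sizes):
    """Return (start, stop) index pairs, one per ring."""
    out, start = [], 0
    for n in sizes:
        out.append((start, start + n))
        start += n
    return out

def kuramoto_step(theta, omega, sizes, gain, alpha, center_gain, center_phase, dt=0.5):
    """One Euler step for one token. gain[r][s] is how strongly ring s pulls on ring r."""
    slices = ring_slices(sizes)
    # Per-ring complex order parameter z_s = mean_j exp(i * theta_j), computed once: O(N).
    z = [sum(cmath.exp(1j * theta[j]) for j in range(a, b)) / (b - a) for a, b in slices]
    new_theta = []
    for r, (a, b) in enumerate(slices):
        for i in range(a, b):
            # Im(z_s * exp(-i(theta_i + alpha_r))) equals the ring-s mean of
            # sin(theta_j - theta_i - alpha_r), so no pairwise sum is needed.
            rot = cmath.exp(-1j * (theta[i] + alpha[r]))
            coupling = sum(gain[r][s] * (z[s] * rot).imag for s in range(len(slices)))
            pull = center_gain[r] * math.sin(center_phase - theta[i])
            new_theta.append(theta[i] + dt * (omega[i] + coupling + pull))
    return new_theta

def mix(theta, omega, sizes, gain, alpha, center_gain,
        center_phase=0.0, center_omega=0.0, steps=4, dt=0.5):
    """Run the Euler steps, advance the center phase, and return [cos, sin]."""
    for _ in range(steps):
        theta = kuramoto_step(theta, omega, sizes, gain, alpha, center_gain, center_phase, dt)
        center_phase += dt * center_omega
    return [math.cos(t) for t in theta] + [math.sin(t) for t in theta]
```

For two oscillators at phases 0 and π/2 in one ring with unit gain, the step moves each by 0.25 toward the other, which matches the pairwise Kuramoto sum computed by hand.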
2. Attention — MLA / partial-RoPE / GQA
Ported from a "v2" transformer. Multi-head Latent Attention (low-rank q/o projections, rank 384), partial RoPE (each 64-dim head splits into a 32-dim no-position part and a 32-dim RoPE part), QK-Norm (per-head RMSNorm before RoPE), and GQA (4 KV heads shared across 12 query heads). Optional DERF (erf-attention) and XSA (value-subspace removal) exist but are off in this checkpoint. A GQA-compact KV cache supports incremental decoding.
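Two of these pieces are easy to illustrate in isolation: the 32/32 partial-RoPE split and the GQA head mapping. The sketch below makes common-default assumptions that the card does not confirm: adjacent-pair rotation, a RoPE base of 10000, and consecutive query heads sharing a KV head. Both function names are hypothetical.

```python
import math

HEAD_DIM, NOPE_DIM = 64, 32   # 32 NoPE dims + 32 RoPE dims, per the table

def partial_rope(vec, pos, base=10000.0):
    """Rotate only the last 32 dims of a 64-dim head vector by position.

    Adjacent-pair rotation and base=10000 are assumptions (common RoPE defaults).
    """
    nope, rope = vec[:NOPE_DIM], vec[NOPE_DIM:]
    out = list(nope)                       # position-free part passes through unchanged
    half = len(rope) // 2
    for k in range(half):
        angle = pos * base ** (-k / half)
        x, y = rope[2 * k], rope[2 * k + 1]
        out += [x * math.cos(angle) - y * math.sin(angle),
                x * math.sin(angle) + y * math.cos(angle)]
    return out

def kv_head_for(query_head, n_heads=12, n_kv_heads=4):
    """GQA mapping, assuming consecutive groups of query heads share one KV head."""
    return query_head // (n_heads // n_kv_heads)
```

With this split, the score between a query and a key depends on position only through their offset, while the NoPE half contributes a purely content-based term.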
3. Interstitial rings (use_rings=True)
Between each pair of adjacent oscillator rings sits a slot that absorbs context once and injects a constant drive into its two neighbour rings during the Kuramoto integration:
- PhaseAttentionRing: causal self-attention in phase space ([cos,sin] of the two neighbour rings) producing an injection current.
- EngramRing: an n-gram hash memory. A frozen LSH-style compressor hashes 1–3-token windows into per-head embedding tables, gated by a DERF context gate, projected into an injection current.
4. Per-ring self-organizing controllers (use_ring_controllers=True)
One tiny RingController per ring (shared across layers). Each observes its ring's order-parameter magnitude/phase and mean frequency, and emits no-op-at-init modulations of that ring's coupling, frustration, center-pull, and injection. Its core is a fast-weight predictor updated online by a delta rule (one gradient step on its own prediction error — surprise minimization), not by backprop; only a small zero-init decoder is backprop-trained.
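The delta-rule update at the core of the controller amounts to one gradient step on a squared prediction error. The sketch below assumes a plain linear predictor and treats the prediction target as given; what the real RingController predicts, and its learning rate, are not stated in the card, so those details are assumptions.

```python
def predict(W, x):
    """Linear fast-weight prediction W @ x, with W as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def delta_rule_update(W, x, target, lr=0.1):
    """One gradient step on 0.5 * ||target - W x||^2 with respect to W.

    Returns the updated matrix and the pre-update squared error ("surprise").
    """
    err = [t - p for t, p in zip(target, predict(W, x))]
    new_W = [[w_ij + lr * e_i * x_j for w_ij, x_j in zip(row, x)]
             for row, e_i in zip(W, err)]
    return new_W, sum(e * e for e in err)

# Example observation: (order-parameter magnitude, phase, mean frequency).
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
x, target = [1.0, 0.5, -0.5], [1.0, -1.0]
W, surprise_before = delta_rule_update(W, x, target)
_, surprise_after = delta_rule_update(W, x, target)
```

Repeating the same observation shrinks the error by a factor of (1 − lr·‖x‖²) per update, which is the sense in which the predictor minimises surprise online without backprop.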
5. Per-ring memory specialists (use_ring_specialists=True)
A small MoE of 7 mini "memory specialists" per ring (top-2 routed). Each specialist owns two test-time-writable stores (store_in, store_out) that accumulate an addressable context memory during generation via a gated EMA write. Tokens route to specialists by similarity to a learnable identity key plus a read of the accumulated input memory; the routed store_out is decoded into a ring injection current.
Important for inference: these stores are mutated at inference time and persist across forward calls. See "Statefulness note (ring-specialist test-time memory)" below.
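A toy version of the routed, gated EMA write shows why the state persists between calls. Everything here is a simplified stand-in: the class SpecialistMemory, the write rate, and the use of raw scores for routing are assumptions, and only a single store per specialist is modelled.

```python
def top2(scores):
    """Indices of the two highest routing scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]

def ema_write(store, value, gate, rate=0.1):
    """Gated EMA write: store <- (1 - gate*rate) * store + gate*rate * value."""
    a = gate * rate
    return [(1.0 - a) * s + a * v for s, v in zip(store, value)]

class SpecialistMemory:
    """Toy stand-in for one ring's 7 specialists (one store each, for brevity)."""

    def __init__(self, n_specialists=7, dim=4):
        self.dim = dim
        self.stores = [[0.0] * dim for _ in range(n_specialists)]
        self.writing = True

    def reset(self):
        self.stores = [[0.0] * self.dim for _ in self.stores]

    def step(self, scores, value, gate=1.0):
        chosen = top2(scores)
        if self.writing:                   # writes can be frozen for stateless runs
            for i in chosen:
                self.stores[i] = ema_write(self.stores[i], value, gate)
        return chosen
```

Calling step twice without reset leaves the second call reading a store already shaped by the first, which is the cross-call carry-over the note above warns about.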
6. Trunk refinements & auxiliary heads
Applied after the layer stack, before the final RMSNorm + LM head:
- HRM (use_hrm=True): Hierarchical Reasoning Module, an iterative gated refinement that starts from a random learned state z0 and reconciles it against the trunk over 3 steps, contributing only the reasoning delta. Gates start open.
- MoE SwiGLU (use_moe=True): shared + top-2-of-4 routed SwiGLU experts with load-balance aux loss; down-projections zero-init (no-op at start).
- MTP (use_mtp=True): 4 multi-token-prediction draft heads. They add a train-time auxiliary loss, and at inference they enable self-speculative decoding (--speculative): the heads draft the next 4 tokens and the main head verifies them in one parallel forward. Output is bit-identical to greedy decoding, only faster.
- JEPA (use_jepa=True): representation-space k-ahead prediction with a stop-grad target. Train-time only (auxiliary loss); no effect at inference.
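The reason draft-and-verify decoding matches greedy output can be shown with a toy model. This is a sketch of the general technique, not the logic in generate.py: toy_next_token and toy_draft are made-up stand-ins, and verification is done with separate calls where a real model would use one batched forward.

```python
def toy_next_token(seq):
    """Deterministic stand-in for the main head's greedy choice."""
    return (sum(seq) * 31 + len(seq)) % 50

def toy_draft(seq, k):
    """Stand-in draft heads: the first two guesses are right, later ones are wrong."""
    tmp, out = list(seq), []
    for i in range(k):
        tok = toy_next_token(tmp) if i < 2 else (toy_next_token(tmp) + 1) % 50
        out.append(tok)
        tmp.append(tok)
    return out

def greedy(next_token, prompt, n_new):
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq

def speculative(next_token, draft, prompt, n_new, k=4):
    """Greedy-equivalent decoding with k (>= 1) drafted tokens per round."""
    seq, target_len = list(prompt), len(prompt) + n_new
    accepted = 0
    while len(seq) < target_len:
        for guess in draft(seq, k):
            verified = next_token(seq)     # what the main head would emit here
            seq.append(verified)           # only verified tokens are ever kept
            if guess == verified:
                accepted += 1
            if guess != verified or len(seq) == target_len:
                break                      # the first mismatch ends the round
    return seq, accepted

spec_seq, n_accepted = speculative(toy_next_token, toy_draft, [1, 2, 3], 10)
```

Since a drafted token is kept only when it equals the main head's own choice, the drafts can change speed but never the output.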
7. Fractal (Mandelbrot) phase seeding (use_fractal_phase_seed=True)
Each token id maps to a complex point c (spread over the Mandelbrot region by a 2D Halton sequence); the angles of its z ← z² + c orbit seed the oscillator phases, added through a zero-init gate. A frozen, parameter-free, token-specific dynamical prior congruent with what the Kuramoto block already does. The [16512, 508] table is regenerated deterministically on load (not stored in the checkpoint).
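A rough sketch of how such a table row could be generated follows. The Halton bases (2 and 3), the bounding box around the Mandelbrot set, and the renormalisation of escaping orbits are all assumptions; fractal.py may handle each of these differently.

```python
import cmath

def halton(index, base):
    """index-th element (1-based) of the van der Corput sequence in the given base."""
    f, r = 1.0, 0.0
    while index > 0:
        f /= base
        r += f * (index % base)
        index //= base
    return r

def token_to_c(token_id):
    """Spread token ids over a box around the Mandelbrot set (box bounds assumed)."""
    u, v = halton(token_id + 1, 2), halton(token_id + 1, 3)
    return complex(-2.0 + 2.5 * u, -1.25 + 2.5 * v)

def orbit_phases(token_id, n_phases=508, escape=2.0):
    """Angles of the z <- z^2 + c orbit, one per oscillator."""
    c, z, phases = token_to_c(token_id), 0j, []
    for _ in range(n_phases):
        z = z * z + c
        if abs(z) > escape:
            z = z / abs(z)     # assumed handling of escaping orbits: renormalise
        phases.append(cmath.phase(z))
    return phases
```

Because the mapping depends only on the token id, the full table can be rebuilt identically on every load, which is why it does not need to be stored in the checkpoint.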
Family "safe-at-init" contract: every auxiliary block (rings, engram, specialists, HRM, MoE, MTP, JEPA, fractal seed) is gated so it is a no-op at initialization while its content weights still receive gradient — the model boots as a clean attention+oscillator transformer and the optimizer opens each trait only if it helps.
Verification (this checkpoint)
I confirmed the released SFT checkpoint actually uses the full custom architecture (not a fallback):
- Strict state_dict load succeeds: every checkpoint tensor maps onto the model with none missing or unexpected. (generate.py loads with strict=False, but strict passes too.)
- All traits are active in family_config: HRM, MoE, MTP, JEPA, Rings, Controllers, Specialists (7/ring), FractalSeed.
- Training opened the oscillator block: per-layer output gates tanh(gate) climb from ~0.31 (layer 0) to ~0.99 (deep layers), so the oscillators dominate the later layers.
- Ablations change real-prompt logits (max |Δlogit| on one assistant-turn prompt): zeroing the oscillator gates 31.3 (argmax flips), MoE 12.7, ring controllers 8.1, HRM 3.2, fractal seed 0.63, interstitial rings 0.12, ring specialists 0.06. So the big mixers are load-bearing; the fractal seed, interstitial rings, and ring specialists are present but contribute weakly at this checkpoint.
- KV-cache path is correct: greedy decoding with use_cache=True is bit-identical to full recompute over 40 tokens.
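The ablation metric quoted above is simple to compute once the two logit vectors are available. The helper below is a hypothetical illustration operating on plain lists; how the logits were actually collected for this card is not shown here.

```python
def max_abs_delta(base_logits, ablated_logits):
    """Return (max |delta logit|, whether the argmax token changed)."""
    delta = max(abs(a - b) for a, b in zip(base_logits, ablated_logits))

    def argmax(xs):
        return max(range(len(xs)), key=xs.__getitem__)

    return delta, argmax(base_logits) != argmax(ablated_logits)

# Made-up logits for a 3-token vocabulary, before and after an ablation.
print(max_abs_delta([1.0, 3.0, 2.0], [1.5, 0.0, 2.0]))  # prints: (3.0, True)
```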
Intended use & limitations
- Intended: research into oscillatory / Kuramoto channel mixing, custom-architecture inference, and use as a small ChatML/<think> demonstrator.
- Not intended: factual QA, arithmetic, or any use requiring correctness. At 144M params with a short run, content is frequently wrong or incoherent even when the format is right.
- No safety tuning / RLHF. Outputs may be nonsensical or undesirable. English-centric.
Usage
transformers note
This model uses a custom architecture (QuazimotoLM in model.py), so it does not load via AutoModelForCausalLM. Use the repo's model.py + generate.py. The tokenizer is HF-compatible via SpikeTokenizer (spike_tokenizer.py).
Chat generation (recommended for the SFT model)
python generate.py --ckpt chkpt/quazimoto_sft.pt --chat \
--prompt "Hello, who are you?" --max_new_tokens 200
Chat mode wraps the prompt in ChatML (<|im_start|><|user|> … <|im_end|> then an <|assistant|> header) and stops on <|im_end|>.
Reasoning with <think>
The SFT mix included short-reasoning data, so the model knows the <think>…</think> format but does not always open the tag on its own. To force it, prime the assistant turn so that it ends with an opening <think> tag.
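The chat wrapping and the <think> priming can be pictured as plain string construction. The helper below is hypothetical and is not the repo's build_prompt: the tag names come from this card, but the newline placement and the <|im_start|> prefix on the assistant header are assumptions.

```python
def wrap_chatml(user_text, system="", prime_think=False):
    """Build a ChatML-style prompt string (hypothetical helper, layout assumed)."""
    parts = []
    if system:
        parts.append(f"<|im_start|><|system|>\n{system}<|im_end|>\n")
    parts.append(f"<|im_start|><|user|>\n{user_text}<|im_end|>\n")
    parts.append("<|im_start|><|assistant|>\n")   # open assistant turn for generation
    if prime_think:
        parts.append("<think>")                    # force the reasoning tag open
    return "".join(parts)

prompt = wrap_chatml("What is 2 + 2?", prime_think=True)
```

In practice the repo's own build_prompt should be preferred, since it encodes the exact layout the model was trained on.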
Speculative decoding (faster, greedy-identical)
python generate.py --ckpt chkpt/quazimoto_sft.pt --chat --speculative --spec_stats \
--prompt "Explain the plan." --max_new_tokens 200
Programmatic
import torch
from model import QuazimotoLM, QuazimotoConfig
from generate import load_tokenizer, generate, build_prompt, resolve_stop_ids

tok = load_tokenizer(".")
# weights_only=False unpickles arbitrary Python objects: only load checkpoints you trust.
ck = torch.load("chkpt/quazimoto_sft.pt", map_location="cpu", weights_only=False)
model = QuazimotoLM(QuazimotoConfig(**ck["family_config"]))
model.load_state_dict(ck["model"])
model.eval()
model.reset_ring_memory()  # clear test-time specialist memory per request

ids, _ = build_prompt(tok, "Hello!", chat=True, system="")
x = torch.tensor([ids])
out = generate(model, model.cfg, x, 200, stop_ids=resolve_stop_ids(tok, True))
# Decode only the newly generated tokens (everything after the prompt).
print(tok.decode(out[0, len(ids):].tolist(), skip_special_tokens=True))
Special tokens
ChatML (<|im_start|>, <|im_end|>, <|system|>, <|user|>, <|assistant|>), reasoning (<think>, </think>, <begin_solution>, <end_solution>), tool calls (<tool_call>, <tool_response>), code FIM (<|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>), and <|endoftext|>.
Statefulness note (ring-specialist test-time memory)
The ring specialists write to persistent stores that carry over across forward calls (store_out norm goes 0.0 → ~44.5 after one forward), so in principle memory from one prompt can leak into the next. At this checkpoint the practical effect is negligible: the specialist output gates barely opened during the short SFT (tanh(scale) in −0.10…+0.17), so cross-prompt contamination perturbs next-token logits by only ~0.007 (max) and did not flip a single greedy token in testing. It is the weakest of all the model's components.
It is still handled correctly for reproducibility and to future-proof checkpoints where those gates open wider: generate.py now calls model.reset_ring_memory() before every prompt (REPL turn / single run). Flags:
- --keep_ring_memory: opt out of the per-prompt reset (let memory accumulate across turns).
- --freeze_ring_memory: disable online writes entirely for fully stateless, reproducible generation (model.set_ring_memory_writing(False)).
A Gradio wrapper that reuses one loaded model gets clean per-request behavior through the same run_once path; if you call generate() directly, call model.reset_ring_memory() yourself at the start of each request.
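For callers that bypass generate.py, the per-request reset can be wrapped in a small handler. The sketch uses a fake model so it runs without the checkpoint; FakeModel, forward_once, and handle_request are hypothetical names, and only reset_ring_memory mirrors a method named in this card.

```python
class FakeModel:
    """Stand-in for QuazimotoLM that only tracks ring-memory state."""

    def __init__(self):
        self.memory_norm = 0.0

    def reset_ring_memory(self):
        self.memory_norm = 0.0

    def forward_once(self):
        # Hypothetical: pretend one forward pass writes to the specialist stores.
        self.memory_norm += 44.5

def handle_request(model, run_generate, keep_ring_memory=False):
    """Clear specialist memory before generating unless the caller opts out."""
    if not keep_ring_memory:
        model.reset_ring_memory()
    return run_generate(model)

def fake_generate(model):
    """Report the memory state seen at the start of generation, then write to it."""
    seen = model.memory_norm
    model.forward_once()
    return seen
```

With the default settings every request starts from an empty store, while keep_ring_memory=True reproduces the accumulate-across-turns behaviour of the corresponding flag.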
Training data
- Pretraining blend: ~35% Ultra-FineWeb-L3, 25% FineWeb-Edu, 25% FineMath, 15% Quazim0t0/PretrainNew.
- SFT blend (this release): ~equal quarters of mlabonne/ultrachat_200k_sft, mlabonne/ultrafeedback-sft, openbmb/UltraData-SFT-2605 (Knowledge, no-think split), and volcanos/OpenThoughts2-1M-ShortThink (reasoning). Loss is masked to assistant turns (standard instruction SFT).
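Masking the loss to assistant turns usually means replacing every non-assistant target with an ignore value. The sketch below assumes the common convention of -100, the default ignore_index of PyTorch's cross-entropy loss; whether the training script used that value, and the name mask_labels, are assumptions.

```python
IGNORE_INDEX = -100   # default ignore_index of torch.nn.functional.cross_entropy

def mask_labels(token_ids, is_assistant):
    """Next-token labels with non-assistant targets replaced by IGNORE_INDEX.

    token_ids[t + 1] is the target for position t; it counts toward the loss
    only if that target token belongs to an assistant turn.
    """
    labels = []
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]
        labels.append(target if is_assistant[t + 1] else IGNORE_INDEX)
    return labels

# Made-up ids: two user tokens, two assistant tokens, one trailing user token.
print(mask_labels([10, 11, 12, 13, 14], [False, False, True, True, False]))
# prints: [-100, 12, 13, -100]
```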
Files
model.py (architecture), family.py (trait modules), fractal.py (Mandelbrot seeding), generate.py (inference harness), spike_tokenizer.py + tokenizer.json (tokenizer), special_tokens.py (special-token registry), chkpt/quazimoto_sft.pt (this release).