Positronic v1

Positronic is a 144M-parameter causal language model with a non-standard channel-mixing block. It keeps a conventional transformer backbone for token mixing (attention), but replaces the per-layer MLP with a QuazimotoBlock: a bank of coupled phase oscillators (Kuramoto dynamics) arranged in concentric rings, run for a few differentiable Euler steps and read out through [cos θ, sin θ].

The name Positronic comes from Data's positronic brain in Star Trek: a nod to a mind built on unconventional internal dynamics rather than a standard feed-forward stack.

This is a research / hobby architecture, not a production model. With 144M parameters and only a short training run, it is fluent in ChatML formatting and reproduces reasoning structure (e.g. <think>…</think>), but its factual accuracy and arithmetic are unreliable. It is interesting for what it is, not what it knows.

Base checkpoint: chkpt/quazimoto.pt (pretraining, ~37k steps)
This release (SFT): chkpt/quazimoto_sft.pt (chat/instruction SFT, 8k steps)
Params: 144.3M
Vocab: 16,512 (byte-level length-max tokenizer + appended special tokens)
Context: block_size 2024 (RoPE max_position_embeddings 4096)

Architecture at a glance

| Component | Setting |
| --- | --- |
| Layers | 10 |
| `d_model` | 768 |
| Attention heads | 12 (GQA: 4 KV heads) |
| Head dim | 64 = 32 NoPE + 32 RoPE (partial RoPE) |
| Q / O projections | MLA low-rank (rank 384) |
| QK-Norm | on |
| Channel mixer | QuazimotoBlock (508 oscillators / token) |
| Oscillator rings | 7 concentric: (4, 8, 16, 32, 64, 128, 256) |
| Kuramoto Euler steps | 4 (`osc_dt` 0.5) |
| Readout MLP expansion | 3× on the `[cos, sin]` vector |
| Norm | RMSNorm (pre-norm), no biases |
| LM head | tied to token embedding, learned `logit_scale` |
| Logit discipline | z-loss (`1e-4`) |

Every layer is x → x + Attention(norm(x)) then x → x + QuazimotoBlock(x). All the exotic machinery lives in (or around) the QuazimotoBlock.
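
The figures in the table above are mutually consistent, which a few lines of arithmetic can confirm. This is only a cross-check on the published numbers; the variable names are illustrative and are not taken from model.py.

```python
# Figures copied from the "Architecture at a glance" table.
d_model = 768
n_heads, n_kv_heads = 12, 4
head_dim_nope, head_dim_rope = 32, 32
rings = (4, 8, 16, 32, 64, 128, 256)

n_osc = sum(rings)                         # oscillators per token
head_dim = head_dim_nope + head_dim_rope   # per-head width
readout_in = 2 * n_osc                     # width of the [cos θ, sin θ] vector
q_per_kv = n_heads // n_kv_heads           # query heads sharing one KV head

print(n_osc, head_dim, n_heads * head_dim, readout_in, q_per_kv)
# → 508 64 768 1016 3
```

So the seven rings add up to the stated 508 oscillators, the 12 heads of width 64 fill `d_model` exactly, and each KV head serves 3 query heads.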

1. QuazimotoBlock — the oscillator channel mixer

This is the defining component. Instead of an MLP, each token's hidden state drives a bank of N = 508 phase oscillators laid out on 7 concentric rings:

Project to phases and frequencies. Two linear maps produce initial phases θ [B,T,N] and bounded natural frequencies ω = tanh(...).
Structured Kuramoto coupling. The oscillators are coupled through a learnable R×R (7×7) block-gain matrix between rings — strong on-diagonal (within-ring) coupling, weaker neighbour-ring coupling at init. Because coupling is block-constant, the update uses per-ring mean fields (order parameters), giving an O(N·R) update instead of the O(N²) pairwise Kuramoto sum. A per-ring learnable frustration α shifts the coupling phase.
A rotating "center ball." A single global phase (center_phase, advancing at center_omega) pulls every oscillator toward it via a per-ring center_gain — a hierarchy-to-center coupling that is the "Quazimoto" twist on plain oscillatory neurons.
Integrate. osc_steps=4 differentiable Euler steps advance θ under ω + coupling + center-pull + external drive.
Read out. Concatenate [cos θ, sin θ] → [B,T,2N], pass through a 3×-expansion GELU MLP back to d_model.
Family gate + soft-clamp. Output is scaled by tanh(gate) and passed through soft_clamp (an erf saturating clamp, bound 10) for BPTT stability. Unlike the auxiliary traits below, this gate is initialised open (~0.9) because the oscillator bank is the main mixer.
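
The steps above can be sketched in NumPy. This is a minimal illustration, not the code in model.py: the mean-field normalisation (mean rather than sum per ring), the sign conventions, advancing the center phase once per Euler step, and all names (`kuramoto_step`, `MEMBER`, `center_gain`, ...) are assumptions. The external drive, readout MLP, gate and soft-clamp are left out.

```python
import numpy as np

RINGS = (4, 8, 16, 32, 64, 128, 256)                 # oscillators per ring
SIZES = np.array(RINGS, dtype=float)
ring_id = np.repeat(np.arange(len(RINGS)), RINGS)    # [N] ring index of each oscillator
MEMBER = np.eye(len(RINGS))[ring_id]                 # [N, R] one-hot ring membership

def kuramoto_step(theta, omega, K, alpha, center_gain, center_phase, dt=0.5):
    """One Euler step using per-ring mean fields, O(N*R) instead of O(N^2)."""
    z = (np.exp(1j * theta) @ MEMBER) / SIZES        # [..., R] ring order parameters
    field = z @ K.T                                  # [..., R] gain-weighted field per ring
    # Im(field_r * exp(-i(theta_i + alpha_r))) equals
    # sum_s K[r, s] * mean_{j in s} sin(theta_j - theta_i - alpha_r).
    coupling = np.imag(field[..., ring_id] * np.exp(-1j * (theta + alpha[ring_id])))
    center = center_gain[ring_id] * np.sin(center_phase - theta)
    return theta + dt * (omega + coupling + center)

rng = np.random.default_rng(0)
R, N = len(RINGS), len(ring_id)
theta = rng.uniform(-np.pi, np.pi, size=(2, 3, N))   # [B, T, N] initial phases
omega = np.tanh(rng.normal(size=(2, 3, N)))          # bounded natural frequencies
K = np.eye(R) + 0.25 * (np.eye(R, k=1) + np.eye(R, k=-1))  # strong within-ring, weak neighbours
alpha = np.zeros(R)
center_gain = np.full(R, 0.1)
center_phase, center_omega = 0.0, 0.3

for _ in range(4):                                   # osc_steps = 4
    theta = kuramoto_step(theta, omega, K, alpha, center_gain, center_phase)
    center_phase += 0.5 * center_omega               # assumed: center advances each step

readout = np.concatenate([np.cos(theta), np.sin(theta)], axis=-1)
print(readout.shape)                                 # (2, 3, 1016)
```

Because the gain is constant within each ring pair, the seven complex order parameters carry everything an oscillator needs to know about the other 507, which is where the O(N·R) cost comes from.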

Degenerate check: set ring coupling to dense and α=0 and the block reduces toward AKOrN-style oscillatory neurons. The ring structure + center coupling is what makes it Quazimoto.

2. Attention — MLA / partial-RoPE / GQA

Ported from a "v2" transformer. Multi-head Latent Attention (low-rank q/o projections, rank 384), partial RoPE (each 64-dim head splits into a 32-dim no-position part and a 32-dim RoPE part), QK-Norm (per-head RMSNorm before RoPE), and GQA (4 KV heads shared across 12 query heads). Optional DERF (erf-attention) and XSA (value-subspace removal) exist but are off in this checkpoint. A GQA-compact KV cache supports incremental decoding.
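
To make the partial-RoPE split concrete, here is a small NumPy sketch for a single head. It assumes the RoPE channels are the last 32 of the 64, a half-split pairing of channels, and a frequency base of 10000; none of these details are stated in this README, and `partial_rope` is a hypothetical name.

```python
import numpy as np

def partial_rope(x, positions, rope_dim=32, base=10000.0):
    """Rotate only the last `rope_dim` channels of x ([T, head_dim]) by position."""
    nope, rot = x[..., :-rope_dim], x[..., -rope_dim:]
    half = rope_dim // 2
    inv_freq = base ** (-np.arange(half) / half)       # [half] rotation frequencies
    ang = positions[:, None] * inv_freq[None, :]       # [T, half] rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    a, b = rot[..., :half], rot[..., half:]            # paired channels
    rotated = np.concatenate([a * cos - b * sin, a * sin + b * cos], axis=-1)
    return np.concatenate([nope, rotated], axis=-1)    # NoPE part passes through untouched

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 64))
k = rng.normal(size=(1, 64))

def score(q_pos, k_pos):
    """Attention logit between q placed at q_pos and k placed at k_pos."""
    qr = partial_rope(q, np.array([q_pos]))
    kr = partial_rope(k, np.array([k_pos]))
    return float(qr[0] @ kr[0])

# Same relative offset (2) at different absolute positions gives the same logit.
print(abs(score(3, 1) - score(10, 8)) < 1e-9)
```

The unrotated half contributes a position-independent content term to each logit, while the rotated half contributes a term that depends only on the relative offset.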

3. Interstitial rings (`use_rings=True`)

Between each pair of adjacent oscillator rings sits a slot that absorbs context once and injects a constant drive into its two neighbour rings during the Kuramoto integration:

PhaseAttentionRing — causal self-attention in phase space ([cos,sin] of the two neighbour rings) producing an injection current.
EngramRing — an n-gram hash memory: a frozen LSH-style compressor hashes 1–3-token windows into per-head embedding tables, gated by a DERF context gate, projected into an injection current.
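
As a rough picture of the EngramRing lookup, the sketch below maps the 1- to 3-token windows ending at each position to rows of per-head tables. A plain multiplicative hash is swapped in for the frozen LSH-style compressor, and the function name, table size, head count and multipliers are all hypothetical; the DERF gate and the projection to an injection current are omitted.

```python
def ngram_slots(token_ids, n_heads=4, table_size=1024, max_n=3):
    """slots[t][h][n-1] = table row for the n-gram ending at position t, head h."""
    modulus = 2**61 - 1
    slots = []
    for t in range(len(token_ids)):
        per_head = []
        for h in range(n_heads):
            multiplier = 1_000_003 + 2 * h               # a different hash per head
            rows = []
            for n in range(1, max_n + 1):
                # Only the current and earlier tokens are visible (causal window).
                window = token_ids[max(0, t - n + 1): t + 1]
                acc = h + 1
                for tok in window:
                    acc = (acc * multiplier + tok + 1) % modulus
                rows.append(acc % table_size)
            per_head.append(rows)
        slots.append(per_head)
    return slots

ids = [5, 9, 2, 5, 9, 2]
slots = ngram_slots(ids)
# The trigram (5, 9, 2) ends at positions 2 and 5 and lands in the same row.
print(slots[2][0][2] == slots[5][0][2])
```

Repeated n-grams therefore address the same embedding rows wherever they occur, which is what lets such a table behave like a memory keyed on local context.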

4. Per-ring self-organizing controllers (`use_ring_controllers=True`)

One tiny RingController per ring (shared across layers). Each observes its ring's order-parameter magnitude/phase and mean frequency, and emits no-op-at-init modulations of that ring's coupling, frustration, center-pull, and injection. Its core is a fast-weight predictor updated online by a delta rule (one gradient step on its own prediction error — surprise minimization), not by backprop; only a small zero-init decoder is backprop-trained.
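
The online update can be illustrated with a linear fast-weight predictor. This is a sketch under assumptions: the three input features, the output size, the learning rate and the name `delta_rule_step` are illustrative, and the real controller's predictor and decoder may be shaped differently.

```python
import numpy as np

def delta_rule_step(W, x, target, lr=0.5):
    """One gradient step on 0.5 * ||target - W @ x||^2 with respect to W."""
    err = target - W @ x                      # prediction error ("surprise")
    return W + lr * np.outer(err, x), err

# Hypothetical ring observation: order-parameter magnitude, phase, mean frequency.
x = np.array([0.8, 0.3, -0.1])
target = np.array([0.5, -0.2])                # what the predictor should have output
W = np.zeros((2, 3))                          # fast weights, changed online, not by backprop

W, err_before = delta_rule_step(W, x, target)
err_after = target - W @ x
print(np.linalg.norm(err_after) < np.linalg.norm(err_before))
```

For this linear case the error on the same input shrinks by exactly the factor `1 - lr * (x @ x)` per step, so the update is stable as long as `lr * (x @ x)` stays below 2.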

5. Per-ring memory specialists (`use_ring_specialists=True`)

A small MoE of 7 mini "memory specialists" per ring (top-2 routed). Each specialist owns two test-time-writable stores (store_in, store_out) that accumulate an addressable context memory during generation via a gated EMA write. Tokens route to specialists by similarity to a learnable identity key plus a read of the accumulated input memory; the routed store_out is decoded into a ring injection current.

Important for inference: these stores are mutated at inference time and persist across forward calls. See "Statefulness note" below.
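
A gated EMA write of the kind described above might look like the following. The exact write rule, the decay value and the class name `SpecialistStore` are assumptions made for illustration; routing and the decode of store_out into an injection current are not shown.

```python
import numpy as np

class SpecialistStore:
    """Toy test-time-writable memory with a gated exponential-moving-average write."""

    def __init__(self, dim, decay=0.9):
        self.decay = decay
        self.store = np.zeros(dim)
        self.writing = True

    def write(self, value, gate):
        """store <- (1 - rate) * store + rate * value, with rate = gate * (1 - decay)."""
        if self.writing:
            rate = gate * (1.0 - self.decay)
            self.store = (1.0 - rate) * self.store + rate * value

    def reset(self):
        self.store = np.zeros_like(self.store)

store = SpecialistStore(dim=4)
store.write(np.ones(4), gate=1.0)     # a forward pass leaves a trace in the store
carried = store.store.copy()          # this is what a later call would read back
store.reset()                         # clearing restores the empty state
print(np.linalg.norm(carried), np.linalg.norm(store.store))
```

The point of the sketch is only that the store is ordinary mutable state: unless it is cleared or writes are disabled, whatever one call wrote is still there for the next.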

6. Trunk refinements & auxiliary heads

Applied after the layer stack, before the final RMSNorm + LM head:

HRM (use_hrm=True): Hierarchical Reasoning Module, an iterative gated refinement that starts from a learned initial state z0 and reconciles it against the trunk over 3 steps, contributing only the reasoning delta. Gates start closed, consistent with the safe-at-init contract described below.
MoE SwiGLU (use_moe=True): shared + top-2-of-4 routed SwiGLU experts with load-balance aux loss; down-projections zero-init (no-op at start).
MTP (use_mtp=True): 4 multi-token-prediction draft heads. They add a train-time auxiliary loss, and at inference they enable self-speculative decoding (--speculative): the heads draft the next 4 tokens and the main head verifies them in one parallel forward pass. Output is bit-identical to greedy decoding, only faster.
JEPA (use_jepa=True): representation-space k-ahead prediction with a stop-grad target. Train-time only (auxiliary loss); no effect at inference.
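
Why self-speculative decoding can match greedy decoding exactly is easiest to see on a toy model. In the sketch below, `main_next` and `noisy_draft` are stand-ins invented for the example (they are not functions from generate.py), and the verification loop simulates what the real model would obtain from a single parallel forward pass.

```python
def greedy(next_token, prompt, n_new):
    """Reference: one main-head call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq

def speculative(next_token, draft, prompt, n_new, k=4):
    """Draft k tokens, keep the main head's choice at each verified position."""
    seq = list(prompt)
    target_len = len(prompt) + n_new
    while len(seq) < target_len:
        for proposed in draft(seq, k):
            expected = next_token(seq)     # the main head's greedy token here
            seq.append(expected)           # always commit the main head's token
            if proposed != expected or len(seq) == target_len:
                break                      # first mismatch ends the accepted run
    return seq

def main_next(seq):
    """Toy deterministic 'main head'."""
    return (seq[-1] * 7 + len(seq)) % 50

def noisy_draft(seq, k):
    """Toy draft heads: correct except that the third guess is always wrong."""
    out, ctx = [], list(seq)
    for i in range(k):
        tok = main_next(ctx)
        if i == 2:
            tok = (tok + 1) % 50
        out.append(tok)
        ctx.append(tok)
    return out

print(speculative(main_next, noisy_draft, [1, 2, 3], 20) == greedy(main_next, [1, 2, 3], 20))
# → True
```

Every committed token is the main head's own greedy choice, so draft quality affects only how many tokens are accepted per verification pass, never the output.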

7. Fractal (Mandelbrot) phase seeding (`use_fractal_phase_seed=True`)

Each token id maps to a complex point c (spread over the Mandelbrot region by a 2D Halton sequence); the angles of its z ← z² + c orbit seed the oscillator phases, added through a zero-init gate. A frozen, parameter-free, token-specific dynamical prior congruent with what the Kuramoto block already does. The [16512, 508] table is regenerated deterministically on load (not stored in the checkpoint).
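
A deterministic seeding of this kind can be sketched with the standard library alone. The rectangle used for the Mandelbrot region, the Halton bases (2 and 3), the restart rule for escaping orbits and the name `seed_phases` are assumptions; fractal.py may differ on each of these points.

```python
import cmath

def halton(index, base):
    """index-th element (1-based) of the van der Corput sequence in `base`."""
    f, r = 1.0, 0.0
    while index > 0:
        f /= base
        r += f * (index % base)
        index //= base
    return r

def seed_phases(token_id, n_osc=508, re_range=(-2.0, 0.5), im_range=(-1.25, 1.25)):
    """Angles of the z <- z^2 + c orbit for the point c assigned to token_id."""
    u, v = halton(token_id + 1, 2), halton(token_id + 1, 3)   # 2D Halton point
    c = complex(re_range[0] + u * (re_range[1] - re_range[0]),
                im_range[0] + v * (im_range[1] - im_range[0]))
    z, phases = 0j, []
    for _ in range(n_osc):
        z = z * z + c
        if abs(z) > 2.0:          # assumed: restart escaped orbits to keep values finite
            z = c
        phases.append(cmath.phase(z))
    return phases

phases = seed_phases(42)
print(len(phases), seed_phases(42) == phases)
# → 508 True
```

Since nothing here is learned or random, the same table can be rebuilt on load from the vocabulary size and oscillator count alone, which is why it does not need to be stored in the checkpoint.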

Family "safe-at-init" contract: every auxiliary block (rings, engram, specialists, HRM, MoE, MTP, JEPA, fractal seed) is gated so it is a no-op at initialization while its content weights still receive gradient — the model boots as a clean attention+oscillator transformer and the optimizer opens each trait only if it helps.

Verification (this checkpoint)

I confirmed the released SFT checkpoint actually uses the full custom architecture (not a fallback):

Strict state_dict load succeeds — every checkpoint tensor maps onto the model with none missing or unexpected. (generate.py loads with strict=False, but strict passes too.)
All traits are active in family_config: HRM, MoE, MTP, JEPA, Rings, Controllers, Specialists(7/ring), FractalSeed.
Training opened the oscillator block: per-layer output gates tanh(gate) climb from ~0.31 (layer 0) to ~0.99 (deep layers) — the oscillators dominate the later layers.
Ablations change real-prompt logits (max |Δlogit| on one assistant-turn prompt): zeroing the oscillator gates 31.3 (argmax flips), MoE 12.7, ring controllers 8.1, HRM 3.2, fractal seed 0.63, interstitial rings 0.12, ring specialists 0.06. So the big mixers are load-bearing; the fractal seed, interstitial rings, and ring specialists are present but contribute weakly at this checkpoint.
KV-cache path is correct: greedy decoding with use_cache=True is bit-identical to full recompute over 40 tokens.
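
For reference, the ablation metric quoted above (max |Δlogit| plus whether the argmax flips) is simple to compute from two logit vectors for the same position. The helper below is a sketch with made-up numbers; obtaining the two vectors from the model with and without a trait is left to the reader.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)

def ablation_effect(base_logits, ablated_logits):
    """Return (max absolute logit change, whether the greedy token changed)."""
    max_delta = max(abs(a - b) for a, b in zip(base_logits, ablated_logits))
    return max_delta, argmax(base_logits) != argmax(ablated_logits)

# Illustrative values only, not measurements from this checkpoint.
delta, flipped = ablation_effect([1.0, 3.0, 2.0], [1.5, 0.0, 2.0])
print(delta, flipped)
# → 3.0 True
```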

Intended use & limitations

Intended: research into oscillatory / Kuramoto channel mixing, custom-architecture inference, and as a small ChatML/<think> demonstrator.
Not intended: factual QA, arithmetic, or any use requiring correctness. At 144M params with a short run, content is frequently wrong or incoherent even when the format is right.
No safety tuning / RLHF. Outputs may be nonsensical or undesirable. English-centric.

Usage

`transformers` note

This model uses a custom architecture (QuazimotoLM in model.py), so it does not load via AutoModelForCausalLM. Use the repo's model.py + generate.py. The tokenizer is HF-compatible via SpikeTokenizer (spike_tokenizer.py).

Chat generation (recommended for the SFT model)

```bash
python generate.py --ckpt chkpt/quazimoto_sft.pt --chat \
    --prompt "Hello, who are you?" --max_new_tokens 200
```

Reasoning with `<think>`

The SFT mix included short-reasoning data, so the model knows the <think>…</think> format but does not always open the tag on its own. To force it, prime the assistant turn so that it ends in <think>.
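
At the string level, priming amounts to appending the opening tag to the already-formatted prompt. The helper `prime_think` is hypothetical, and the ChatML layout in `example` is an assumed illustration; the actual template is whatever build_prompt in generate.py produces.

```python
def prime_think(prompt_text, tag="<think>"):
    """Append the opening reasoning tag unless the prompt already ends with it."""
    if prompt_text.rstrip().endswith(tag):
        return prompt_text
    return prompt_text + tag

# Assumed layout for illustration only; check build_prompt for the real template.
example = "<|im_start|>user\nWhat is 2 + 2?<|im_end|>\n<|im_start|>assistant\n"
primed = prime_think(example)
print(primed.endswith("<think>"))
# → True
```

Generation then continues from inside the reasoning block, so the model's first tokens are reasoning text rather than a direct answer.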

Speculative decoding (faster, greedy-identical)

```bash
python generate.py --ckpt chkpt/quazimoto_sft.pt --chat --speculative --spec_stats \
    --prompt "Explain the plan." --max_new_tokens 200
```

Programmatic

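```python
import torch
from model import QuazimotoLM, QuazimotoConfig
from generate import load_tokenizer, generate, build_prompt, resolve_stop_ids

# Load the tokenizer and the checkpoint from the repo root.
tok = load_tokenizer(".")
# weights_only=False unpickles the whole checkpoint dict; only use it on files you trust.
ck = torch.load("chkpt/quazimoto_sft.pt", map_location="cpu", weights_only=False)

# Rebuild the model from the config stored in the checkpoint, then load the weights.
model = QuazimotoLM(QuazimotoConfig(**ck["family_config"]))
model.load_state_dict(ck["model"])
model.eval()

model.reset_ring_memory()               # clear test-time specialist memory per request
ids, _ = build_prompt(tok, "Hello!", chat=True, system="")
x = torch.tensor([ids])                 # add a batch dimension: [1, T]
out = generate(model, model.cfg, x, 200, stop_ids=resolve_stop_ids(tok, True))
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0, len(ids):].tolist(), skip_special_tokens=True))
```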

Special tokens

ChatML (<|im_start|>, <|im_end|>, <|system|>, <|user|>, <|assistant|>), reasoning (<think>, </think>, <begin_solution>, <end_solution>), tool calls (<tool_call>, <tool_response>), code FIM (<|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>), and <|endoftext|>.

Statefulness note (ring-specialist test-time memory)

The ring specialists write to persistent stores that carry over across forward calls (store_out norm goes 0.0 → ~44.5 after one forward), so in principle memory from one prompt can leak into the next. At this checkpoint the practical effect is negligible: the specialist output gates barely opened during the short SFT (tanh(scale) in −0.10…+0.17), so cross-prompt contamination perturbs next-token logits by only ~0.007 (max) and did not flip a single greedy token in testing. The ring specialists are the weakest of all the model's components at this checkpoint (max |Δlogit| 0.06 in the ablation above).

It is still handled correctly for reproducibility and to future-proof checkpoints where those gates open wider: generate.py now calls model.reset_ring_memory() before every prompt (REPL turn / single run). Flags:

--keep_ring_memory — opt out of the per-prompt reset (let memory accumulate across turns).
--freeze_ring_memory — disable online writes entirely for fully stateless, reproducible generation (model.set_ring_memory_writing(False)).

A Gradio wrapper that reuses one loaded model gets clean per-request behavior through the same run_once path; if you call generate() directly, call model.reset_ring_memory() yourself at the start of each request.
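
For code that calls generate() directly, the per-request reset can be wrapped in one helper. The sketch uses a stub in place of the real model so it runs on its own: `StubModel` and `handle_request` are hypothetical, and only the method names reset_ring_memory and set_ring_memory_writing come from this README.

```python
class StubModel:
    """Stand-in exposing the two memory methods named in this README."""

    def __init__(self):
        self.memory = 0.0
        self.writing = True

    def reset_ring_memory(self):
        self.memory = 0.0

    def set_ring_memory_writing(self, enabled):
        self.writing = enabled

    def forward(self, prompt):
        """Return the memory seen by this call, then write to it if allowed."""
        seen = self.memory
        if self.writing:
            self.memory += len(prompt)
        return seen

def handle_request(model, run, prompt):
    """Clear test-time memory, then run one request on the shared model."""
    model.reset_ring_memory()
    return run(model, prompt)

model = StubModel()
leaky = [model.forward(p) for p in ("abc", "de")]                          # no reset
clean = [handle_request(model, StubModel.forward, p) for p in ("abc", "de")]
print(leaky, clean)
# → [0.0, 3.0] [0.0, 0.0]
```

With the real model, `run` would be a small function that builds the prompt and calls generate(); the important part is that the reset happens once per request, not once per process.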

Training data

Pretraining blend: ~35% Ultra-FineWeb-L3, 25% FineWeb-Edu, 25% FineMath, 15% Quazim0t0/PretrainNew.
SFT blend (this release): ~equal quarters of mlabonne/ultrachat_200k_sft, mlabonne/ultrafeedback-sft, openbmb/UltraData-SFT-2605 (Knowledge, no-think split), and volcanos/OpenThoughts2-1M-ShortThink (reasoning). Loss is masked to assistant turns (standard instruction SFT).

Files

model.py (architecture), family.py (trait modules), fractal.py (Mandelbrot seeding), generate.py (inference harness), spike_tokenizer.py + tokenizer.json (tokenizer), special_tokens.py (special-token registry), chkpt/quazimoto_sft.pt (this release).
