Mycel-LM (79M)

The model will be ungated for open download once I am done with the base model.

Mycel-LM is a 79.2M-parameter research language model whose channel-mixing block is not an MLP. It is a differentiable Neighbour-Sensing fungal-colony-growth model: each token is expanded into a colony of hyphal tips that grow in a bounded latent region, sense a shared density field, and steer their own growth — the "MLP" is replaced by a few differentiable steps of colony growth, read back out into the hidden state.

It is part of a family of models that ask a single question: can the generalizing ability of a transformer be carried by an unusual, self-organizing dynamical system in place of the feed-forward block? Mycel-LM keeps the family's tokenizer, traits, and data fixed and swaps only the mixer, so it is a controlled experiment against the sibling Quazimoto models (whose mixer is a bank of coupled Kuramoto oscillators).

⚠️ Research artifact, not a product. At ~79M parameters it is fluent but small: it models the shape of language well and generates coherent, grammatical text, but it is not factual and will confidently hallucinate. See Limitations.


Highlights

  • Novel mixer. The per-layer feed-forward block is replaced by a MycelBlock — a differentiable simulation of fungal colony growth (Neighbour-Sensing).
  • Self-describing checkpoints. Each .pt embeds a family_config recording the exact geometry, so generate.py / healthcheck.py / visualize.py rebuild the model with no external config.
  • KV cache. Incremental decoding is wired through the whole stack (attention presents are threaded per layer); generate() prefills the prompt once and decodes one token per forward.
  • Self-speculative decoding. Four MTP draft heads propose the next tokens and the main head verifies them in one parallel forward — bit-identical to greedy, just fewer forwards.
  • Live 3-D visualizer. Watch the colony grow token-by-token as a Three.js filament web.
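
The self-speculative bullet above can be illustrated with a toy in plain Python. This is a sketch of the accept-until-first-mismatch rule only, not the code in generate.py: `main_next` and `draft_next` are hypothetical stand-ins for the main head and the MTP draft heads, and verification is simulated position by position instead of in one parallel forward.

```python
def main_next(seq):
    """Hypothetical stand-in for the main head's greedy next token."""
    return (sum(seq) * 31 + len(seq)) % 50

def draft_next(seq, k=4):
    """Hypothetical stand-in for k draft heads: cheap and sometimes wrong."""
    out, cur = [], list(seq)
    for _ in range(k):
        tok = main_next(cur)
        if len(cur) % 3 == 0:          # inject deliberate draft errors
            tok = (tok + 1) % 50
        out.append(tok)
        cur.append(tok)
    return out

def greedy(prompt, n_new):
    """Reference decoder: one main-head call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(main_next(seq))
    return seq

def speculative(prompt, n_new, k=4):
    """Draft k tokens, keep the verified prefix, correct the first mismatch."""
    seq, target, rounds = list(prompt), len(prompt) + n_new, 0
    while len(seq) < target:
        drafts = draft_next(seq, k)
        rounds += 1                    # one verifying pass per round
        for d in drafts:
            correct = main_next(seq)   # what greedy decoding would emit here
            seq.append(correct)        # always keep the main head's token
            if d != correct or len(seq) == target:
                break                  # first mismatch ends the round
    return seq, rounds
```

Because every appended token is the main head's own choice, the output matches `greedy` by construction; the only thing the drafts change is how many verifying rounds are needed.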

Architecture

Standard causal Transformer backbone (token-mixing = attention, tied LM head), with the per-layer feed-forward network replaced by a MycelBlock.

The MycelBlock (the novel part)

Based on the Meškauskas / Fricker / Moore (2004) Neighbour-Sensing model of fungal colony growth:

  1. The hidden state projects to N = 96 hyphal tips per token, each with a position in a bounded 3-D latent region and a growth vector.
  2. A few differentiable growth steps run: each tip senses the local density field, steers away from the colony's own density (negative autotropism) with persistence, moves, and is re-clamped into the bounded region (the colony can't grow unbounded).
  3. The final [position, growth-vector, sensed-density] of every tip is read out back into the hidden state, behind a family gate.

The density field is evaluated against 16 learnable field centres (a low-rank sample of the field) so cost is O(N·F) per step, not O(N²) — the same mean-field trick that keeps the sibling oscillator block cheap. Health-checking a trained checkpoint shows the tropism parameter converges strongly negative across layers, i.e. the model genuinely learns the grow-away-from-density behaviour rather than leaving it at init.
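
A single growth step against F field centres might look like the following plain-Python sketch. It is illustrative only: the Gaussian kernel, the parameter names (`tropism`, `persistence`, `step`, `bound`, `width`) and the cubic clamp are assumptions, not the contents of mycel.py, and the real block is differentiable and batched.

```python
import math

def growth_step(pos, vel, centres, tropism=-0.5, persistence=0.8,
                step=0.1, bound=1.0, width=0.5):
    """One illustrative growth step for N tips against F field centres."""
    def kernel(p, c):
        d2 = sum((a - b) ** 2 for a, b in zip(p, c))
        return math.exp(-d2 / (2 * width ** 2))

    # 1. Deposit: density at each centre is summed over all tips, O(N*F).
    density = [sum(kernel(p, c) for p in pos) for c in centres]

    new_pos, new_vel = [], []
    for p, v in zip(pos, vel):
        # 2. Sense: gradient of the sampled field at this tip, O(F) per tip.
        grad = [0.0, 0.0, 0.0]
        for c, rho in zip(centres, density):
            w = rho * kernel(p, c) / width ** 2
            for i in range(3):
                grad[i] += w * (c[i] - p[i])   # points toward denser regions
        # 3. Steer: negative tropism turns growth away from density,
        #    while persistence keeps part of the previous direction.
        v2 = [persistence * v[i] + tropism * grad[i] for i in range(3)]
        # 4. Move, then clamp back into the bounded region.
        p2 = [max(-bound, min(bound, p[i] + step * v2[i])) for i in range(3)]
        new_pos.append(p2)
        new_vel.append(v2)
    return new_pos, new_vel, density
```

Tips never interact pairwise here: they only exchange information through the F sampled densities, which is where the O(N·F) cost comes from.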

Trait stations (MycelStations): tiny memory specialists sit at fixed anchor positions in the colony. A tip interacts with a station by proximity — which is emergent from where the tip grew — so which tips use which trait "comes to be" during growth rather than being assigned to a fixed index. The stations hold test-time-writable input/output stores that act as an addressable context memory at inference.
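
The proximity idea can be sketched as a soft assignment of a tip to stations. The source does not specify the interaction kernel, so the softmax over negative squared distance, the `temperature` parameter and both function names below are assumptions made for illustration.

```python
import math

def station_weights(tip, stations, temperature=0.25):
    """Soft assignment of one tip to the stations by distance (assumed form)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(tip, s)) for s in stations]
    logits = [-x / temperature for x in d2]
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def read_stations(tip, stations, stores):
    """Blend each station's stored vector by the tip's proximity weights."""
    w = station_weights(tip, stations)
    dim = len(stores[0])
    return [sum(w[j] * stores[j][i] for j in range(len(stores)))
            for i in range(dim)]
```

Nothing in this sketch indexes a tip to a station; the weights depend only on where the tip ended up after growth.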

Attention

Family attention ported from the Quazimoto v2 stack:

  • MLA low-rank Q/O projections
  • Partial RoPE (nope + rope split), QK-Norm, GQA (4 KV heads)
  • optional DERF (erf attention) and XSA (value-subspace removal) — off in this checkpoint
  • KV cache for incremental decoding (per-layer (k, v) presents threaded through the stack)
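
To show why a per-layer (k, v) cache gives the same result as recomputing the prefix, here is a single-head toy in plain Python. The `project` function is a fixed, hypothetical stand-in for learned projections, and none of the MLA, RoPE or GQA details from the list above are modelled.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over all available keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[i] for wi, v in zip(w, values)) / z
            for i in range(len(values[0]))]

def project(x):
    """Toy q/k/v projections (hypothetical) for one token vector x."""
    q = [2.0 * a for a in x]
    k = [a + 0.5 for a in x]
    v = [a - 0.25 for a in x]
    return q, k, v

def full_recompute(xs):
    """Output at the last position, recomputing k and v for the whole prefix."""
    q, _, _ = project(xs[-1])
    kv = [project(x)[1:] for x in xs]
    return attend(q, [k for k, _ in kv], [v for _, v in kv])

def cached_step(x, cache):
    """Output for one new token, appending its (k, v) to this layer's cache."""
    q, k, v = project(x)
    cache["k"].append(k)
    cache["v"].append(v)
    return attend(q, cache["k"], cache["v"])
```

With the cache, each decode step projects one token instead of the whole prefix, while the attention output at the newest position is unchanged.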

Opt-in family traits (all live in this checkpoint)

Trait                      Role
HRM                        iterative gated hidden-state refinement (random init state, gates open)
MoE                        SwiGLU mixture (4 routed + 1 shared, top-2) refining the trunk
MTP (×4)                   multi-token-prediction draft heads → enables self-speculative decoding
JEPA                       representation-prediction aux loss (train-only; never runs at inference)
Ring Specialists (7/ring)  the trait stations described above
Fractal Phase Seed         seeds tip positions from each token's Mandelbrot orbit angles (gated)
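
As an illustration of the MoE row (4 routed + 1 shared, top-2), the routing arithmetic can be sketched in plain Python. This is a generic top-k router under stated assumptions: the experts are toy scaling functions, not SwiGLU blocks, and renormalising the gates over the selected experts is one common convention that the source does not confirm.

```python
import math

def make_expert(scale):
    """Toy expert: scales its input (a real expert would be a SwiGLU MLP)."""
    return lambda x: [scale * a for a in x]

def moe_forward(x, router_logits, routed_experts, shared_expert, top_k=2):
    """Shared expert output plus a gated sum of the top_k routed experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)[:top_k]
    m = max(router_logits[i] for i in order)
    exps = {i: math.exp(router_logits[i] - m) for i in order}
    z = sum(exps.values())
    gates = {i: e / z for i, e in exps.items()}   # assumed: renormalised over top_k
    out = list(shared_expert(x))                  # the shared expert always runs
    for i, g in gates.items():
        y = routed_experts[i](x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates
```

Only `top_k` of the routed experts are evaluated per token, so compute stays close to that of two experts plus the shared one regardless of how many routed experts exist.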

Config (this checkpoint)

params          79.2M
layers          10
d_model         768
heads           12 (4 KV)
vocab           16,512 (SpikeWhale byte-merge)
block size      2048
tips / token    96, in a 3-D bounded colony
field centres   16
growth steps    3
stations        16

The checkpoint is self-describing: family_config inside the .pt records the exact geometry so the model rebuilds itself on load.


Repository layout

model.py               QuazimotoLM + QuazimotoConfig — the transformer backbone, attention,
                       KV cache, traits (HRM/MoE/MTP/JEPA), generate() and forward_drafts()
mycel.py               MycelBlock (Neighbour-Sensing growth mixer) + MycelStations
family.py              shared family layers (MoE, HRM, specialists, norms, ...)
fractal.py             hierarchical Mandelbrot phase seeding (FractalSeed trait)
instrument.py          zero-cost capture hooks the visualizer reads from
special_tokens.py      ChatML / control-token definitions
spike_tokenizer.py     SpikeWhale byte-merge tokenizer (subclasses PreTrainedTokenizer)
tokenizer.json         the tokenizer vocab / merges (vocab 16,512)
fractal_phase.pt       precomputed hierarchical Mandelbrot phase table (regenerable)

generate.py            inference harness — KV cache + self-speculative decoding + sampling
healthcheck.py         per-layer weight / gate / PPL diagnostics for a checkpoint
visualize.py           builds the 3-D colony dashboard (viz.html) from a generation

train.py               pretraining entry point (streamed multi-corpus blend)
train_sft.py           supervised fine-tuning (ChatML, assistant-only loss masking)
chat_sft.py            chat-format rendering / loss masking helpers used by SFT
train_opd.py           OPD (on-policy distillation) training loop
distill_uld.py         universal-logit-distillation utilities
opd_teacher.py         teacher wrapper for distillation
build_fractal_table.py regenerates fractal_phase.pt
train.bat / train_sft.bat   Windows convenience launchers

chkpt/quazimoto.pt       pretraining checkpoint (step 149,000)
chkpt/quazimoto_sft.pt   SFT checkpoint (step 4,000, ChatML)

Note: the Modal cloud launchers (modal_train.py, modal_sft.py) are intentionally not part of this package. The scripts above run locally on CPU or a single GPU.


Install

pip install -r requirements.txt

Requirements are minimal: torch, numpy, transformers (the tokenizer subclasses PreTrainedTokenizer). Training additionally uses datasets and huggingface_hub. Everything below runs on CPU (slow but functional) or a single GPU.


Quickstart

import torch
from model import QuazimotoLM, QuazimotoConfig
from spike_tokenizer import SpikeTokenizer

ck  = torch.load("chkpt/quazimoto.pt", map_location="cpu", weights_only=False)
cfg = QuazimotoConfig(**ck["family_config"])          # self-describing
model = QuazimotoLM(cfg)
model.load_state_dict(ck["model"], strict=False)
model.eval()
tok = SpikeTokenizer(vocab_file="tokenizer.json")

ids = torch.tensor([tok.encode("The mycelium spreads through the soil", add_special_tokens=False)])
out = model.generate(ids, n_new=80, temperature=0.8, top_k=40)   # KV cache on by default
print(tok.decode(out[0].tolist(), skip_special_tokens=True))

For a chat turn, load chkpt/quazimoto_sft.pt instead of the pretraining checkpoint (the SFT checkpoint was trained on this framing), wrap the prompt in ChatML, and stop on <|im_end|>:

prompt = "<|im_start|><|user|>\nWhat is mycelium?<|im_end|>\n<|im_start|><|assistant|>\n"
ids = torch.tensor([tok.encode(prompt, add_special_tokens=False)])
out = model.generate(ids, n_new=120, temperature=0.7, top_k=40)
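
For multi-turn prompts, the framing shown above can be produced by a small helper. This is a sketch: `render_chatml` and `cut_at_stop` are hypothetical names, roles other than user and assistant are assumed to follow the same pattern, and `cut_at_stop` assumes the text was decoded with the stop marker still present.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render (role, text) pairs in the ChatML framing used in this card."""
    parts = []
    for role, text in messages:
        parts.append(f"<|im_start|><|{role}|>\n{text}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        parts.append("<|im_start|><|assistant|>\n")
    return "".join(parts)

def cut_at_stop(text, stop="<|im_end|>"):
    """Keep only the text before the first stop marker, if any."""
    return text.split(stop, 1)[0]
```

Calling `render_chatml([("user", "What is mycelium?")])` reproduces the prompt string from the snippet above.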

Command-line usage

# plain completion (KV cache on by default)
python generate.py --ckpt chkpt/quazimoto.pt --prompt "In the beginning" --max_new_tokens 80

# chat turn (ChatML framing + stop on <|im_end|>)
python generate.py --ckpt chkpt/quazimoto_sft.pt --chat --prompt "Hello, who are you?"

# interactive REPL
python generate.py --ckpt chkpt/quazimoto_sft.pt --interactive

# self-speculative decoding (MTP heads draft, main head verifies; report acceptance)
python generate.py --ckpt chkpt/quazimoto.pt --speculative --spec_stats

# disable the KV cache (full recompute each step — for comparison)
python generate.py --ckpt chkpt/quazimoto.pt --no_cache

# per-layer diagnostics (weights / gates / PPL)
python healthcheck.py --ckpt chkpt/quazimoto.pt

Sampling knobs: --temperature, --top_k, --top_p, --repetition_penalty, --seed.
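
How temperature, top-k and top-p typically combine can be sketched in plain Python. This is the textbook procedure, not necessarily the order or edge-case handling used by generate.py; treating a non-positive temperature as greedy is an assumption, and the repetition penalty is left out.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one token index from logits with temperature, top-k and top-p."""
    if temperature <= 0:                       # assumed convention: greedy
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [l / temperature for l in logits]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]                  # keep the k most likely tokens
    m = scaled[order[0]]
    probs = [math.exp(scaled[i] - m) for i in order]
    z = sum(probs)
    probs = [p / z for p in probs]
    if top_p < 1.0:
        kept, cum = 0, 0.0
        for p in probs:                        # smallest prefix with mass >= top_p
            kept += 1
            cum += p
            if cum >= top_p:
                break
        order, probs = order[:kept], probs[:kept]
    return rng.choices(order, weights=probs, k=1)[0]
```

Passing `rng=random.Random(seed)` makes a run reproducible, which is the role the --seed flag plays on the command line.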


Live visualizer & Space

visualize.py renders the colony growing in 3-D as the model generates, token by token — hyphal tips linked into a filament web, coloured by local density, with the trait stations shown as orange wire-spheres. It writes a self-contained viz.html (Three.js from a CDN):

python visualize.py --ckpt chkpt/quazimoto_sft.pt --prompt "the mycelium spreads" --tokens 50

A companion Hugging Face Space (Mycel-LM v1) wraps the same architecture in an interactive chat — KV-cache decoding drives the reply while the 3-D colony visualizer animates the growth for the generated tokens.


Training from scratch

python train.py --device cuda --steps 160000 --batch 12 --block 2048 --amp \
    --use-hrm --use-moe --use-mtp --use-jepa --use-ring-specialists --use-fractal-phase-seed \
    --stream --math-frac 0.25 --out chkpt/quazimoto.pt --ckpt-every 500 --resume
  • Tokenizer: SpikeWhale byte-merge, vocab 16,512. (Byte-merge perplexity is tokenizer-inflated; bits/byte is the honest metric.)
  • Pretraining blend: 35% Ultra-FineWeb-L3 / 25% FineWeb-Edu / 25% FineMath / 15% Quazim0t0/PretrainNew, streamed. Streamed datasets are pulled with datasets; gated corpora need huggingface-cli login.
  • --resume continues from the checkpoint at --out. The growth loop is activation-heavy, so keep the batch modest; --amp gives a bf16 speedup on GPU.
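
The tokenizer note above can be made concrete: bits per byte divides the total cross-entropy by the byte length of the text, so it does not depend on how many tokens the tokenizer produced. The helper names below are illustrative and are not functions from train.py or healthcheck.py.

```python
import math

def perplexity(mean_loss_nats):
    """Per-token perplexity from mean cross-entropy in nats."""
    return math.exp(mean_loss_nats)

def bits_per_byte(mean_loss_nats, n_tokens, text):
    """Total cross-entropy in bits divided by the UTF-8 byte length of text."""
    n_bytes = len(text.encode("utf-8"))
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes
```

For example, a model that spends 8 bits in total on the 8-byte string "mycelium" scores 1.0 bits/byte whether the tokenizer split it into 2 tokens or 4, while the per-token perplexity differs between the two cases (16 versus 4).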

Pass --help to train.py for the full trait / optimiser / schedule surface.

Fine-tuning (SFT)

python train_sft.py --init chkpt/quazimoto.pt --out chkpt/quazimoto_sft.pt \
    --steps 4000 --batch 8 --block 2048 --amp
  • Renders a chat mix in ChatML with assistant-only loss masking (chat_sft.py).
  • SFT blend: ultrachat_200k_sft + ultrafeedback-sft + UltraData-SFT-2605/Knowledge + OpenThoughts2-1M-ShortThink.
  • The bundled SFT checkpoint is only 4k steps — the chat format transferred but the model is still shallow.
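
Assistant-only loss masking can be sketched as follows. This is a minimal illustration, not the code in chat_sft.py: `mask_labels` is a hypothetical name, the segmentation into (role, token_ids) pairs is assumed to happen upstream, and whether the turn delimiters themselves are trained on is not stated in the source.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy

def mask_labels(segments):
    """Build (input_ids, labels) where only assistant tokens carry loss.

    segments is a list of (role, token_ids) pairs in conversation order.
    """
    input_ids, labels = [], []
    for role, ids in segments:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                         # supervised tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))   # skipped by the loss
    return input_ids, labels
```

The usual one-position shift between inputs and labels for next-token prediction is left to the training loop.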

The distributed checkpoints carry weights only (optimizer state stripped to keep the download small). Fine-tuning starts a fresh optimizer from them, which is the normal path; only exact resumption of the original pretraining run would need the optimizer state.


Checkpoints

  • chkpt/quazimoto.pt: pretraining checkpoint, step 149,000
  • chkpt/quazimoto_sft.pt: SFT checkpoint, step 4,000 (ChatML, early)

Both embed family_config (self-describing) and load with strict=False so future trait additions stay backward-compatible.


Limitations

  • Not factual. Small-model behaviour: fluent and grammatical, but it invents facts ("the capital of France is the largest and most important part of the world").
  • SFT is early (4k steps) — answers follow the chat format but hallucinate.
  • No safety tuning. No RLHF/guardrails; do not deploy in user-facing settings.
  • Custom architecture — cannot be loaded with AutoModel; use the bundled model.py.
  • This is an experiment in architecture, released to study whether a self-organizing growth process can carry a transformer's generalization. Treat outputs accordingly.

Citation / basis

Neighbour-Sensing model of hyphal growth: Meškauskas, Fricker & Moore (2004), Simulating colonial growth of fungi with the Neighbour-Sensing model of hyphal growth, Mycological Research 108(11).

License: Apache-2.0.

Citation

If you use this model, please cite:

@misc{mycellm79m,
  title        = {Mycel-LM-79M: A ~79M-parameter Neighbour-Sensing fungal-colony language model},
  author       = {Dean Byrne (Quazim0t0)},
  year         = {2026},
  howpublished = {HuggingFace, \url{https://huggingface.co/Quazim0t0/Mycel-LM-79M}},
  note         = {Quazim0t0/Mycel-LM-79M}
}