💬 Join the community

Discord: https://discord.gg/PtuHZDv5ju — builders training their own trading/finance models. Engineering, no signals. Get install help, share work, follow the Qovaryx research devlog. Try the deployed Q-Chat router live via /qchat ask.

Ko-fi: https://ko-fi.com/tjarvis91 — every coffee literally buys GPU time for the next training cycle.

Qovaryx 1B -- Scratch Base (random-init)

Compact AI is not small AI. A 1B-parameter trainable substrate engineered to punch above its weight class on a single consumer GPU. Random-init -- bring your own corpus, train it from scratch. MTP-K=4, GQA, pluggable FFN backends (dense SwiGLU / ternary BitNet-style / sparse low-rank MoE), optional task-specific heads. Apache-2.0.

📖 Read the public research: github.com/thron-j/qovaryx-ai-research -- philosophy plus a 14-entry devlog series (AI without big data centers, legacy brain crystallization, shell-governed cognition, EVO20 training genome, the first cluster shell, when the proxy breaks, cluster shell V4 diagnostic, and more). The architecture choices in this checkpoint are described there. Implementation internals are intentionally withheld.

Compact ≠ small

Frontier-scale models cost a small country's GPU budget to train and a data-center to serve. Most real applications don't need 70B params; they need a focused 1B that does one thing extraordinarily well, fits in 16 GB of consumer VRAM, and stays on the hobbyist/researcher's local hardware -- no API key, no inference bill, no token-rate limit, no provider drift.

The Qovaryx family is built around that thesis. Same component library at 50M / 350M / 1B sizes, all engineered to:

  • Train on a single consumer GPU, Ada- or Blackwell-class (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090). 50M fits in <1 GB; 350M fits on 12 GB at batch=1 with gradient accumulation; 1B fits in 16 GB with bf16 + adamw_8bit.
  • Run inference on local hardware -- no provider lock-in. A serious workstation runs the 1B at usable throughput; the 350M runs on a laptop.
  • Pack modern components into the smaller footprint: Multi-Token Prediction, GQA, ternary and sparse-MoE FFN backends, optional task heads. The architectural choices that make 70B models work also make 1B models punch above their weight class.

This repo is the random-init starting point for that research program. No pretraining has occurred -- the model emits noise out of the box. It exists so you can train the architecture on your own tokens, your own task, your own budget, without paying the wall-clock cost of recreating the scaffold.

Think of it as a trainable substrate -- like nanoGPT or the Pythia step-0 branches -- but with a few modern components pre-wired:

  • Multi-Token Prediction (MTP-K=4) heads for jointly predicting up to 4 tokens ahead
  • Grouped-Query Attention (GQA) with configurable n_head / n_kv_head ratio (default 16:4)
  • Pluggable FFN backends: dense SwiGLU, ternary SwiGLU (BitNet-style with straight-through estimator), low-rank SwiGLU, routed low-rank MoE (4 experts top-1)
  • Optional task heads: 4-class decision head, raw-pixel chart-patch encoder (vision prefix tokens) -- switchable via config
  • Custom 20,242-vocab BPE tokenizer -- domain-leaning but broadly reusable
  • Packed mmap shard format for fast training on cheap consumer GPUs (one-time PackOnce compile, then mmap reads instead of per-row BPE)
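
The packed-shard idea in the last bullet can be illustrated with the standard library alone. This is a minimal sketch, not the actual PackOnce format (which is not published here): the function names `pack_once` and `read_row`, the flat uint16 layout, and the in-memory offsets index are all assumptions made for illustration. uint16 is chosen only because a 20,242-entry vocabulary fits in 16 bits.

```python
import mmap
import os
import tempfile
from array import array

def pack_once(rows, path):
    """Write pre-tokenized rows into one flat uint16 file; return row offsets.

    Hypothetical layout: token ids back to back, native byte order.
    """
    flat = array("H")            # unsigned 16-bit: enough for 20,242 ids
    offsets = [0]
    for ids in rows:
        flat.extend(ids)
        offsets.append(len(flat))
    with open(path, "wb") as f:
        flat.tofile(f)
    return offsets

def read_row(path, offsets, i):
    """Memory-map the shard and slice out row i; no tokenizer call needed."""
    item = array("H").itemsize
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        row = array("H")
        # Only the bytes of this one row are copied out of the mapping.
        row.frombytes(mm[offsets[i] * item:offsets[i + 1] * item])
        return row.tolist()

# Tokenize once (here: made-up ids), then read rows back by index.
shard_path = os.path.join(tempfile.mkdtemp(), "shard.bin")
shard_offsets = pack_once([[1, 2, 3], [20241, 0], [7]], shard_path)
second_row = read_row(shard_path, shard_offsets, 1)
```

The point of the design is that tokenization cost is paid once at pack time; at train time the data loader only slices bytes out of a mapped file, which the OS page cache keeps cheap.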

Training runs on this architecture use a single RTX 5070 Ti (16 GB, Blackwell sm_120) with PyTorch 2.7 + flash-attn 2.7.4 + bnb 0.49.2 (adamw_8bit). The 8-bit optimizer, bf16, and a length curriculum together mean the 50M sibling fits in <1 GB and this 1B model fits in 16 GB at batch=1.


Why build on Qovaryx?

Compact AI is not small AI. Frontier-scale models ask how do we build the biggest intelligence possible? Qovaryx asks the inverse: how much disciplined intelligence can we extract per parameter, per watt, per GPU?

The published *-scratch-base checkpoints are the trainable substrate for that thesis. They are not pre-trained -- they are the random-init starting point, engineered so that one person on one consumer GPU can take the architecture all the way to a focused specialist model without renting a data-center.

| Dimension | Frontier closed (GPT-5, Claude, Gemini) | Frontier open (DeepSeek, Llama, Mistral, Qwen) | Qovaryx |
| --- | --- | --- | --- |
| Primary philosophy | Maximum general intelligence | Open-weight general foundation | Behavioral compression + corrective intelligence |
| Infrastructure | Multi-datacenter clusters | Multi-GPU enterprise / cloud | ✅ Single consumer GPU (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090) |
| Deployment | Cloud / API only | Cloud or local (≥1x A100-class at the larger sizes) | ✅ Local-first, fits in 16 GB VRAM at every size |
| Cost model | Very high compute + ongoing API spend | Moderate-high compute, lower at inference | ✅ Consumer-grade -- power bill + GPU you already own |
| License | Closed weights, ToS-gated | Open weights (license varies) | ✅ Apache-2.0 weights + Apache-2.0 reference trainer |
| Behavioral control | Mostly emergent / safety-layer | Fine-tune dependent | ✅ Deterministic shell + crystal governance -- explicit, not emergent |
| Specialization strategy | One giant universal model | General foundation, fine-tune downstream | ✅ Modular specialists composed via the same compact base |
| Confidence handling | Opaque token probabilities | Token probabilities | ✅ Calibrated 4-class decision head (action-gate-style classifier, optional) |
| Multi-token prediction | Generally next-token only | Generally next-token only | ✅ MTP-K=4 built in (4-tokens-ahead joint head) |
| FFN options | Dense | Dense or MoE (frontier sizes) | ✅ Pluggable: dense SwiGLU / ternary BitNet-style / sparse low-rank MoE -- config flag |
| Attention | MHA / GQA | GQA | ✅ GQA with configurable n_head:n_kv_head ratio |
| Training tokenizer | Provider-controlled | Provider-controlled | ✅ You bundle it (20,242-vocab BPE shipped; replaceable) |
| Vision input | Provider plugin | Provider plugin | ✅ Optional raw-pixel chart-patch encoder -- switchable per-row at train time |

✅ = something Qovaryx provides out of the box on the scratch-base release.

This is not a claim that Qovaryx beats GPT-5 on MMLU. It will not. It is a claim that the right shape of small can do real work where the right shape of huge is unavailable, unaffordable, or unowned.

Why this base helps you build

  • The components are already wired -- MTP-K, GQA, decision head, ternary/MoE FFN backends, chart patch encoder. Switchable via config. Skip three months of architecture work.
  • It fits -- 50M fits anywhere; 350M fits on a 12 GB card; 1B fits on a 16 GB consumer card with adamw_8bit + bf16. You can actually train these on hardware you can actually buy.
  • It's honest about what's withheld -- the architecture is open. The crystallization recipes, eval gold, verifier internals, and shell logic stay private. You build on Qovaryx's substrate; we don't pretend you're getting the whole stack.
  • Apache-2.0 -- research, hobby, commercial. Attribution appreciated, not legally required.

Qovaryx is NOT trying to be

  • A frontier-IQ replacement
  • A benchmark champion on broad evals
  • A chat product
  • A substitute for engineering on the wrapper / verifier / shell -- those are where compact AI earns its keep

Sizes in this family -- consumer-GPU first

| Repo | Params | d_model | n_layer | n_head | n_kv_head | d_ff | VRAM @ training (bf16, adamw_8bit) | VRAM @ inference (bf16) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tjarvis91/qovaryx-50m-scratch-base | ~47M | 512 | 12 | 8 | 2 | 1408 | <1 GB | <0.5 GB |
| tjarvis91/qovaryx-350m-scratch-base | ~352M | 1024 | 24 | 16 | 4 | 2816 | ~3 GB | ~1.5 GB |
| tjarvis91/qovaryx-1b-scratch-base | ~1.05B | 2048 | 22 | 16 | 4 | 5504 | ~12 GB | ~3 GB |

All three share the same component library and tokenizer -- pick the size your GPU can hold. You do not need an A100 to train these. A 16 GB consumer card handles every size in this family. A 12 GB card handles 50m + 350m comfortably. A 24 GB card lets you push 1B with larger batches.
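
For a rough feel of where the numbers in the size table come from, the sketch below estimates backbone parameters and the per-token KV-cache cost from the table's dimensions. It rests on stated assumptions, not on the actual implementation: bias-free projections, a three-matrix SwiGLU FFN, a single tied embedding matrix, and nothing counted for norms, MTP heads, or optional task heads. Under those assumptions the estimate lands somewhat below the table's parameter counts, which presumably include the pieces omitted here.

```python
VOCAB = 20_242  # tokenizer size stated in this card

def backbone_params(d_model, n_layer, n_head, n_kv_head, d_ff, vocab=VOCAB):
    """Approximate parameter count of the decoder stack plus one embedding."""
    head_dim = d_model // n_head
    # Q and O are d_model x d_model; K and V shrink with the number of KV heads.
    attn = 2 * d_model * d_model + 2 * d_model * n_kv_head * head_dim
    ffn = 3 * d_model * d_ff                 # gate, up, down projections
    return n_layer * (attn + ffn) + vocab * d_model

def kv_cache_bytes_per_token(d_model, n_layer, n_head, n_kv_head, bytes_per_elem=2):
    """Bytes of K+V cache added per generated token (bf16 by default)."""
    head_dim = d_model // n_head
    return 2 * n_layer * n_kv_head * head_dim * bytes_per_elem

params_1b = backbone_params(2048, 22, 16, 4, 5504)
weights_gb_bf16 = params_1b * 2 / 1e9                          # 2 bytes per bf16 weight
gqa_cache = kv_cache_bytes_per_token(2048, 22, 16, 4)          # 16:4 GQA
mha_cache = kv_cache_bytes_per_token(2048, 22, 16, 16)         # same model with full MHA
print(f"{params_1b:,} params, ~{weights_gb_bf16:.2f} GB bf16, "
      f"KV cache {gqa_cache:,} B/token vs {mha_cache:,} B/token with MHA")
```

With the default 16:4 ratio the KV cache is a quarter of what full multi-head attention would need, which is a large part of why long-context inference stays inside consumer VRAM.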


TL;DR -- what's in this repo

| File | Purpose |
| --- | --- |
| config.json | Architecture spec (DecoderConfig) -- d_model, n_layer, FFN kind, MTP-K, GQA ratio, vocab, max_seq_len |
| pytorch_model.bin | Random-init weights (Glorot/Xavier per layer kind), bf16 |
| tokenizer.json | 20,242-vocab BPE (custom; domain-leaning but general-purpose) |
| tokenizer_config.json | Tokenizer wrapping config |
| generation_config.json | Default sampling params |
| modeling_qovaryx.py | FinanceDecoder class (named for legacy reasons; the class is task-agnostic) + heads + FFN backends |
| train_quickstart.py | A nanoGPT-style 200-line training loop you can run today |
| README.md | This card |

Because the architecture is custom, loading requires trust_remote_code=True. Apart from that flag, it loads like any other HF model.


Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("tjarvis91/qovaryx-1b-scratch-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tjarvis91/qovaryx-1b-scratch-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

# Out-of-the-box this generates noise -- model is random-init by design.
# Train it on your own corpus, then it will be useful.
out = model.generate(tok("hello", return_tensors="pt").input_ids.cuda(), max_new_tokens=20)
print(tok.decode(out[0]))

Minimal training loop (single GPU, bf16, AdamW):

import torch

# Assumes `model` from the Quickstart above and `your_dataloader` yielding
# dicts of tensors with "input_ids" (and no "labels" key of their own).
model.train()
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1, betas=(0.9, 0.95))
for step, batch in enumerate(your_dataloader):
    batch = {k: v.cuda() for k, v in batch.items()}
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        # Labels are the inputs; the causal-LM shift is assumed to happen inside the model.
        out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    opt.zero_grad(set_to_none=True)
    if step % 10 == 0:
        print(f"step={step} loss={out.loss.item():.4f}")

A full reference recipe (length curriculum + MTP-K + decision-head + packed shards + adamw_8bit for 16 GB cards) is in train_quickstart.py.
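
To make the MTP-K part of that recipe concrete, here is a minimal sketch of how targets for K=4 heads could be laid out: head k at position t is asked to predict the token k steps ahead. The helper name `mtp_targets` and the use of -100 as the "ignore" label (the default ignore index of PyTorch's cross-entropy) are assumptions for illustration; the reference trainer may build its targets differently.

```python
IGNORE = -100  # assumed "skip this position" label, as in PyTorch's cross_entropy

def mtp_targets(input_ids, k=4):
    """Return k target rows; row j holds the token (j+1) steps ahead of each position.

    Positions whose target would fall past the end of the sequence get IGNORE.
    """
    n = len(input_ids)
    return [
        [input_ids[t + offset] if t + offset < n else IGNORE for t in range(n)]
        for offset in range(1, k + 1)
    ]

targets = mtp_targets([10, 11, 12, 13, 14, 15], k=4)
for offset, row in enumerate(targets, start=1):
    print(f"t+{offset}: {row}")
```

The first row is the ordinary next-token target; the remaining rows are what the extra heads train against, each with fewer valid positions near the end of the sequence.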


FFN backends -- switchable via config

Set ffn_kind in config.json (or via from_pretrained(..., ffn_kind=...)):

| ffn_kind | Description | When to use |
| --- | --- | --- |
| swiglu | Dense SwiGLU (the obvious baseline) | Default. Fastest wall-clock per step. |
| ternary_swiglu | BitNet-style ternary weights with straight-through estimator | When you care about deployable model size and accept ~3x slower training |
| lowrank_swiglu | Factorized projections (rank ffn_rank) | Param compression without sparsity |
| routed_lowrank_swiglu | Sparse MoE: ffn_experts top-ffn_top_k routing | When you want capacity without dense FLOPs |

The backends and the MTP heads are inspired by published work (BitNet, DeepSeek-V3 MTP, Mixtral, GShard, ST-MoE). The novelty here is that all four backends share one trainer, one tokenizer, and one packed-shard pipeline -- so switching backends is a config edit, not a fork.
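
As an illustration of what "BitNet-style ternary weights" means, the sketch below applies the absmean quantizer described in the BitNet b1.58 paper to a flat list of weights. Whether ternary_swiglu uses exactly this scheme is an assumption; the function names are made up for the example.

```python
def ternarize(weights, eps=1e-8):
    """Map full-precision weights to {-1, 0, +1} plus one shared scale.

    Absmean scheme: divide by the mean absolute value, round, clip to [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Effective weights used in the forward pass."""
    return [scale * q for q in quantized]

latent = [0.9, -0.05, 0.4, -1.2, 0.02]   # made-up full-precision weights
q, s = ternarize(latent)
print(q, round(s, 3), dequantize(q, s))
```

During training, a straight-through estimator keeps the full-precision "latent" weights, runs the forward pass with the quantized ones, and passes gradients through the rounding step as if it were the identity. In PyTorch this is commonly written as `w + (w_q - w).detach()`.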


Optional task heads

The base architecture exposes two opt-in heads, off by default:

  • decision_head_enabled -- 4-class classification head pooled at a chosen token position. Useful for downstream policy / preference / structured-action tasks. Co-trained via masked CE.
  • chart_patch_encoder_enabled -- strided-Conv2d raw-pixel encoder that converts an input image into prefix tokens, fed into the causal decoder before the text tokens. Useful for any text+image task; not specific to charts despite the name.

Both can be turned on per-row at training time (the trainer reads per-example metadata), so you can mix unimodal and multimodal rows in the same shard. Both are random-init in this repo and need to be trained alongside the LM head if you use them.
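
The per-row switching described above can be pictured as a masked cross-entropy: rows that carry a decision label contribute to the loss, rows that do not are skipped. The sketch below is plain Python under stated assumptions (a label of `None` marks an unlabeled row; `masked_cross_entropy` is a hypothetical name), not the trainer's actual code.

```python
import math

def masked_cross_entropy(logits_rows, labels, class_weights=None):
    """Weighted-mean cross-entropy over labeled rows only.

    logits_rows: one list of 4 class logits per row.
    labels: class index per row, or None when the row has no decision label.
    """
    total, weight_sum = 0.0, 0.0
    for logits, label in zip(logits_rows, labels):
        if label is None:                      # masked row: contributes nothing
            continue
        m = max(logits)                        # stabilized log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        w = 1.0 if class_weights is None else class_weights[label]
        total += w * (log_z - logits[label])
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0

batch_logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0], [2.0, 1.0, 0.0, -1.0]]
batch_labels = [2, None, 0]                    # middle row is text-only
loss = masked_cross_entropy(batch_logits, batch_labels)
```

Because masked rows add nothing to either the numerator or the normalizer, unimodal and multimodal rows can share a batch without diluting the decision-head gradient.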


Suggested training recipes

These are starting points -- tune to your data. Single 5070 Ti / RTX 4080-class GPU assumed.

50M baseline (LM only)

target_tokens:           500M-2B
tokens_per_batch:        4096
grad_accum_steps:        8
max_seq_len:             2048
length_curriculum:       (512,1000)(1024,3000)(2048,10000)(4096,-1)
lr:                      2e-4
warmup_steps:            500
weight_decay:            0.1
optimizer:               adamw_8bit (bf16)
attn_backend:            flash (FA2 if available, else PyTorch SDPA)
ffn_kind:                swiglu
mtp_weight:              0.3

350M with MTP + decision head

target_tokens:           5B-20B
tokens_per_batch:        8192
grad_accum_steps:        16
max_seq_len:             4096
ffn_kind:                ternary_swiglu  (or swiglu)
mtp_weight:              0.3
decision_weight:         0.5
class_weighted_decision: true
calibration_loss_weight: 0.2  (if you want a confidence-calibrated head)

1B with sparse MoE

target_tokens:           50B-200B
ffn_kind:                routed_lowrank_swiglu
ffn_rank:                128
ffn_experts:             4
ffn_top_k:               1
mixed_precision:         bf16
optimizer:               adamw_8bit
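
To show what ffn_experts=4, ffn_top_k=1, and ffn_rank=128 imply, here is a small sketch of top-1 routing in the Switch-Transformer style (the chosen expert's output is scaled by its gate probability) plus the parameter arithmetic for a low-rank projection. The routing rule and the function names are assumptions; the actual routed_lowrank_swiglu internals are not published in this card.

```python
import math

def route_top1(gate_logits):
    """Softmax over expert logits; return (chosen expert index, its probability)."""
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

def moe_forward(x, gate_logits, experts):
    """Run only the selected expert and scale its output by the gate probability."""
    expert, p = route_top1(gate_logits)
    return [p * y for y in experts[expert](x)]

def lowrank_params(d_in, d_out, rank):
    """Parameters of a d_in x d_out projection factorized as (d_in x r)(r x d_out)."""
    return rank * (d_in + d_out)

# Four toy "experts" that just scale their input by different factors.
toy_experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
chosen, prob = route_top1([0.0, 2.0, 0.0, 0.0])
output = moe_forward([1.0, -1.0], [0.0, 2.0, 0.0, 0.0], toy_experts)
dense = 2048 * 5504
factored = lowrank_params(2048, 5504, 128)
print(chosen, round(prob, 4), output, dense, factored)
```

At the 1B dimensions, one rank-128 projection holds under a tenth of the parameters of its dense counterpart, and with top-1 routing only one of the four experts runs per token.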

What this is NOT

  • Not a pretrained model. Out-of-the-box outputs are noise. Random initialization is the entire point.
  • Not finance-specific despite the legacy class name FinanceDecoder. The architecture is task-agnostic; the BPE tokenizer leans toward finance-aware merges but works on any English text.
  • Not a drop-in replacement for Llama / Qwen / Mistral. The component set is different (MTP-K heads in particular need their own training term).
  • Not adversarially robust. It's a substrate.
  • Not a tiny / toy model. 1B params at bf16 hits 2 GB on disk; trained well, it competes seriously on focused tasks. "Compact" means efficient, not weak.

License

Apache-2.0. Use it for research, commercial work, hobby projects -- whatever. Attribution appreciated but not legally required.


Research notes

Qovaryx is part of a broader local-sovereign-AI research program. Higher-level framings, architectural rationale, and ablation studies are published progressively at github.com/thron-j/qovaryx-ai-research.

Real training runs on this architecture -- the Cluster Shell V1 audit

This scratch-base is the trainable substrate for the Cluster Shell committee architecture described in the Qovaryx research devlog. The V1 readiness gate, run on the actual trained specialist heads, looked like this:

| Specialist | Train rows | Majority baseline | Linear baseline | GBDT baseline | Gate verdict |
| --- | --- | --- | --- | --- | --- |
| Q-Penny | 150K | 52.90% | 73.03% | 73.84% | PASS |
| Q-Veto | 150K | 57.57% | 72.23% | 79.93% | PASS |
| Q-Router | 150K | 24.00% | 76.54% | 84.62% | PASS |
| Q-2yr | 300K | 50.04% | 75.38% | 75.93% | PASS |
| Q-180d | 300K | 50.09% | 74.46% | 74.95% | PASS |

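Five specialists, deterministic 5% holdouts, and a GBDT baseline at least +20pp over the majority-class floor for each one (the linear baseline also clears +20pp everywhere except Q-Veto, at +14.66pp). The architecture clears its falsifiability gate on fresh data -- what makes that gate honest is documented in the devlog entries "evaluation discipline" and "when the proxy breaks".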
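
The lift over the majority-class floor can be recomputed directly from the audit table. The snippet below only does that arithmetic on the published figures; the dictionary layout and function name are illustrative.

```python
# (majority %, linear %, GBDT %) per specialist, copied from the audit table.
AUDIT = {
    "Q-Penny":  (52.90, 73.03, 73.84),
    "Q-Veto":   (57.57, 72.23, 79.93),
    "Q-Router": (24.00, 76.54, 84.62),
    "Q-2yr":    (50.04, 75.38, 75.93),
    "Q-180d":   (50.09, 74.46, 74.95),
}

def lifts(majority, linear, gbdt):
    """Percentage-point lift of each baseline over the majority-class floor."""
    return round(linear - majority, 2), round(gbdt - majority, 2)

LIFTS = {name: lifts(*row) for name, row in AUDIT.items()}
for name, (linear_pp, gbdt_pp) in LIFTS.items():
    print(f"{name:9s} linear +{linear_pp:.2f}pp  GBDT +{gbdt_pp:.2f}pp")
```

The GBDT lift is above +20pp for all five specialists; the smallest is Q-Penny at +20.94pp. The linear lift for Q-Veto is +14.66pp.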

Current state: a V4 diagnostic experiment is queued to discriminate between two competing hypotheses on why two of the second-generation specialists stalled at the data ceiling. The live journal: Cluster Shell V4 -- Diagnostic.

Research index: https://github.com/thron-j/qovaryx-ai-research

Implementation details, training corpora, and certain ablation specifics are intentionally withheld in the public devlog. The framings are publishable; the internals are not. Collaboration inquiries: jeherizonllc@gmail.com.


Support

If this base helps you build something, support continued development:

☕ ko-fi.com/tjarvis91

Every contribution funds GPU time and the next-generation Qovaryx training runs.


Sibling models in this lineage

  • tjarvis91/qovaryx-50m-scratch-base
  • tjarvis91/qovaryx-350m-scratch-base

Citation

@misc{qovaryx-scratch-base-2026,
  title     = {Qovaryx: A Compact Decoder Architecture with Multi-Token Prediction, GQA, and Pluggable FFN Backends},
  author    = {Jarvis, Thomas},
  year      = {2026},
  month     = {May},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/tjarvis91/qovaryx-1b-scratch-base}
}

Status

Random-init checkpoint as of 2026-05-22. Future updates will add trained sibling repos with downstream task heads enabled (decision head + chart-patch encoder variants). Watch the org page for new releases.
