💬 Join the community
Discord: https://discord.gg/PtuHZDv5ju -- builders training their own trading/finance models. Engineering, no signals. Get install help, share work, follow the Qovaryx research devlog. Try the deployed Q-Chat router live via /qchat ask.
Ko-fi: https://ko-fi.com/tjarvis91 -- every coffee literally buys GPU time for the next training cycle.
Qovaryx 1B -- Scratch Base (random-init)
Compact AI is not small AI. A 1B-parameter trainable substrate engineered to punch above its weight class on a single consumer GPU. Random-init -- bring your own corpus, train it from scratch. MTP-K=4, GQA, pluggable FFN backends (dense SwiGLU / ternary BitNet-style / sparse low-rank MoE), optional task-specific heads. Apache-2.0.
📖 Read the public research: github.com/thron-j/qovaryx-ai-research -- philosophy, devlog series (14 entries: AI without big data centers, legacy brain crystallization, shell-governed cognition, EVO20 training genome, the first cluster shell, when the proxy breaks, cluster shell V4 diagnostic, more). The architecture choices in this checkpoint are described there. Implementation internals are intentionally withheld.
Compact ≠ small
Frontier-scale models cost a small country's GPU budget to train and a data-center to serve. Most real applications don't need 70B params; they need a focused 1B that does one thing extraordinarily well, fits in 16 GB of consumer VRAM, and stays on the hobbyist/researcher's local hardware -- no API key, no inference bill, no token-rate limit, no provider drift.
The Qovaryx family is built around that thesis. Same component library at 50M / 350M / 1B sizes, all engineered to:
- Train on a single Ada- or Blackwell-class consumer GPU (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090). 50M fits in <1 GB; 350M fits on 12 GB at batch=1 with gradient accumulation; 1B fits in 16 GB with bf16 + adamw_8bit.
- Run inference on local hardware -- no provider lock-in. A serious workstation runs the 1B at usable throughput; the 350M runs on a laptop.
- Pack modern components into the smaller footprint: Multi-Token Prediction, GQA, ternary and sparse-MoE FFN backends, optional task heads. The architectural choices that make 70B models work also make 1B models punch above their weight class.
This repo is the random-init starting point for that research program. No pretraining has occurred -- the model emits noise out of the box. It exists so you can train the architecture on your own tokens, your own task, your own budget, without paying the wall-clock cost of recreating the scaffold.
Think of it as a trainable substrate -- like nanoGPT or the Pythia step-0 branches -- but with a few modern components pre-wired:
- Multi-Token Prediction (MTP-K=4) heads for jointly predicting up to 4 tokens ahead
- Grouped-Query Attention (GQA) with configurable n_head/n_kv_head ratio (default 16:4)
- Pluggable FFN backends: dense SwiGLU, ternary SwiGLU (BitNet-style with straight-through estimator), low-rank SwiGLU, routed low-rank MoE (4 experts, top-1)
- Optional task heads: 4-class decision head, raw-pixel chart-patch encoder (vision prefix tokens) -- switchable via config
- Custom 20,242-vocab BPE tokenizer -- domain-leaning but broadly reusable
- Packed mmap shard format for fast training on cheap consumer GPUs (one-time PackOnce compile, then mmap reads instead of per-row BPE)
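The packed-shard idea in the last bullet can be illustrated with the standard library alone. This is a minimal sketch, not the actual PackOnce format, which is not documented in this card: the flat unsigned 16-bit layout (wide enough for a 20,242-entry vocabulary) and the names `pack_once` and `read_window` are assumptions made for illustration.

```python
import mmap
import os
import tempfile
from array import array

def pack_once(token_rows, path):
    """Concatenate pre-tokenized rows into one flat uint16 file (hypothetical layout)."""
    flat = array("H")  # unsigned 16-bit ids; a 20,242-entry vocab fits
    for row in token_rows:
        flat.extend(row)
    with open(path, "wb") as f:
        flat.tofile(f)
    return len(flat)

def read_window(path, start, length):
    """Read `length` token ids from token offset `start` through a read-only mmap."""
    itemsize = array("H").itemsize
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        chunk = mm[start * itemsize:(start + length) * itemsize]
    out = array("H")
    out.frombytes(chunk)
    return out.tolist()

# Tokenize once, then serve training windows without touching the tokenizer again.
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "shard.bin")
    total = pack_once([[11, 12, 13], [21, 22]], shard)
    window = read_window(shard, start=1, length=3)
print(total, window)
```

The point of the design is that the tokenizer cost is paid once at pack time; each training step afterwards is a slice of a memory-mapped file.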
The reference training setup is a single RTX 5070 Ti (16 GB, Blackwell sm_120) running PyTorch 2.7 + flash-attn 2.7.4 + bitsandbytes (bnb) 0.49.2 (adamw_8bit). With the 8-bit optimizer, bf16, and a length curriculum, a 50M-param sibling fits in <1 GB and the 1B fits in 16 GB at batch=1.
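A rough back-of-the-envelope check of those memory figures, as a sketch under stated assumptions: bf16 weights and gradients at 2 bytes per parameter each, and an 8-bit AdamW holding two 1-byte moments per parameter. Activations, quantization metadata, and any tensors the optimizer keeps in higher precision are ignored, so real usage will be higher. The function name `training_state_gb` is hypothetical; the parameter counts come from the sizes table in this card.

```python
def training_state_gb(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=2):
    """Approximate GB for weights + gradients + optimizer state (activations excluded)."""
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9

# Parameter counts as listed in this card's sizes table.
for name, n_params in [("50M", 47e6), ("350M", 352e6), ("1B", 1.05e9)]:
    print(f"{name}: ~{training_state_gb(n_params):.2f} GB before activations")
```

Under these assumptions the 1B model needs roughly 6.3 GB for persistent training state, which leaves the remainder of a 16 GB card for activations.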
Why build on Qovaryx?
Compact AI is not small AI. Frontier-scale models ask how do we build the biggest intelligence possible? Qovaryx asks the inverse: how much disciplined intelligence can we extract per parameter, per watt, per GPU?
The published *-scratch-base checkpoints are the trainable substrate for that thesis. They are not pre-trained -- they are the random-init starting point, engineered so that one person on one consumer GPU can take the architecture all the way to a focused specialist model without renting a data-center.
| Dimension | Frontier closed (GPT-5, Claude, Gemini) | Frontier open (DeepSeek, Llama, Mistral, Qwen) | Qovaryx |
|---|---|---|---|
| Primary philosophy | Maximum general intelligence | Open-weight general foundation | Behavioral compression + corrective intelligence |
| Infrastructure | Multi-datacenter clusters | Multi-GPU enterprise / cloud | ✅ Single consumer GPU (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090) |
| Deployment | Cloud / API only | Cloud or local (≥1x A100-class at the larger sizes) | ✅ Local-first, fits in 16 GB VRAM at every size |
| Cost model | Very high compute + ongoing API spend | Moderate-high compute, lower at inference | ✅ Consumer-grade -- power bill + GPU you already own |
| License | Closed weights, ToS-gated | Open weights (license varies) | ✅ Apache-2.0 weights + Apache-2.0 reference trainer |
| Behavioral control | Mostly emergent / safety-layer | Fine-tune dependent | ✅ Deterministic shell + crystal governance -- explicit, not emergent |
| Specialization strategy | One giant universal model | General foundation, fine-tune downstream | ✅ Modular specialists composed via the same compact base |
| Confidence handling | Opaque token probabilities | Token probabilities | ✅ Calibrated 4-class decision head (action-gate-style classifier, optional) |
| Multi-token prediction | Generally next-token only | Generally next-token only | ✅ MTP-K=4 built in (4-tokens-ahead joint head) |
| FFN options | Dense | Dense or MoE (frontier sizes) | ✅ Pluggable: dense SwiGLU / ternary BitNet-style / sparse low-rank MoE -- config flag |
| Attention | MHA / GQA | GQA | ✅ GQA with configurable n_head:n_kv_head ratio |
| Training tokenizer | Provider-controlled | Provider-controlled | ✅ You bundle it (20,242-vocab BPE shipped; replaceable) |
| Vision input | Provider plugin | Provider plugin | ✅ Optional raw-pixel chart-patch encoder -- switchable per-row at train time |
✅ = something Qovaryx provides out of the box on the scratch-base release.
This is not a claim that Qovaryx beats GPT-5 on MMLU. It will not. It is a claim that the right shape of small can do real work where the right shape of huge is unavailable, unaffordable, or unowned.
Why this base helps you build
- The components are already wired -- MTP-K, GQA, decision head, ternary/MoE FFN backends, chart patch encoder. Switchable via config. Skip three months of architecture work.
- It fits -- 50M fits anywhere; 350M fits on a 12 GB card; 1B fits on a 16 GB consumer card with adamw_8bit + bf16. You can actually train these on hardware you can actually buy.
- It's honest about what's withheld -- the architecture is open. The crystallization recipes, eval gold, verifier internals, and shell logic stay private. You build on Qovaryx's substrate; we don't pretend you're getting the whole stack.
- Apache-2.0 -- research, hobby, commercial. Attribution appreciated, not legally required.
Qovaryx is NOT trying to be
- A frontier-IQ replacement
- A benchmark champion on broad evals
- A chat product
- A substitute for engineering on the wrapper / verifier / shell -- those are where compact AI earns its keep
Sizes in this family -- consumer-GPU first
| Repo | Params | d_model | n_layer | n_head | n_kv_head | d_ff | VRAM @ training (bf16, adamw_8bit) | VRAM @ inference (bf16) |
|---|---|---|---|---|---|---|---|---|
| tjarvis91/qovaryx-50m-scratch-base | ~47M | 512 | 12 | 8 | 2 | 1408 | <1 GB | <0.5 GB |
| tjarvis91/qovaryx-350m-scratch-base | ~352M | 1024 | 24 | 16 | 4 | 2816 | ~3 GB | ~1.5 GB |
| tjarvis91/qovaryx-1b-scratch-base | ~1.05B | 2048 | 22 | 16 | 4 | 5504 | ~12 GB | ~3 GB |
All three share the same component library and tokenizer -- pick the size your GPU can hold. You do not need an A100 to train these. A 16 GB consumer card handles every size in this family. A 12 GB card handles the 50M and 350M models comfortably. A 24 GB card lets you push the 1B with larger batches.
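The parameter counts in the table can be sanity-checked from the listed dimensions. The sketch below assumes a standard decoder layout (bias-free Q/K/V/O projections with GQA-reduced K and V, a three-matrix SwiGLU FFN, and tied input/output embeddings over the 20,242-token vocabulary); it ignores norms, MTP heads, and the optional task heads, so it is expected to undercount. `estimate_params` is a hypothetical helper, not part of this repo.

```python
def estimate_params(d_model, n_layer, n_head, n_kv_head, d_ff, vocab=20242, tied=True):
    """Rough decoder parameter count; norms, MTP heads, and task heads are ignored."""
    head_dim = d_model // n_head
    kv_dim = n_kv_head * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q and O full width; K and V shrunk by GQA
    ffn = 3 * d_model * d_ff                             # SwiGLU: gate, up, and down projections
    embed = vocab * d_model * (1 if tied else 2)         # assumption: tied embeddings by default
    return n_layer * (attn + ffn) + embed

# (d_model, n_layer, n_head, n_kv_head, d_ff) copied from the sizes table.
sizes = {
    "50m": (512, 12, 8, 2, 1408),
    "350m": (1024, 24, 16, 4, 2816),
    "1b": (2048, 22, 16, 4, 5504),
}
for name, dims in sizes.items():
    print(f"{name}: ~{estimate_params(*dims) / 1e6:.1f}M")
```

The estimates land somewhat below the table's figures for all three sizes, which is consistent with the components the sketch leaves out.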
TL;DR -- what's in this repo
| File | Purpose |
|---|---|
| config.json | Architecture spec (DecoderConfig) -- d_model, n_layer, FFN kind, MTP-K, GQA ratio, vocab, max_seq_len |
| pytorch_model.bin | Random-init weights (Glorot/Xavier per layer kind), bf16 |
| tokenizer.json | 20,242-vocab BPE (custom; domain-leaning but general-purpose) |
| tokenizer_config.json | Tokenizer wrapping config |
| generation_config.json | Default sampling params |
| modeling_qovaryx.py | FinanceDecoder class (named for legacy reasons; the class is task-agnostic) + heads + FFN backends |
| train_quickstart.py | A nanoGPT-style 200-line training loop you can run today |
| README.md | This card |
Because the architecture is custom, loading requires trust_remote_code=True. Otherwise, load it like any other HF model.
Quickstart
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("tjarvis91/qovaryx-1b-scratch-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tjarvis91/qovaryx-1b-scratch-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

# Out-of-the-box this generates noise -- the model is random-init by design.
# Train it on your own corpus, then it will be useful.
out = model.generate(tok("hello", return_tensors="pt").input_ids.cuda(), max_new_tokens=20)
print(tok.decode(out[0]))
Minimal training loop (single GPU, bf16, AdamW):
import torch
from torch.utils.data import DataLoader

# `model` comes from the Quickstart above; `your_dataloader` is a DataLoader you
# supply that yields dicts of tensors containing at least "input_ids".
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1, betas=(0.9, 0.95))
model.train()
for step, batch in enumerate(your_dataloader):
    batch = {k: v.cuda() for k, v in batch.items()}
    # Forward pass under bf16 autocast; backward runs outside the context.
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # cap the gradient norm at 1.0
    opt.step()
    opt.zero_grad()
    if step % 10 == 0:
        print(f"step={step} loss={out.loss.item():.4f}")
A full reference recipe (length curriculum + MTP-K + decision-head + packed shards + adamw_8bit for 16 GB cards) is in train_quickstart.py.
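To make the MTP-K part of that recipe concrete, here is a minimal sketch of how targets for K=4 look-ahead heads can be built and how auxiliary losses might be folded into the main loss. The weighting rule in `combine_losses` (main loss plus `mtp_weight` times the mean auxiliary loss) is an assumption for illustration; the way train_quickstart.py actually combines the terms is not described in this card. Both function names are hypothetical.

```python
IGNORE = -100  # label value commonly skipped by cross-entropy losses

def mtp_targets(token_ids, k_max=4):
    """Return k_max target rows; row k-1 holds the token k steps ahead of each position."""
    n = len(token_ids)
    return [
        [token_ids[t + k] if t + k < n else IGNORE for t in range(n)]
        for k in range(1, k_max + 1)
    ]

def combine_losses(next_token_loss, extra_head_losses, mtp_weight=0.3):
    """Assumed weighting: main loss plus mtp_weight times the mean auxiliary loss."""
    if not extra_head_losses:
        return next_token_loss
    return next_token_loss + mtp_weight * sum(extra_head_losses) / len(extra_head_losses)

rows = mtp_targets([5, 6, 7, 8, 9, 10])
for k, row in enumerate(rows, start=1):
    print(f"head +{k}: {row}")
total = combine_losses(2.0, [3.0, 4.0, 5.0])
```

Positions near the end of a sequence have no token k steps ahead, so those slots are marked with the ignore label and contribute nothing to that head's loss.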
FFN backends -- switchable via config
Set ffn_kind in config.json (or via from_pretrained(..., ffn_kind=...)):
| ffn_kind | Description | When to use |
|---|---|---|
| swiglu | Dense SwiGLU (the obvious baseline) | Default. Fastest wall-clock per step. |
| ternary_swiglu | BitNet-style ternary weights with straight-through estimator | When you care about deployable model size and accept ~3x slower training |
| lowrank_swiglu | Factorized projections (rank ffn_rank) | Param compression without sparsity |
| routed_lowrank_swiglu | Sparse MoE: ffn_experts experts with top-ffn_top_k routing | When you want capacity without dense FLOPs |
These are inspired by published work (BitNet, DeepSeek-V3 MTP, Mixtral, GShard, ST-MoE). The novelty here is that all four share one trainer, one tokenizer, and one packed-shard pipeline -- so switching backends is a config edit, not a fork.
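For readers unfamiliar with BitNet-style ternary weights, the sketch below shows absmean ternarization in plain Python: weights are scaled by their mean absolute value, then rounded and clipped to {-1, 0, +1}. It illustrates the general technique only; whether ternary_swiglu in this repo uses exactly this scaling is an assumption, and the straight-through estimator (which lets gradients update the latent full-precision weights during training) is not shown. `ternarize` and `dequantize` are hypothetical names.

```python
def ternarize(weights, eps=1e-8):
    """Absmean ternarization: scale by mean |w|, then round and clip to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map ternary codes back to approximate weight values."""
    return [c * scale for c in codes]

codes, scale = ternarize([0.9, -0.05, 0.4, -1.3, 0.02])
print(codes, round(scale, 3))
print(dequantize(codes, scale))
```

Each weight is stored as one of three codes plus a shared scale, which is where the smaller deployable size comes from.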
Optional task heads
The base architecture exposes two opt-in heads, off by default:
- decision_head_enabled -- 4-class classification head pooled at a chosen token position. Useful for downstream policy / preference / structured-action tasks. Co-trained via masked CE.
- chart_patch_encoder_enabled -- strided-Conv2d raw-pixel encoder that converts an input image into prefix tokens, fed into the causal decoder before the text tokens. Useful for any text+image task; not specific to charts despite the name.
Both can be turned on per-row at training time (the trainer reads per-example metadata), so you can mix unimodal and multimodal rows in the same shard. Both are random-init in this repo and need to be trained alongside the LM head if you use them.
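The "masked CE" used to co-train the decision head can be sketched as a cross-entropy that averages only over rows carrying a decision label. This is an illustration under assumptions: the sentinel value -100 for unlabeled rows and the helper name `masked_cross_entropy` are not taken from the repo.

```python
import math

IGNORE = -100  # assumed sentinel for rows without a decision label

def masked_cross_entropy(logits_rows, labels):
    """Mean cross-entropy over labeled rows only; unlabeled rows contribute nothing."""
    total, count = 0.0, 0
    for logits, label in zip(logits_rows, labels):
        if label == IGNORE:
            continue
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[label]
        count += 1
    return total / count if count else 0.0

# Two rows over 4 classes: the first is labeled, the second is LM-only.
loss = masked_cross_entropy([[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]], [2, IGNORE])
print(f"{loss:.4f}")
```

Because unlabeled rows are skipped rather than scored, LM-only rows and decision-labeled rows can sit in the same batch.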
Suggested training recipes
These are starting points -- tune to your data. Single 5070 Ti / RTX 4080-class GPU assumed.
50M baseline (LM only)
target_tokens: 500M-2B
tokens_per_batch: 4096
grad_accum_steps: 8
max_seq_len: 2048
length_curriculum: (512,1000)(1024,3000)(2048,10000)(4096,-1)
lr: 2e-4
warmup_steps: 500
weight_decay: 0.1
optimizer: adamw_8bit (bf16)
attn_backend: flash (FA2 if available, else PyTorch SDPA)
ffn_kind: swiglu
mtp_weight: 0.3
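One plausible reading of the length_curriculum value is a list of (sequence length, step at which the stage ends) pairs, with -1 meaning "until training finishes". That interpretation is an assumption, as are the helper names and the optional clamp to max_seq_len; the sketch below only shows how such a schedule could be evaluated.

```python
import re

def parse_curriculum(spec):
    """Parse '(512,1000)(1024,3000)' into [(512, 1000), (1024, 3000)]."""
    return [(int(a), int(b)) for a, b in re.findall(r"\((\d+),\s*(-?\d+)\)", spec)]

def seq_len_at(step, stages, max_seq_len=None):
    """Sequence length for a step; an end value of -1 is read as 'until the end'."""
    chosen = stages[-1][0]
    for seq_len, until_step in stages:
        if until_step == -1 or step < until_step:
            chosen = seq_len
            break
    # Optional clamp (assumption): never exceed the configured max_seq_len.
    return min(chosen, max_seq_len) if max_seq_len is not None else chosen

stages = parse_curriculum("(512,1000)(1024,3000)(2048,10000)(4096,-1)")
for step in (0, 1000, 3000, 10000):
    print(step, seq_len_at(step, stages), seq_len_at(step, stages, max_seq_len=2048))
```

Short sequences early keep the first steps cheap; the longer stages arrive once the loss has started to move.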
350M with MTP + decision head
target_tokens: 5B-20B
tokens_per_batch: 8192
grad_accum_steps: 16
max_seq_len: 4096
ffn_kind: ternary_swiglu (or swiglu)
mtp_weight: 0.3
decision_weight: 0.5
class_weighted_decision: true
calibration_loss_weight: 0.2 (if you want a confidence-calibrated head)
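The token budgets in these recipes translate into optimizer-step counts with simple arithmetic. The sketch assumes tokens_per_batch is the size of one micro-batch and that one optimizer step consumes tokens_per_batch × grad_accum_steps tokens; the helper name `optimizer_steps` is hypothetical.

```python
import math

def optimizer_steps(target_tokens, tokens_per_batch, grad_accum_steps):
    """Return (tokens per optimizer step, steps needed to consume target_tokens)."""
    tokens_per_step = tokens_per_batch * grad_accum_steps
    return tokens_per_step, math.ceil(target_tokens / tokens_per_step)

# 50M recipe: 4096 tokens x 8 accumulation steps, 500M to 2B target tokens.
print(optimizer_steps(500_000_000, 4096, 8), optimizer_steps(2_000_000_000, 4096, 8))
# 350M recipe: 8192 tokens x 16 accumulation steps, 5B to 20B target tokens.
print(optimizer_steps(5_000_000_000, 8192, 16), optimizer_steps(20_000_000_000, 8192, 16))
```

Under that reading the 350M recipe runs for roughly 38K to 153K optimizer steps, which is useful when sizing warmup and the length-curriculum boundaries.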
1B with sparse MoE
target_tokens: 50B-200B
ffn_kind: routed_lowrank_swiglu
ffn_rank: 128
ffn_experts: 4
ffn_top_k: 1
mixed_precision: bf16
optimizer: adamw_8bit
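To see why the routed low-rank backend offers "capacity without dense FLOPs", the following sketch counts FFN parameters per layer for the 1B dimensions. It assumes each d_model × d_ff projection is factored into a (d_model × rank) and a (rank × d_ff) matrix, that every expert holds its own three factored projections, and that the router is a single bias-free linear layer. The real routed_lowrank_swiglu implementation may differ; all function names are hypothetical.

```python
def dense_swiglu_params(d_model, d_ff):
    """Three dense projections: gate, up, down."""
    return 3 * d_model * d_ff

def lowrank_swiglu_params(d_model, d_ff, rank):
    """Each projection factored as (d_model x rank) @ (rank x d_ff)."""
    return 3 * rank * (d_model + d_ff)

def routed_lowrank_params(d_model, d_ff, rank, experts, top_k):
    """Stored vs. per-token active FFN parameters for a routed low-rank layer."""
    per_expert = lowrank_swiglu_params(d_model, d_ff, rank)
    router = d_model * experts  # assumed: one linear scoring layer, no bias
    return {
        "stored": experts * per_expert + router,
        "active_per_token": top_k * per_expert + router,
    }

dense = dense_swiglu_params(2048, 5504)
routed = routed_lowrank_params(2048, 5504, rank=128, experts=4, top_k=1)
print(dense, routed)
```

With ffn_rank=128, four experts, and top-1 routing, each token touches only a fraction of the parameters a dense SwiGLU layer of the same width would use.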
What this is NOT
- Not a pretrained model. Out-of-the-box outputs are noise. Random initialization is the entire point.
- Not finance-specific despite the legacy class name FinanceDecoder. The architecture is task-agnostic; the BPE tokenizer leans toward finance-aware merges but works on any English text.
- Not a drop-in replacement for Llama / Qwen / Mistral. The component set is different (MTP-K heads in particular need their own training term).
- Not adversarially robust. It's a substrate.
- Not a tiny / toy model. 1B params at bf16 hits 2 GB on disk; trained well, it competes seriously on focused tasks. "Compact" means efficient, not weak.
License
Apache-2.0. Use it for research, commercial work, hobby projects -- whatever. Attribution appreciated but not legally required.
Research notes
Qovaryx is part of a broader local-sovereign-AI research program. Higher-level framings, architectural rationale, and ablation studies are published progressively in the public research devlog (see the research index below).
Real training runs on this architecture -- the Cluster Shell V1 audit
This scratch-base is the trainable substrate for the Cluster Shell committee architecture described in the Qovaryx research devlog. The V1 readiness gate, run on the actual trained specialist heads, looked like this:
| Specialist | Train rows | Majority baseline | Linear baseline | GBDT baseline | Gate verdict |
|---|---|---|---|---|---|
| Q-Penny | 150K | 52.90% | 73.03% | 73.84% | PASS |
| Q-Veto | 150K | 57.57% | 72.23% | 79.93% | PASS |
| Q-Router | 150K | 24.00% | 76.54% | 84.62% | PASS |
| Q-2yr | 300K | 50.04% | 75.38% | 75.93% | PASS |
| Q-180d | 300K | 50.09% | 74.46% | 74.95% | PASS |
Five specialists, deterministic 5% holdouts. On every specialist the GBDT baseline clears the majority-class floor by at least +20pp; the linear baseline does too, except on Q-Veto (+14.66pp). The architecture clears its falsifiability gate on fresh data -- what makes that gate honest is documented in evaluation discipline and when the proxy breaks.
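The margins over the majority-class floor can be recomputed directly from the table. The snippet below copies the table's percentages and does nothing beyond subtraction; the dictionary layout is just a convenient arrangement for illustration.

```python
# (majority baseline, linear baseline, GBDT baseline) in percent, copied from the table.
gate = {
    "Q-Penny": (52.90, 73.03, 73.84),
    "Q-Veto": (57.57, 72.23, 79.93),
    "Q-Router": (24.00, 76.54, 84.62),
    "Q-2yr": (50.04, 75.38, 75.93),
    "Q-180d": (50.09, 74.46, 74.95),
}

# Margin in percentage points of each baseline over the majority-class floor.
margins = {
    name: (round(linear - majority, 2), round(gbdt - majority, 2))
    for name, (majority, linear, gbdt) in gate.items()
}
for name, (linear_pp, gbdt_pp) in margins.items():
    print(f"{name}: linear +{linear_pp:.2f}pp, GBDT +{gbdt_pp:.2f}pp")
```

With the table's figures, the GBDT margin is at least 20pp for all five specialists, while the linear margin for Q-Veto is 14.66pp.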
Current state: a V4 diagnostic experiment is queued to discriminate between two competing hypotheses on why two of the second-generation specialists stalled at the data ceiling. The live journal: Cluster Shell V4 -- Diagnostic.
Research index: https://github.com/thron-j/qovaryx-ai-research
Implementation details, training corpora, and certain ablation specifics are intentionally withheld in the public devlog. The framings are publishable; the internals are not. Collaboration inquiries: jeherizonllc@gmail.com.
Support
If this base helps you build something, support continued development on Ko-fi: https://ko-fi.com/tjarvis91
Every contribution funds GPU time and the next-generation Qovaryx training runs.
Sibling models in this lineage
- tjarvis91/qovaryx-50m-scratch-base -- 47M params, 12 layers, fits on any GPU
- tjarvis91/qovaryx-350m-scratch-base -- 352M params, 24 layers, for serious solo training
- tjarvis91/qovaryx-1b-scratch-base <- you are here
- tjarvis91/vfaix-vpa-options-trader -- a separate, trained 9B vision-language model that uses the same training disciplines on Qwen3.5-VL (not the same architecture; shown here for lineage context)
Citation
@misc{qovaryx-scratch-base-2026,
title = {Qovaryx: A Compact Decoder Architecture with Multi-Token Prediction, GQA, and Pluggable FFN Backends},
author = {Jarvis, Thomas},
year = {2026},
month = {May},
publisher = {Hugging Face},
url = {https://huggingface.co/tjarvis91/qovaryx-1b-scratch-base}
}
Status
Random-init checkpoint as of 2026-05-22. Future updates will add trained sibling repos with downstream task heads enabled (decision head + chart-patch encoder variants). Watch the org page for new releases.