Galahad-0.5B (base)

An open ~570M-parameter language model, pretrained for €600 (US$676.95). Released by Corbenic AI. This is the base (completion) model — the original pretrained weights, released as the open result of Corbenic's data + training stack.

What this is — and isn't. Galahad-0.5B is competent for its size, not a leaderboard winner. It is a small base LM: it completes text, it is not instruction-tuned, and it loses to comparable open baselines on standard benchmarks. We say so plainly. The point of Galahad is that it is cheap, open, and losslessly reusable — the substrate on which we demonstrate the Taliesin memory engine (see below). The capability we care about lives in the engine, not in these weights.

Quick facts

| Field | Value |
| --- | --- |
| Parameters | ~570M |
| Architecture | decoder-only · hidden 1024 · 30 layers · 16 heads · head_dim 64 · SwiGLU · RMSNorm · vocab 65,536 · tied embeddings |
| Extra norms (v10) | per-head q/k/v norm + gate/up norm (not in stock Llama; needs trust_remote_code=True) |
| Positional | RoPE = interleaved / GPT-J convention (see warning below) |
| Attention | sliding-window, 1024 tokens (hard window every layer) |
| Pretraining cost | €600 (US$676.95), full bill |
| License | Apache-2.0 |
| Type | base / completion (NOT instruction-tuned) |
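
As a sanity check on the ~570M figure, the count can be roughly reproduced from the architecture row. This is a back-of-the-envelope sketch, not the official breakdown: the MLP intermediate size (4096), full multi-head attention with 16 KV heads, and the absence of bias terms are all assumptions that the card does not state.

```python
# Rough parameter count from the Quick facts table.
# ASSUMPTIONS (not stated in the card): MLP intermediate size 4096,
# full multi-head attention (16 KV heads), no bias terms.
hidden = 1024
layers = 30
heads = 16
head_dim = 64
vocab = 65_536
intermediate = 4096  # assumed, chosen because it lands near ~570M

embed = vocab * hidden                  # tied embeddings: counted once
attn = 4 * hidden * heads * head_dim    # q, k, v and o projections
mlp = 3 * hidden * intermediate         # SwiGLU: gate, up and down projections
total = embed + layers * (attn + mlp)   # norm weights are tiny and omitted

print(f"{total / 1e6:.1f}M parameters")  # prints "570.4M parameters"
```

Under these assumptions the embedding table accounts for about 67M of the total, and the 30 blocks for the rest.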

⚠️ RoPE convention (read before porting)

Galahad uses the interleaved (GPT-J) RoPE convention: rotate_half operates on adjacent pairs (x[2i], x[2i+1]) rather than on the two halves of the head dimension, and cos/sin are expanded with repeat_interleave(2). The included modeling_galahad.py already does this. If you convert to llama.cpp / GGUF, use LLAMA_ROPE_TYPE_NORM (mode 0), not NEOX. The wrong convention passes short prompts and silently collapses on long ones.
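
To make the difference concrete, here is a minimal pure-Python sketch of the two pairings. It is an illustration, not the code in modeling_galahad.py; the function names and the frequency base of 10000 are assumptions. Both variants are the identity at position 0 and disagree at every later position, because they rotate different pairs of coordinates.

```python
import math

def rope_angles(pos, head_dim, base=10000.0):
    # One rotation angle per 2-D pair. base=10000 is an assumed default,
    # not a value taken from the model card.
    return [pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def rope_interleaved(x, pos, base=10000.0):
    # GPT-J style: pair i is (x[2i], x[2i+1]).
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out

def rope_half_split(x, pos, base=10000.0):
    # NeoX style: pair i is (x[i], x[i + head_dim/2]).
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        a, b = x[i], x[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
same_at_0 = rope_interleaved(x, 0) == rope_half_split(x, 0)
max_diff_at_5 = max(abs(a - b) for a, b in
                    zip(rope_interleaved(x, 5), rope_half_split(x, 5)))
print(same_at_0, max_diff_at_5 > 0.1)  # prints "True True"
```

A port that loads cleanly and produces plausible text on a short prompt is therefore not evidence that the convention is right; compare logits against the reference implementation on a long input.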

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "Corbenic/Galahad-0.5B-base"

# trust_remote_code=True is required: the architecture is defined by the
# repo's own modeling_galahad.py, not by stock Llama code.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16).eval()

# Base model: give it text to continue, not an instruction.
ids = tok("The history of the printing press began", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

This is a base model: it continues text and does not follow instructions. Note the 1024-token sliding window: on inputs longer than 1024 tokens, each position attends only to the most recent 1024 tokens.
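
The window can be pictured as a banded causal mask. The sketch below is illustrative only (the helper name is hypothetical, and it assumes the window counts the current token); it uses a window of 3 so the pattern is easy to read, where Galahad uses 1024.

```python
def sliding_window_mask(seq_len, window):
    # allowed[i][j] is True when query position i may attend to key position j:
    # causal (j <= i) and within the last `window` positions, counting i itself.
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=5, window=3)
visible = [[j for j, ok in enumerate(row) if ok] for row in mask]
print(visible)  # prints [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

From position 3 onward, position 0 is no longer directly visible, which is the behavior to expect past token 1024 in the real model.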

Benchmarks (honest)

Measured on the released interleaved weights:

| Benchmark | Galahad-0.5B base |
| --- | --- |
| enwik8 (BPB / token-PPL) | 1.10 / 7.33 |
| text8 (BPB) | 1.276 |
| LAMBADA (acc / acc-norm) | 13.96% / 29.49% |
| BLiMP (macro) | 0.722 |

It loses to same-class baselines (e.g. Transformer-XL, Pythia). That is expected and fine — the value proposition is cheap + open + losslessly reusable, not best-in-class accuracy.
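
For readers comparing against byte-level baselines, bits-per-byte and token perplexity are linked through the tokenizer's compression rate. The sketch below shows the conversion and the rate implied by the enwik8 row. It assumes both enwik8 figures come from the same evaluation run; the resulting bytes-per-token value is derived here, not reported by Corbenic.

```python
import math

def bits_per_byte(token_ppl, bytes_per_token):
    # Bits per token is log2(perplexity); dividing by bytes per token gives BPB.
    return math.log2(token_ppl) / bytes_per_token

def implied_bytes_per_token(token_ppl, bpb):
    # Invert the relation above to recover the tokenizer's compression rate.
    return math.log2(token_ppl) / bpb

ratio = implied_bytes_per_token(token_ppl=7.33, bpb=1.10)
print(round(ratio, 2))  # prints 2.61
```

A token perplexity is only comparable between models that share a tokenizer; BPB is the figure to use across tokenizers.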

Training integrity note

During development we found and fixed a RoPE-convention bug ourselves: the forward pass had been ported with the half-split convention, while the model was trained with the interleaved one. On a held-out check, the fix moved enwik8 from 2.54 to 1.10 BPB and BLiMP from 0.587 to 0.722. We mention it because it shows the numbers above come from the corrected forward pass, not a lucky configuration.

Training data

Galahad's pretraining corpus was fully deduplicated with Merlin, Corbenic's byte-exact deduplication engine. Removing duplicated training data is an established way to improve a model's quality-per-token and reduce redundant training compute (Lee et al., 2021, Deduplicating Training Data Makes Language Models Better). We state this as a method fact — we do not claim a head-to-head win over models trained on non-deduplicated data, as no controlled comparison exists.
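
For readers unfamiliar with the term, byte-exact deduplication means that two documents count as duplicates only if their bytes are identical. A minimal hash-based sketch of the idea follows; it is a generic illustration and not Merlin's algorithm, and the function name is hypothetical.

```python
import hashlib

def dedup_exact(docs):
    # Keep the first occurrence of each distinct byte string, preserving order.
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [b"the cat sat", b"The cat sat", b"the cat sat", b"on the mat"]
unique = dedup_exact(corpus)
print(len(corpus), "->", len(unique))  # prints "4 -> 3"
```

Note that "The cat sat" survives: exact matching does not fold case or whitespace, which is what separates it from near-duplicate methods.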

Taliesin — the memory engine (NOT in this repo)

Galahad exists to demonstrate Taliesin, Corbenic's external memory engine. A model's internal context state can be saved and restored so the same context is not recomputed every time.

Be precise about what is what:

  • The foundation, a byte-exact and reproducible KV state, is a property of a deterministic engine and is verifiable with public tooling (standard llama.cpp llama_state_seq_save_file / llama_state_seq_load_file under GGML_DETERMINISTIC=1). We prove it with public tools on purpose, so you can check the foundation without any of our software. Example receipt: a KV state written to disk by one process and reloaded by a separate fresh process produces logits byte-identical (SHA-256) to a from-scratch computation. A volatile prompt cache (vLLM / hosted prompt caches) cannot survive a process death; a state file on disk does.
  • Taliesin is the memory system built on that foundation — and it does what a whole-sequence snapshot cannot: content-addressed cross-context grafting (splice a stored span into a different context/position), composition of independent spans, deduplication (see Merlin), one engine across vendors, and tiered storage — with the resulting speedup. That is the proprietary part.
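
The byte-identical check described in the first bullet comes down to hashing the raw logits from both runs and comparing digests. The sketch below shows that comparison only. The serialization (little-endian float32) and the helper name are assumptions, not the published receipt format, and the logits are hypothetical stand-ins for what the two processes would produce.

```python
import hashlib
import struct

def logits_digest(logits):
    # Serialize as little-endian float32 so the digest does not depend on the host.
    payload = struct.pack(f"<{len(logits)}f", *logits)
    return hashlib.sha256(payload).hexdigest()

# Hypothetical stand-ins for logits from (a) a from-scratch run and
# (b) a fresh process that reloaded the saved KV state.
from_scratch = [0.125, -1.5, 3.75, 0.0]
restored = [0.125, -1.5, 3.75, 0.0]
perturbed = [0.125, -1.5, 3.75 + 2**-20, 0.0]  # differs by a few float32 ulps

print(logits_digest(from_scratch) == logits_digest(restored))   # prints True
print(logits_digest(from_scratch) == logits_digest(perturbed))  # prints False
```

A hash comparison is deliberately unforgiving: any deviation in the restored state that changes even one bit of the logits changes the digest, so there is no tolerance to tune.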

Taliesin is not distributed in any form. Its central property, exact and verifiable losslessness, can be checked from the published receipts without any Corbenic software. The engine itself stays closed.

Receipts

Published SHA-256 receipts for the byte-exact / cross-vendor / persistence results live in the launch evidence dataset: https://huggingface.co/datasets/Corbenic/taliesin-receipts. (Cross-vendor byte-exact reuse was verified on Llama-3.1-8B, Qwen2.5-7B and Mistral-7B; disk-roundtrip persistence on open weights. The bit-exact property is a kernel-determinism property and is model-agnostic.)

Related work

  • Merlin — Corbenic's byte-exact deduplication / lossless-inference engine, published on arXiv: arXiv:2605.09990 (Schelpe, 2026). Companion empirical analysis: arXiv:2605.09611.

Citation

@misc{corbenic2026galahad,
  title  = {Galahad-0.5B: an open, low-cost language model},
  author = {Corbenic AI},
  year   = {2026},
  note   = {Apache-2.0. https://corbenic.ai}
}

Contact & license

Galahad-0.5B is released under Apache-2.0. Contact: sietse@corbenic.ai · https://corbenic.ai


We do not claim Galahad-0.5B outperforms larger models. It does not. The narrow, verifiable claim is lossless, byte-exact context reuse — demonstrated across multiple vendors' models and on this €600 one.
