Aletheia 1.0 — the decision-memory gate

Version note: This is Aletheia 1.0, the first Cortex gate — a 1.5-billion-parameter Qwen2.5 decoder. It is superseded by Aletheia 1.5, a fine-tuned DeBERTa-v3-large encoder that matches this model's precision at ~⅓ the RAM. (1.0 was previously published as aletheia-1.5b, named for its 1.5B parameters — renamed to a clean version number to avoid colliding with the 1.5 release.)

ἀλήθεια — "un-forgetting". Aletheia decides what is worth remembering: given a short candidate utterance from a coding session (a commit subject or a turn from an AI-coding conversation), it answers one question — is this a real, substantive engineering decision, or is it noise? It is the write-gate of Memtrace's Cortex decision memory: only what Aletheia confidently judges a decision is kept.

  • Task: binary sequence classification — decision vs noise.
  • Base: Qwen/Qwen2.5-1.5B-Instruct, LoRA-fine-tuned as a sequence classifier.
  • Format: INT8-quantized ONNX (~1.4 GB), runs on-device via ONNX Runtime — no GPU, no network, no per-call cost.
  • Output: a single logit → P(decision) = sigmoid(logit / T) with calibration temperature T = 0.698.
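As a worked illustration of the output formula above, the following minimal sketch maps a raw logit to P(decision) and inverts the mapping to find the logit that corresponds to a chosen probability. The function names are illustrative and not part of the released package; only T = 0.698 comes from this card.

```python
import math

T = 0.698  # calibration temperature quoted in this card

def prob_from_logit(logit: float, temperature: float = T) -> float:
    """P(decision) = sigmoid(logit / T), computed in an overflow-safe form."""
    z = logit / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit_from_prob(p: float, temperature: float = T) -> float:
    """Inverse mapping: the raw logit at which P(decision) equals p (0 < p < 1)."""
    return temperature * math.log(p / (1.0 - p))

print(prob_from_logit(0.0))             # 0.5: a zero logit is the coin-flip point
print(round(logit_from_prob(0.85), 3))  # raw logit needed to reach P = 0.85
```

Because T is below 1, dividing by T pushes probabilities further from 0.5 than a plain sigmoid would for the same logit.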

Why a small local model

Decisions must be gated continuously, on every commit and every agent turn. That rules out a cloud LLM (cost, latency, and your code would leave the machine) and rules out heuristics (they cap at ~78% precision and can't read intent). A ~1.4 GB on-device classifier is the only option here that is private, free per call, offline, always-on, and accurate at once.

Results

Evaluated on held-out, leakage-guarded test sets, apples-to-apples against the prior baseline.

| Test set | Metric | Aletheia 1.0 |
|---|---|---|
| Conversational (in-register, held-out, n=1,589) | ROC-AUC | 0.933 |
| Conversational | Accuracy | 0.850 |
| Cross-register benchmark (hand-labeled, n=195) | ROC-AUC | 0.844 |

Precision is a dial: at a balanced threshold 80% of what it stores is a genuine decision; in "clean mode" (P ≥ 0.85) that rises to **90–92%**, at the cost of storing fewer. The model emits calibrated probabilities (temperature-scaled), so the threshold means what it says.
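The "dial" described above can be sketched as a small threshold gate. This is a minimal illustration, not the shipped logic: the `Mode` enum and `should_store` are hypothetical names, the 0.85 cutoff is the "clean mode" value from this card, and the 0.5 balanced cutoff is an assumed placeholder because the card does not state the balanced threshold (the real defaults are said to live in cortex_serving.json).

```python
from enum import Enum

class Mode(Enum):
    BALANCED = "balanced"
    CLEAN = "clean"

# CLEAN: P >= 0.85 is the "clean mode" cutoff quoted in this card.
# BALANCED: 0.5 is an ASSUMPTION for illustration; the card gives no value.
THRESHOLDS = {Mode.BALANCED: 0.5, Mode.CLEAN: 0.85}

def should_store(p_decision: float, mode: Mode = Mode.BALANCED) -> bool:
    """Keep a candidate only if its calibrated probability clears the mode's cutoff."""
    return p_decision >= THRESHOLDS[mode]

# A borderline candidate is kept in balanced mode but dropped in clean mode.
print(should_store(0.70, Mode.BALANCED), should_store(0.70, Mode.CLEAN))
```

Raising the cutoff trades recall for precision: fewer candidates are stored, but a larger share of them are genuine decisions.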

Intended use

The decision write-gate / proposer for a code-decision memory system. It is register-robust: trained on both git-commit subjects and conversational turns, so the same model scores both streams. Downstream, a deterministic check (a code edit binding to the turn, or a human) promotes a proposed decision to a durable fact.
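The propose-then-promote flow described above could look roughly like the sketch below. All names here (`Proposal`, `propose`, `promote`, and the two boolean flags) are hypothetical and chosen for illustration; the card only states that a deterministic check, either a code edit binding to the turn or a human, promotes a proposal to a durable fact.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Proposal:
    text: str
    p_decision: float
    bound_to_code_edit: bool = False  # hypothetical: a code edit was linked to this turn
    human_confirmed: bool = False     # hypothetical: a person approved the proposal

def propose(text: str, p_decision: float, threshold: float = 0.85) -> Optional[Proposal]:
    """Gate step: emit a proposal only when the model's probability clears the threshold."""
    if p_decision >= threshold:
        return Proposal(text=text, p_decision=p_decision)
    return None

def promote(proposal: Proposal) -> bool:
    """Deterministic step: a proposal becomes durable only with independent confirmation."""
    return proposal.bound_to_code_edit or proposal.human_confirmed

candidate = propose("Switch auth to JWT instead of sessions", 0.91)
if candidate is not None:
    candidate.bound_to_code_edit = True  # e.g. the matching edit landed in the repo
    print(promote(candidate))            # True: proposed by the model, confirmed by the edit
```

Keeping the two steps separate means a confident but wrong score never becomes a stored fact on its own.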

Out of scope: it is not a retrieval/search model, not a code generator, and not a general chat classifier. It judges decision-worthiness, nothing else.

How to use

ONNX Runtime (the shipping path — Python)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("memtrace/aletheia-1.0")
sess = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])
T = 0.698  # calibration temperature

def p_decision(text):
    """Return P(decision) for one candidate utterance."""
    e = tok(text, truncation=True, max_length=64, return_tensors="np")
    # Cast both inputs to int64 before feeding the graph; the output is a single logit.
    logit = sess.run(None, {"input_ids": e["input_ids"].astype(np.int64),
                            "attention_mask": e["attention_mask"].astype(np.int64)})[0].reshape(-1)[0]
    return float(1 / (1 + np.exp(-logit / T)))

p_decision("Switch auth to JWT instead of sessions")        # ~0.91 → decision
p_decision("let me check the file rather than re-reading")  # ~0.06 → noise

Optimum (transformers-compatible)

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
m = ORTModelForSequenceClassification.from_pretrained("memtrace/aletheia-1.0", file_name="model_int8.onnx")
tok = AutoTokenizer.from_pretrained("memtrace/aletheia-1.0")
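The Optimum snippet above only loads the model. A minimal sketch of scoring one utterance with it follows, assuming the loaded model accepts NumPy inputs and returns an object with a `.logits` field holding a single logit, as in the ONNX Runtime path. The helper name `score_text` is illustrative, and the final call is left commented out because it needs the downloaded model.

```python
import numpy as np

T = 0.698  # calibration temperature quoted in this card

def score_text(model, tok, text, temperature=T, max_length=64):
    """Tokenize one utterance, run the classifier, and apply sigmoid(logit / T)."""
    enc = tok(text, truncation=True, max_length=max_length, return_tensors="np")
    out = model(**enc)
    # Assumption: out.logits holds one logit for the single input sequence.
    logit = float(np.asarray(out.logits, dtype=np.float64).reshape(-1)[0])
    return 1.0 / (1.0 + np.exp(-logit / temperature))

# With `m` and `tok` loaded as in the snippet above:
# score_text(m, tok, "Switch auth to JWT instead of sessions")
```

The helper scores one text at a time on purpose, mirroring the ONNX Runtime example and avoiding any assumption about padding behavior in batches.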

Rust (ort, in-product)

Memtrace loads model_int8.onnx + tokenizer.json via the ort crate; see cortex_serving.json for the temperature and default thresholds. Input names: input_ids, attention_mask; output: logits.

Training

  • Data: 24,895 multi-judge-labeled examples — 14,305 git-commit subjects + 10,590 turns mined from real AI-coding sessions. Labels are LLM multi-judge consensus (2 judges for commits; 3 diverse-lens judges + majority for conversation), ~95% inter-judge agreement; CROWDLAB confirmed the consensus labels were already near-optimal.
  • Commit sources (license-clean): CommitPackFT (MIT, 74 languages), CommitChronicle, and tangled-ccs. CommitBench (CC-BY-NC) was excluded so the shipped model is commercial-clean.
  • Recipe: LoRA sequence-classifier with soft-label training (vote-fraction BCEWithLogits, so judge disagreement is modeled rather than forced to 0/1), rsLoRA, LoRA+, MLP target modules, best-checkpoint-by-AUC, post-hoc temperature scaling. The decisive lever was soft labels: the ceiling was label noise, not data or model size.
  • Compute: trained locally on Apple Silicon (no rented GPU).
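The soft-label idea in the recipe above can be illustrated with a short NumPy sketch. It is not the training code (which this card does not show): it computes the vote fraction per example and evaluates binary cross-entropy on raw logits using the standard numerically stable formula, so a 2-of-3 judge split becomes a target of about 0.67 rather than a forced 1. The function names and the example votes are made up for illustration.

```python
import numpy as np

def vote_fraction(votes):
    """Soft target: share of judges voting 'decision' (1) rather than 'noise' (0)."""
    return float(np.mean(votes))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits with soft targets in [0, 1].

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    return float(np.mean(np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))))

# Three hypothetical examples: unanimous decision, 2-of-3 split, unanimous noise.
targets = [vote_fraction(v) for v in ([1, 1, 1], [1, 1, 0], [0, 0, 0])]
loss = bce_with_logits([2.0, 0.5, -3.0], targets)
print(targets, loss)
```

With a soft target y, the per-example loss is minimized when sigmoid(logit) equals y, so disputed examples pull the model toward an intermediate probability instead of an overconfident one.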

Limitations & honest notes

  • Label-noise ceiling: you cannot score above the noise in the test labels themselves (~0.88–0.90 practical max on the cross-register set). The cross-register number (0.844) is lower than the in-register one (0.933) partly because that benchmark's hand labels are a different, noisier standard.
  • Footprint: ~1.4 GB on disk, ~2.8–3.5 GB resident (ONNX Runtime dequantizes the INT8 weights to fp32 on CPU). This is the motivation for Aletheia 1.5.
  • English-centric conversational phrasing; commit data spans 74 languages but conversational decision-detection is English-tuned.
  • It only proposes; it should be paired with a deterministic confirmation/promotion step.

Version history

| Version | Base | RAM | Notes |
|---|---|---|---|
| 1.0 | Qwen2.5-1.5B (decoder, LoRA) | ~3 GB | first gate (this model) |
| 1.5 | DeBERTa-v3-large (encoder, full-FT) | ~1.2 GB | same precision, ⅓ the RAM |

License

Apache-2.0 (inherited from the Qwen2.5-1.5B-Instruct base; the classifier head and weights are released under the same license). Training data is license-clean for commercial use.

Citation

@software{aletheia2026,
  title  = {Aletheia: an on-device decision-memory gate for code},
  author = {Syncable / Memtrace},
  year   = {2026},
  url    = {https://huggingface.co/memtrace/aletheia-1.0}
}