Aletheia 1.5 — the decision-memory gate, now an encoder
ἀλήθεια — "un-forgetting". Aletheia decides what is worth remembering: given a short candidate utterance from a coding session (a commit subject, or a turn from an AI-coding conversation), it answers one question — is this a real, substantive engineering decision, or is it noise? It is the write-gate of Memtrace's Cortex decision memory: only what Aletheia confidently judges a decision is kept.
Aletheia 1.5 swaps the decoder for an encoder. Where Aletheia 1.0 was a 1.5B Qwen2.5 decoder, 1.5 is a fine-tuned DeBERTa-v3-large — the same precision at a fraction of the footprint. For a 64-token binary classifier, the encoder is simply the right tool.
- Task: binary sequence classification, `decision` vs `noise`.
- Base: `microsoft/deberta-v3-large` (435M), fully fine-tuned as a sequence classifier.
- Format: INT8-quantized ONNX (642 MB on disk, ~1.2 GB resident), runs on-device via ONNX Runtime: no GPU, no network, no per-call cost.
- Output: a single logit → `P(decision) = sigmoid(logit / T)` with calibration temperature `T = 0.784`, default keep threshold `0.61`.
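The output rule above can be worked through by hand. The sketch below is a minimal illustration using only the standard library; the function names are made up for this example and are not part of the shipped gate. It also solves for the raw logit at which the probability crosses the default threshold, which comes out to roughly 0.35.

```python
import math

T = 0.784          # calibration temperature from this card
THRESHOLD = 0.61   # default keep threshold from this card

def p_decision_from_logit(logit: float, temperature: float = T) -> float:
    """Temperature-scaled sigmoid: P(decision) = sigmoid(logit / T)."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def keep(logit: float, threshold: float = THRESHOLD) -> bool:
    """True when the calibrated probability reaches the keep threshold."""
    return p_decision_from_logit(logit) >= threshold

# Raw logit at which P(decision) equals the threshold: T * ln(p / (1 - p)).
boundary_logit = T * math.log(THRESHOLD / (1.0 - THRESHOLD))
print(round(boundary_logit, 3))
```

Because `T < 1`, dividing by it stretches the logits, so the calibrated probabilities sit further from 0.5 than the raw sigmoid would put them.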
Why an encoder (and not a bigger, newer decoder)
We tested the obvious upgrades, newer and smaller decoders, and the encoder won on the axis that matters for a gate: precision per byte of RAM. For short-text binary classification, a fine-tuned encoder can learn a cleaner decision boundary than a decoder many times its size; results reported in the literature have a fine-tuned ~0.4B DeBERTa-v3-large beating fine-tuned 7B–13B decoders on binary tasks. On our data it matched the 1.5B decoder at about ⅓ the resident RAM, with trivial INT8/ONNX export and an MIT license. (Notably, the headline features of the newest small models, such as multimodality, gated-delta architectures, and "thinking" modes, are anti-features here: dead weight and export friction for a 64-token classifier.)
Results
Evaluated on held-out, leakage-guarded test sets — the same splits and protocol as Aletheia 1.0, apples-to-apples. The full sweep that selected the model:
| Model | params | commit-register AUC (n=195) | conversational AUC (n=1,589) | resident RAM |
|---|---|---|---|---|
| Aletheia 1.0 — Qwen2.5-1.5B | 1.5B | 0.844 | 0.933 | ~2.8–3.5 GB |
| Qwen2.5-0.5B (collapsed — rejected) | 0.5B | 0.695 | 0.861 | ~1.5 GB |
| Qwen3-0.6B (stable, near-parity) | 596M | 0.808 | 0.922 | ~0.6 GB |
| DeBERTa-v3-large · LoRA | 435M | 0.836 | 0.919 | ~0.5 GB |
| Aletheia 1.5 — DeBERTa-v3-large · full-FT | 435M | 0.832 | 0.933 | ~1.2 GB |
Conversational AUC is exact parity with the 1.5B (0.933); commit AUC is within test noise
(0.832 vs 0.844 on a 195-example set). Precision is a dial: at the default threshold 80% of
what it stores is a genuine decision; in "clean mode" it rises to **88% precision** on the
validation set. The probabilities are temperature-scaled, so the threshold means what it says.
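To make "precision is a dial" concrete, here is a small sketch of how precision changes as the keep threshold moves. The scores and labels are made-up placeholders, not model output, and the clean-mode threshold is not stated in this card, so the higher thresholds below are arbitrary.

```python
def precision_at(threshold, probs, labels):
    """Fraction of kept items (p >= threshold) whose label is 1 (a real decision)."""
    kept = [y for p, y in zip(probs, labels) if p >= threshold]
    if not kept:
        return None  # nothing kept, so precision is undefined
    return sum(kept) / len(kept)

# Made-up scores and labels, for illustration only.
probs  = [0.95, 0.90, 0.80, 0.70, 0.65, 0.62, 0.40, 0.20]
labels = [1,    1,    1,    0,    1,    0,    1,    0]

for t in (0.61, 0.75, 0.85):
    print(t, precision_at(t, probs, labels))
```

Raising the threshold keeps fewer items and usually raises precision, at the cost of dropping some genuine decisions (the item scored 0.40 above is a real decision that no threshold in this range would keep).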
What it keeps vs. drops
| Kept: P(decision) high | Dropped: P(decision) low |
|---|---|
| "Use Postgres instead of MongoDB for the event store" — 0.95 | "thanks that looks good to me" — 0.05 |
| "Determinism is structural: frozen FNV-1a hash, fixed EMBED_DIM=256" — 0.94 | "needs to be more space between nodes" — 0.08 |
| "Drop ArcadeDB and migrate the graph to MemDB" — 0.95 | "it has to be fluently and not something I discover" — 0.08 |
It correctly rejects agent narration, chit-chat, and context-free fragments while keeping self-contained engineering decisions.
Intended use
The decision write-gate / proposer for a code-decision memory system. It is register-robust — trained on both git-commit subjects and conversational turns, so one model scores both streams. Downstream, a deterministic check (a code edit binding to the turn, or a human) promotes a proposed decision to a durable fact.
Out of scope: not a retrieval/search model, not a code generator, not a general chat classifier. It judges decision-worthiness, nothing else.
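The propose-then-promote split described above might be sketched as follows. Everything here is hypothetical scaffolding: the `Proposal` class, `propose`, and `promote` are names invented for this example, and the `confirmed` flag merely stands in for whatever deterministic check (a bound code edit, or a human) the real system uses. The two scores are taken from the kept/dropped table.

```python
from dataclasses import dataclass

THRESHOLD = 0.61  # default keep threshold from this card

@dataclass
class Proposal:
    text: str
    p_decision: float
    promoted: bool = False

def propose(scored, threshold=THRESHOLD):
    """Keep only candidates the gate scores at or above the threshold."""
    return [Proposal(text, p) for text, p in scored if p >= threshold]

def promote(proposal, confirmed):
    """Hypothetical promotion step: `confirmed` stands in for the deterministic check."""
    proposal.promoted = bool(confirmed)
    return proposal

scored = [
    ("Use Postgres instead of MongoDB for the event store", 0.95),
    ("thanks that looks good to me", 0.05),
]
proposals = propose(scored)
```

The point of the split is that the model's score alone never creates a durable fact; it only nominates candidates for the confirmation step.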
How to use
ONNX Runtime (the shipping path — Python)
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("memtrace/aletheia-1.5")
sess = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])
T = 0.784  # calibration temperature

def p_decision(text):
    # Tokenize to at most 64 tokens and return NumPy arrays for ONNX Runtime.
    e = tok(text, truncation=True, max_length=64, return_tensors="np")
    logit = sess.run(None, {
        "input_ids": e["input_ids"].astype(np.int64),
        "attention_mask": e["attention_mask"].astype(np.int64),
    })[0].reshape(-1)[0]
    # Temperature-scaled sigmoid turns the single logit into P(decision).
    return 1 / (1 + np.exp(-logit / T))

p_decision("Switch auth to JWT instead of sessions")        # ~0.95 → decision
p_decision("let me check the file rather than re-reading")  # ~0.05 → noise
```
Optimum (transformers-compatible)
```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Load the INT8 ONNX graph and its tokenizer from the same repo.
m = ORTModelForSequenceClassification.from_pretrained("memtrace/aletheia-1.5", file_name="model_int8.onnx")
tok = AutoTokenizer.from_pretrained("memtrace/aletheia-1.5")
```
Rust (ort, in-product)
Memtrace's Cortex gate loads `model_int8.onnx` + `tokenizer.json` via the `ort` crate and reads `cortex_serving.json` for the temperature and default threshold. Input names: `input_ids`, `attention_mask`; output: `logits`. The gate auto-downloads this repo on first run when no local model is present.
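For Python callers who want to read the same serving parameters rather than hard-code them, a minimal sketch follows. The schema of `cortex_serving.json` is not documented here, so the key names `"temperature"` and `"threshold"` are assumptions, as is the helper name `load_serving_config`; check the actual file before relying on them.

```python
import json
from pathlib import Path

# Values documented in this card, used when the file or a key is missing.
DEFAULTS = {"temperature": 0.784, "threshold": 0.61}

def load_serving_config(path):
    """Read serving parameters, falling back to the documented defaults.

    The key names "temperature" and "threshold" are assumptions, not a
    confirmed schema for cortex_serving.json.
    """
    cfg = dict(DEFAULTS)
    p = Path(path)
    if p.is_file():
        data = json.loads(p.read_text(encoding="utf-8"))
        cfg.update({k: float(data[k]) for k in DEFAULTS if k in data})
    return cfg
```

Falling back to the card's published values keeps scoring consistent with the documented behavior when the file is absent.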
Training
- Data: 24,895 multi-judge-labeled examples — 14,305 git-commit subjects + 10,590 turns mined from real AI-coding sessions. Labels are LLM multi-judge consensus (2 judges for commits; 3 diverse-lens judges + majority for conversation), ~95% inter-judge agreement.
- Commit sources (license-clean): CommitPackFT (MIT, 74 languages), CommitChronicle, tangled-ccs. CommitBench (CC-BY-NC) was excluded so the shipped model is commercial-clean.
- Recipe: full fine-tune of DeBERTa-v3-large as a sequence classifier with soft-label training (vote-fraction `BCEWithLogits`, so judge disagreement is modeled rather than forced to 0/1), gradient checkpointing, best-checkpoint-by-AUC, post-hoc temperature scaling. Unlike the 1.0 decoder (LoRA), the encoder is small enough to fully fine-tune.
- Compute: trained locally on Apple Silicon (no rented GPU).
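The two less common pieces of the recipe, vote-fraction soft labels and post-hoc temperature scaling, can be illustrated in plain Python. This is a sketch under stated assumptions rather than the training code: the loss uses the standard numerically stable form of binary cross-entropy on logits, and the temperature fit uses a simple grid search as a stand-in for whatever optimizer was actually used. Function names are invented for this example.

```python
import math

def soft_bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy on a raw logit with a soft target in [0, 1].

    Stable form: max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Three judges, two vote "decision": the soft target is 2/3 rather than a hard 1.
votes = [1, 1, 0]
target = sum(votes) / len(votes)

def fit_temperature(logits, targets, grid=None):
    """Pick the T that minimizes validation loss of sigmoid(logit / T).

    Grid search over 0.50..2.00 is an assumption made for this sketch.
    """
    grid = grid or [0.5 + 0.01 * i for i in range(151)]
    def nll(t):
        return sum(soft_bce_with_logits(z / t, y) for z, y in zip(logits, targets))
    return min(grid, key=nll)
```

With a soft target of 2/3 the loss is minimized when the predicted probability is itself 2/3, so the model is rewarded for expressing the judges' uncertainty instead of being pushed toward 0 or 1.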
Limitations & honest notes
- RAM is embedding-bound, not param-bound. DeBERTa-v3's 128k-token vocabulary is a 131M-param embedding table that ONNX Runtime dequantizes to fp32 at load (524 MB); that, not the transformer, dominates the ~1.2 GB resident. A mixed int8-matmul + fp16-embedding build could reach ~700 MB (future work).
- Commit register (0.832) trails the conversational register (0.933); the 195-example cross-register benchmark is small and its hand labels are a noisier standard.
- English-centric conversational phrasing (commit data spans 74 languages).
- It only proposes — pair it with a deterministic confirmation/promotion step.
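The embedding figures in the first limitation can be checked with a few lines of arithmetic. The vocabulary size is the rounded 128k quoted above, and 1024 is the hidden size of DeBERTa-v3-large; the fp16 line shows only the size of the embedding table at half precision and does not by itself reproduce the ~700 MB estimate.

```python
VOCAB = 128_000   # "128k-token vocabulary" (rounded, as quoted in this card)
HIDDEN = 1024     # hidden size of DeBERTa-v3-large

params = VOCAB * HIDDEN        # embedding-table parameters, about 131M
fp32_mb = params * 4 / 1e6     # 4 bytes per parameter, about 524 MB
fp16_mb = params * 2 / 1e6     # 2 bytes per parameter, about 262 MB

print(f"{params:,} params, fp32 {fp32_mb:.0f} MB, fp16 {fp16_mb:.0f} MB")
```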
Version history
| Version | Base | RAM | Notes |
|---|---|---|---|
| 1.0 | Qwen2.5-1.5B (decoder, LoRA) | ~3 GB | first gate |
| 1.5 | DeBERTa-v3-large (encoder, full-FT) | ~1.2 GB | same precision, ⅓ the RAM |
License
MIT (inherits from microsoft/deberta-v3-large). Training data is license-clean for commercial use.
Citation
```bibtex
@software{aletheia2026,
  title  = {Aletheia: an on-device decision-memory gate for code},
  author = {Syncable / Memtrace},
  year   = {2026},
  url    = {https://huggingface.co/memtrace/aletheia-1.5}
}
```
Evaluation results
- ROC-AUC (in-register, held-out) on Cortex decisions, conversational held-out (n=1,589): 0.933 (self-reported)
- Accuracy on Cortex decisions, conversational held-out (n=1,589): 0.860 (self-reported)
- ROC-AUC on Cross-register benchmark (n=195): 0.832 (self-reported)