TheoBERT Search

theo-bert-search is a domain-specialized encoder for biblical and theological retrieval. It is the contrastive-mid-trained variant of theo-bert-base — same architecture, same lineage, then sharpened on hard-negative triplets so the mean-pooled sentence vector is suitable for cosine search, ranking, clustering, and RAG over biblical and theology-adjacent corpora.

This is the right checkpoint if you want to build a search engine, a retrieval-augmented generation pipeline, or semantic-similarity tooling over biblical-domain text. For masked language modeling or token-level scoring, use the sibling release theo-bert-base instead.

What This Model Is For

Recommended use cases:

Semantic search over biblical, doctrinal, or theology-adjacent corpora
Retrieval-augmented generation (RAG) with theological document corpora
Embedding generation for ranking, reranking, clustering, and nearest-neighbor workflows
Sentence-level similarity scoring on theology text

Use the encoder body, mean-pool over real (non-padding) tokens, and L2-normalize the result for cosine similarity. The MLM head from the earlier pretraining stages is still present in the weights but should not be used: contrastive mid-training reshapes the body's geometry for sentence-level similarity, which moves token-level hidden states out of distribution for the original head. AutoModelForMaskedLM.from_pretrained(...) will load successfully but produce degraded fill-mask predictions.
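The pooling recipe described above can be illustrated without any model at all. The following is a minimal, dependency-free sketch on toy 2-dimensional vectors; the helper name `mean_pool_and_normalize` and the toy values are illustrative assumptions, not part of the release (real hidden states are 768-dimensional tensors).

```python
import math

def mean_pool_and_normalize(hidden, attention_mask):
    """Average token vectors where mask == 1, then scale to unit length."""
    dim = len(hidden[0])
    real = [vec for vec, keep in zip(hidden, attention_mask) if keep]
    count = max(len(real), 1)  # guard against an all-padding row
    pooled = [sum(vec[d] for vec in real) / count for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]

# Two real tokens followed by one padding token (toy values).
hidden = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]
vector = mean_pool_and_normalize(hidden, mask)
print(vector)  # the padding vector does not influence the result
```

Because the padding position is masked out, the pooled vector depends only on the two real tokens. Skipping the mask would let padding states pull every short input toward the same direction.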

Training Pipeline

This release is the output of a three-stage pipeline. The first two stages are shared with theo-bert-base; this card describes the additional stage that produces the search-shaped artifact.

Stage 1: MLM pretraining (encoder)

24 epochs of masked language modeling from scratch at 256-token context on 270,000 biblical and theological sequences. Final train loss 1.0679, train accuracy 76.42%.

Stage 2: Whole-word-masking continued pretraining (mlmcontinued)

25 additional epochs with 18% whole-word masking. Final train loss 0.8958, train accuracy 79.66%. This is the checkpoint shipped as theo-bert-base. It achieves 94.7% on a 546-case domain MLM benchmark.

Stage 3: Contrastive mid-training (contrastive_midtraining), this release

21,319 hand-curated (query, positive, negative) triplets drawn from the bible-domain corpus
InfoNCE loss with explicit hard negatives plus symmetric reverse direction
In-batch positives serve as soft negatives; the row's own negative is forced into the denominator
Temperature 0.05, batch size 128, sequence length 256
Trained for 9 epochs; epoch 4 selected as the released checkpoint based on per-epoch retrieval-margin evaluation
Training loss at epoch 4: 1.6522
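
As a rough illustration of the Stage 3 objective listed above, here is a minimal plain-Python sketch of a symmetric InfoNCE loss with one explicit hard negative per row. It is not the training code: the function name, the equal weighting of the two directions, and the choice to add the hard negative only to the query-to-positive direction are all assumptions, and a real implementation would use batched tensor operations. Inputs are assumed to be unit-norm embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax_at(logits, index):
    """Numerically stable log-softmax evaluated at one position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[index] - lse

def info_nce_with_hard_negatives(queries, positives, negatives, temperature=0.05):
    n = len(queries)
    forward, reverse = 0.0, 0.0
    for i in range(n):
        # query -> positive: every in-batch positive is a candidate,
        # and row i's own hard negative is appended to the denominator.
        logits = [dot(queries[i], positives[j]) / temperature for j in range(n)]
        logits.append(dot(queries[i], negatives[i]) / temperature)
        forward -= log_softmax_at(logits, i)
        # positive -> query: the symmetric reverse direction
        # (assumed here to use in-batch queries only).
        logits_rev = [dot(positives[i], queries[j]) / temperature for j in range(n)]
        reverse -= log_softmax_at(logits_rev, i)
    return 0.5 * (forward + reverse) / n

# Toy batch: each query matches its positive exactly and is orthogonal to its negative.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[0.0, 1.0], [1.0, 0.0]]
loss = info_nce_with_hard_negatives(queries, positives, negatives)
print(loss)  # close to zero for this perfectly separated batch
```

With temperature 0.05, a cosine gap of 1.0 becomes a logit gap of 20, which is why well-separated rows contribute almost nothing to the loss while confusable rows dominate it.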

Why epoch 4 and not the final epoch

Per-epoch monitoring on the in-distribution sample showed retrieval margin peaking at epoch 4 and degrading afterward as training loss collapsed below ~1.0. This is the familiar overfitting pattern: once the model starts memorizing individual training rows, lower training loss no longer translates into better retrieval. Epoch 4 was therefore selected as the release point: the embedding space has spread out enough to discriminate well (mean_pos / mean_neg cosines around 0.90 / 0.75) without overfitting the triplet table.
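
The selection rule described above amounts to taking the epoch with the largest evaluated margin rather than the last epoch. A minimal sketch follows; the function name is hypothetical and the margin values are placeholders for illustration only, not the measured per-epoch numbers from this run.

```python
def select_checkpoint(margin_by_epoch):
    """Return the epoch with the largest mean retrieval margin (earliest on ties)."""
    return max(sorted(margin_by_epoch), key=lambda epoch: margin_by_epoch[epoch])

# Placeholder margins (made up): a rise followed by a decline.
example_margins = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4, 5: 0.3, 6: 0.2}
best_epoch = select_checkpoint(example_margins)
print(best_epoch)
```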

Evaluation

Evaluated on a held-out-style sample of 50 triplets drawn from the same distribution as the training data, so these numbers measure in-distribution retrieval only. The table compares this checkpoint against the Stage 2 baseline:

Checkpoint	Win rate	Mean margin	Mean(pos)	Mean(neg)
`theo-bert-base` (Stage 2, no contrastive)	32 / 50 ¹	+0.00006	0.9998	0.9998
`theo-bert-search` (Stage 3, epoch 4)	44 / 50 (88%)	+0.1516	0.8983	0.7467

¹ The Stage 2 baseline's nominal "32 / 50" wins are not meaningful: every cosine is ~0.9998, so the embeddings have no discriminative geometry and the wins are fifth-decimal-place noise. For context, this is the well-known anisotropy (representation collapse) problem of mean-pooled MLM-only encoders, and it is the reason this pipeline includes a contrastive mid-training pass.
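
For readers who want to reproduce the table's columns on their own triplets, the sketch below shows one way to compute them from per-triplet cosine similarities. The function name is hypothetical, the toy similarities are not the evaluation data, and counting a win as a strict `pos > neg` is an assumption about how the card defines it.

```python
def triplet_metrics(cosine_pairs):
    """cosine_pairs: list of (cos(query, positive), cos(query, negative))."""
    n = len(cosine_pairs)
    wins = sum(1 for pos, neg in cosine_pairs if pos > neg)
    mean_pos = sum(pos for pos, _ in cosine_pairs) / n
    mean_neg = sum(neg for _, neg in cosine_pairs) / n
    return {
        "wins": wins,
        "win_rate": wins / n,
        "mean_margin": mean_pos - mean_neg,  # equals the mean of per-triplet margins
        "mean_pos": mean_pos,
        "mean_neg": mean_neg,
    }

# Toy similarities for illustration only.
toy = [(0.9, 0.7), (0.8, 0.85), (0.95, 0.6), (0.7, 0.5)]
metrics = triplet_metrics(toy)
print(metrics["wins"], metrics["win_rate"])
```

The reported row is internally consistent under this definition: 44 / 50 is 0.88, and 0.8983 minus 0.7467 gives the stated mean margin of 0.1516.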

Tokenizer

Same as the base — uses the google-bert/bert-base-uncased tokenizer. The fast-tokenizer files (tokenizer.json, tokenizer_config.json) are bundled in this repo so AutoTokenizer.from_pretrained("toranb/theo-bert-search") works without an extra download step.

Tokenizer files are redistributed unmodified from google-bert/bert-base-uncased, released by Google under the Apache License 2.0.

Architecture

Identical to theo-bert-base:

12 transformer blocks, hidden size 768, 8 attention heads (head dim 96)
Training sequence length 256 (rotary cache supports up to 2,560 tokens)
Vocabulary 30,522 via bert-base-uncased
Body: RoPE on Q/K, RMS norm on Q/K, ReLU-squared MLP, gated value embeddings on even layers, learned residual interpolation
MLM head: Linear → GELU → RMSNorm → Linear

Parameter count: 273,051,864 (≈273M).
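
Two of the less common components named above, RMS normalization and the ReLU-squared activation, are simple enough to sketch in a few lines. This is a generic illustration only: the epsilon value and the absence of a learned gain are assumptions, and the repository's modeling file remains the authoritative definition.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector so its root-mean-square is approximately 1 (no learned gain here)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def relu_squared(x):
    """ReLU followed by squaring: negatives map to 0, positives to v**2."""
    return [max(v, 0.0) ** 2 for v in x]

# Head dimension follows from the stated hidden size and head count.
hidden_size, num_heads = 768, 8
head_dim = hidden_size // num_heads

print(head_dim)                        # 96
print(relu_squared([-2.0, 0.5, 3.0]))  # [0.0, 0.25, 9.0]
print(rms_norm([3.0, 4.0]))
```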

Quick Start — Embeddings + Cosine Retrieval

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

repo = "toranb/theo-bert-search"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model.eval()

def embed(texts, max_len=256):
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # [B, T, 768]
    mask = enc["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
    return F.normalize(pooled, dim=-1)  # unit-norm → dot product = cosine

query = embed(["how is a sinner justified before a holy god"])
docs  = embed([
    "a person is justified by faith apart from works of the law",
    "a person is justified by keeping the commandments and good behavior",
    "the kingdom of God is at hand",
])

scores = (query @ docs.T).squeeze(0)
print(scores.tolist())  # higher = more relevant
print("top doc:", scores.argmax().item())

The pooling recipe (mean over real tokens, then L2-normalize) matches the contrastive training-time pipeline. Cosine similarity is just the dot product of normalized vectors.
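
Once documents are embedded and normalized, ranking reduces to sorting dot products. The sketch below shows a small top-k helper on toy unit vectors standing in for real embeddings; the name `top_k` is hypothetical, and for large corpora a vector index would replace this linear scan.

```python
def top_k(query_vec, doc_vecs, k=3):
    """Rank unit-norm document vectors by dot product with a unit-norm query."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, doc)), index)
        for index, doc in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(index, score) for score, index in scored[:k]]

# Toy unit vectors (not real model output).
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
ranking = top_k(query, docs, k=2)
print(ranking)  # [(2, 1.0), (1, 0.6)]
```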

Repository Contents

File	Purpose
`configuration_theo_bert_base.py`	Hugging Face config class (shared with base)
`modeling_theo_bert_base.py`	`AutoModel` / `AutoModelForMaskedLM` implementations (shared with base)
`muon.py`	Local Muon optimizer (retained for self-contained fine-tuning)
`config.json`	Architecture / loader config
`model.safetensors`	Released fp16 weights (Stage 3 / epoch 4)
`checkpoint_metadata.json`	Per-stage training metadata and retrieval eval
`LICENSE`	Apache-2.0

Sibling Release

theo-bert-base — the Stage 2 checkpoint, with the MLM head intact. Use this if you want fill-mask or a starting point for your own token-level fine-tuning. MLM-eval pass rate 94.7%.

Limitations

The MLM head is structurally present but produces degraded fill-mask output. Use theo-bert-base for masked language modeling.
Specialized for biblical and theological language; sentence-level similarity on broad general-domain text is untested and likely weaker than purpose-built general models.
Retrieval evaluation is in-distribution. Out-of-distribution sentence-level tasks (e.g., doctrinal-counterfeit discrimination such as "saved by faith" vs "saved by works") are underdeveloped: the training triplets emphasize passage-level retrieval rather than fine-grained doctrinal discrimination. Adding doctrinal-counterfeit triplets in a future revision is on the roadmap.
Trained at 256-token context. Inputs up to 2,560 tokens fit in the rotary cache, but the model has not been tuned for long context.
Training data is private; external auditing of corpus composition is limited.
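
Given the 256-token training context noted in the list above, one common workaround for longer documents is to split them into overlapping windows, embed each window, and search over the windows. The sketch below shows only the splitting step on a list of token ids. The function name and the stride of 192 are assumptions, this approach has not been evaluated with this model, and in practice the window should leave room for the tokenizer's special tokens.

```python
def sliding_windows(token_ids, window=256, stride=192):
    """Split a token sequence into overlapping windows of at most `window` ids."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the current window already reaches the end
        start += stride
    return windows

chunks = sliding_windows(list(range(600)))
print([len(c) for c in chunks])  # [256, 256, 216]
```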

Release Details

Exported from contrastive_midtraining/epoch_0004.pt (Stage 3, epoch 4 — the per-epoch sweep optimum)
Source checkpoint training loss 1.6522
Released weights in fp16 for bandwidth efficiency (546 MB)
Release format uses safetensors
Loading requires trust_remote_code=True to register the custom architecture
config.json declares torch_dtype: float32, so a default load upcasts the fp16 weights to fp32 on read. The weights on disk stay fp16 (small download), and CPU inference is numerically safe by default. For GPU fp16 inference, pass dtype=torch.float16 to from_pretrained (older transformers releases name this argument torch_dtype).
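
The stated download size follows directly from the parameter count and the storage dtype. A quick arithmetic check, ignoring the small safetensors header:

```python
param_count = 273_051_864        # parameter count stated on this card
bytes_per_param_fp16 = 2         # fp16 stores each weight in 2 bytes

size_bytes = param_count * bytes_per_param_fp16
print(size_bytes)                        # 546103728
print(round(size_bytes / 1e6))           # 546 (decimal megabytes on disk)
print(round(size_bytes * 2 / 1e9, 2))    # 1.09 (GB in memory after upcasting to fp32)
```

The last line is why a default fp32 load uses roughly twice the memory of the file on disk.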
