TheoBERT Search

theo-bert-search is a domain-specialized encoder for biblical and theological retrieval. It is the contrastive-mid-trained variant of theo-bert-base: same architecture, same lineage, then sharpened on hard-negative triplets so that the mean-pooled sentence vector is suitable for cosine search, ranking, clustering, and RAG over biblical and theology-adjacent corpora.

This is the right checkpoint if you want to build a search engine, a retrieval-augmented generation pipeline, or semantic similarity tooling over biblical-domain text. For masked language modeling or token-level scoring, use the sibling release theo-bert-base instead.

What This Model Is For

Recommended use cases:

  • Semantic search over biblical, doctrinal, or theology-adjacent corpora
  • Retrieval-augmented generation (RAG) with theological document corpora
  • Embedding generation for ranking, reranking, clustering, and nearest-neighbor workflows
  • Sentence-level similarity scoring on theology text

Use the encoder body, mean-pool over real tokens, and L2-normalize the result for cosine similarity. The MLM head from the prior pretraining stages is preserved in the weights but is not usable: contrastive mid-training reshapes the body's geometry for sentence-level similarity, which moves token-level hidden states out of distribution for the original head. Calling AutoModelForMaskedLM.from_pretrained(...) will load successfully but produce degraded fill-mask predictions.

Training Pipeline

This release is the output of a three-stage pipeline. The first two stages are shared with theo-bert-base; this card describes the additional stage that produces the search-shaped artifact.

Stage 1: MLM pretraining (encoder)

24 epochs of masked language modeling from scratch at 256-token context on 270,000 biblical and theological sequences. Final train loss 1.0679, train accuracy 76.42%.

Stage 2: Whole-word-masking continued pretraining (mlmcontinued)

25 additional epochs with 18% whole-word masking. Final train loss 0.8958, train accuracy 79.66%. This is the checkpoint shipped as theo-bert-base. It achieves 94.7% on a 546-case domain MLM benchmark.

Stage 3: Contrastive mid-training (contrastive_midtraining), this release

  • 21,319 hand-curated (query, positive, negative) triplets drawn from the bible-domain corpus
  • InfoNCE loss with explicit hard negatives plus symmetric reverse direction
  • In-batch positives serve as soft negatives; the row's own negative is forced into the denominator
  • Temperature 0.05, batch size 128, sequence length 256
  • Trained for 9 epochs; epoch 4 selected as the released checkpoint based on per-epoch retrieval-margin evaluation
  • Training loss at epoch 4: 1.6522
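
The loss described in the bullets above can be sketched in a few lines. This is a minimal illustration in plain Python, not the training code: the function names are hypothetical, the reverse direction is assumed to contrast each positive against the in-batch queries without a hard negative, and the two directions are assumed to be averaged. The card does not state either detail.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax_at(logits, index):
    """Return log(softmax(logits)[index]) using the log-sum-exp trick."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    return logits[index] - log_z

def info_nce_hard_negative(queries, positives, negatives, temperature=0.05):
    """Symmetric InfoNCE over unit-norm (query, positive, negative) rows."""
    batch = len(queries)
    forward = 0.0
    reverse = 0.0
    for i in range(batch):
        # Forward: query i against every in-batch positive (soft negatives),
        # with the row's own hard negative appended to the denominator.
        logits = [dot(queries[i], positives[j]) / temperature for j in range(batch)]
        logits.append(dot(queries[i], negatives[i]) / temperature)
        forward -= log_softmax_at(logits, i)
        # Reverse (assumed form): positive i against every in-batch query.
        rev_logits = [dot(positives[i], queries[j]) / temperature for j in range(batch)]
        reverse -= log_softmax_at(rev_logits, i)
    # Assumption: the two directions are averaged with equal weight.
    return 0.5 * (forward + reverse) / batch

# Toy 2-D unit vectors, purely illustrative.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[0.0, 1.0], [1.0, 0.0]]
loss_good = info_nce_hard_negative(queries, positives, negatives)
loss_bad = info_nce_hard_negative(queries, negatives, positives)
print(loss_good, loss_bad)
```

With a temperature of 0.05, a cosine gap of 1.0 becomes a logit gap of 20, which is why well-separated rows contribute almost nothing to the loss while confused rows dominate it.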

Why epoch 4 and not the final epoch

Per-epoch monitoring on the in-distribution sample showed that retrieval margin peaks at epoch 4 and degrades afterward as training loss collapses below ~1.0, a classic memorization U-curve once the model has fingerprinted individual training rows. Epoch 4 is the optimum: the body has opened enough to discriminate well (mean_pos / mean_neg cosines around 0.90 / 0.75) without overfitting the triplet table.
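
The selection rule itself is simple: evaluate every saved epoch and keep the one with the largest retrieval margin rather than the lowest training loss. A minimal sketch follows; the function name is hypothetical and the margins in the usage lines are made-up placeholders, not measurements from this model.

```python
def select_checkpoint(margin_by_epoch):
    """Return the epoch with the highest evaluation margin (earliest epoch wins ties)."""
    if not margin_by_epoch:
        raise ValueError("no epochs were evaluated")
    # Iterating in ascending epoch order makes max() prefer the earliest tie.
    return max(sorted(margin_by_epoch), key=lambda epoch: margin_by_epoch[epoch])

# Placeholder margins for illustration only.
toy_margins = {1: 0.05, 2: 0.10, 3: 0.12, 4: 0.15, 5: 0.13, 6: 0.11}
best_epoch = select_checkpoint(toy_margins)
print("selected epoch:", best_epoch)
```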

Evaluation

Evaluated on a held-out-style sample of 50 triplets drawn from the same distribution as the training data, comparing this checkpoint against the Stage 2 baseline:

| Checkpoint | Win rate | Mean margin | Mean(pos) | Mean(neg) |
|---|---|---|---|---|
| theo-bert-base (Stage 2, no contrastive) | 32 / 50 ¹ | +0.00006 | 0.9998 | 0.9998 |
| theo-bert-search (Stage 3, epoch 4) | 44 / 50 (88%) | +0.1516 | 0.8983 | 0.7467 |

¹ The Stage 2 baseline's nominal "32 / 50" wins are not meaningful: every cosine is ~0.9998, so the model has no discriminative geometry, and the wins are fifth-decimal-place noise. For context, this is the well-known anisotropic-collapse pathology of mean-pooled MLM-only encoders, and it is the reason a contrastive mid-training pass exists in this pipeline.
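
The table's columns can be reproduced from two lists of cosines, one for each triplet's positive and one for its negative. The sketch below is a hedged illustration: the function name is hypothetical, and it assumes a "win" means the positive cosine is strictly greater than the negative cosine, which the card does not define. The input values are toy numbers.

```python
def triplet_metrics(pos_sims, neg_sims):
    """Summarize cosine(query, positive) against cosine(query, negative) per triplet."""
    if not pos_sims or len(pos_sims) != len(neg_sims):
        raise ValueError("need two non-empty lists of equal length")
    margins = [p - n for p, n in zip(pos_sims, neg_sims)]
    count = len(margins)
    wins = sum(1 for m in margins if m > 0)  # assumed definition: strict inequality
    return {
        "wins": wins,
        "win_rate": wins / count,
        "mean_margin": sum(margins) / count,
        "mean_pos": sum(pos_sims) / count,
        "mean_neg": sum(neg_sims) / count,
    }

# Toy cosines, purely illustrative.
metrics = triplet_metrics([0.9, 0.8, 0.7, 0.6], [0.5, 0.85, 0.6, 0.6])
print(metrics)
```

Mean margin is the difference of the two column means, which matches the released row: 0.8983 - 0.7467 = 0.1516.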

Tokenizer

Same as the base: it uses the google-bert/bert-base-uncased tokenizer. The fast-tokenizer files (tokenizer.json, tokenizer_config.json) are bundled in this repo so AutoTokenizer.from_pretrained("toranb/theo-bert-search") works without an extra download step.

Tokenizer files are redistributed unmodified from google-bert/bert-base-uncased, released by Google under the Apache License 2.0.

Architecture

Identical to theo-bert-base:

  • 12 transformer blocks, hidden size 768, 8 attention heads (head dim 96)
  • Training sequence length 256 (rotary cache supports up to 2,560 tokens)
  • Vocabulary 30,522 via bert-base-uncased
  • RoPE on Q/K, RMS norm on Q/K, ReLU-squared MLP, gated value embeddings on even layers, learned residual interpolation, MLM head: Linear → GELU → RMSNorm → Linear

Parameter count: 273,051,864 (≈273M).

Quick Start: Embeddings + Cosine Retrieval

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

repo = "toranb/theo-bert-search"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model.eval()

def embed(texts, max_len=256):
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # [B, T, 768]
    mask = enc["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
    return F.normalize(pooled, dim=-1)  # unit-norm, so dot product = cosine

query = embed(["how is a sinner justified before a holy god"])
docs  = embed([
    "a person is justified by faith apart from works of the law",
    "a person is justified by keeping the commandments and good behavior",
    "the kingdom of God is at hand",
])

scores = (query @ docs.T).squeeze(0)
print(scores.tolist())  # higher = more relevant
print("top doc:", scores.argmax().item())

The pooling recipe (mean over real tokens, then L2-normalize) matches the contrastive training-time pipeline. Cosine similarity is just the dot product of normalized vectors.
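
To make "mean over real tokens" concrete, here is a dependency-free sketch of the same recipe on a single sequence. It is illustrative only: the function name is hypothetical and the vectors are toy values. The point it shows is that padded positions never influence the pooled vector.

```python
import math

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors where the mask is 1, then scale to unit length."""
    dim = len(hidden[0])
    kept = [vec for vec, keep in zip(hidden, attention_mask) if keep]
    count = max(len(kept), 1)  # mirrors clamp(min=1) in the snippet above
    pooled = [sum(vec[d] for vec in kept) / count for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0  # avoid dividing by zero
    return [x / norm for x in pooled]

# Two real tokens followed by one padding position.
mask = [1, 1, 0]
pooled_a = masked_mean_pool([[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]], mask)
pooled_b = masked_mean_pool([[3.0, 0.0], [0.0, 4.0], [-7.0, 2.0]], mask)
print(pooled_a, pooled_b)
```

Averaging over all positions instead would let the padding vectors drag the embedding around, so the result would depend on how long the longest text in the batch happened to be.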

Repository Contents

| File | Purpose |
|---|---|
| configuration_theo_bert_base.py | Hugging Face config class (shared with base) |
| modeling_theo_bert_base.py | AutoModel / AutoModelForMaskedLM implementations (shared with base) |
| muon.py | Local Muon optimizer (retained for self-contained fine-tuning) |
| config.json | Architecture / loader config |
| model.safetensors | Released fp16 weights (Stage 3 / epoch 4) |
| checkpoint_metadata.json | Per-stage training metadata and retrieval eval |
| LICENSE | Apache-2.0 |

Sibling Release

  • theo-bert-base: the Stage 2 checkpoint, with the MLM head intact. Use this if you want fill-mask or a starting point for your own token-level fine-tuning. MLM-eval pass rate 94.7%.

Limitations

  • The MLM head is structurally present but produces degraded fill-mask output. Use theo-bert-base for masked language modeling.
  • Specialized for biblical and theological language; sentence-level similarity on broad general-domain text is untested and likely weaker than purpose-built general models.
  • Retrieval evaluation is in-distribution. Out-of-distribution sentence-level tasks (e.g., doctrinal-counterfeit discrimination such as "saved by faith" vs "saved by works") are underdeveloped: the training triplets emphasize passage-level retrieval rather than fine-grained doctrinal discrimination. Adding doctrinal-counterfeit triplets in a future revision is on the roadmap.
  • Trained at 256-token context. Inputs up to 2,560 tokens fit in the rotary cache, but the model has not been tuned for long context.
  • Training data is private; external auditing of corpus composition is limited.

Release Details

  • Exported from contrastive_midtraining/epoch_0004.pt (Stage 3, epoch 4, the per-epoch sweep optimum)
  • Source checkpoint training loss 1.6522
  • Released weights in fp16 for bandwidth efficiency (546 MB)
  • Release format uses safetensors
  • Loading requires trust_remote_code=True to register the custom architecture
  • config.json declares torch_dtype: float32 so default loads upcast on read. Disk weights stay fp16 (small download); CPU inference is numerically safe by default. For GPU fp16 inference, pass dtype=torch.float16 to from_pretrained (older transformers releases spell this argument torch_dtype).
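
The 546 MB figure follows directly from the parameter count, and the same arithmetic shows what the default float32 upcast costs in memory. This is a back-of-the-envelope sketch: it assumes 1 MB = 10^6 bytes and counts only parameter tensors, ignoring the safetensors header and any non-parameter buffers.

```python
PARAM_COUNT = 273_051_864  # parameter count stated in the Architecture section

def weight_bytes(param_count, bytes_per_param):
    """Raw tensor storage for a given numeric width."""
    return param_count * bytes_per_param

fp16_mb = weight_bytes(PARAM_COUNT, 2) / 1e6  # on-disk size of the release
fp32_mb = weight_bytes(PARAM_COUNT, 4) / 1e6  # in-memory size after the default upcast
print(f"fp16 on disk:   {fp16_mb:.1f} MB")
print(f"fp32 in memory: {fp32_mb:.1f} MB")
```

Loading in fp16 on GPU therefore roughly halves the weight footprint compared with the default load.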