TheoBERT Search
theo-bert-search is a domain-specialized encoder for biblical and theological retrieval. It is the contrastive-mid-trained variant of theo-bert-base (same architecture, same lineage), sharpened on hard-negative triplets so that the mean-pooled sentence vector is suitable for cosine search, ranking, clustering, and RAG over biblical and theology-adjacent corpora.
This is the right checkpoint if you want to build a search engine, a retrieval-augmented generation pipeline, or semantic-similarity tooling over the biblical domain. For masked language modeling or token-level scoring, use the sibling release theo-bert-base instead.
What This Model Is For
Recommended use cases:
- Semantic search over biblical, doctrinal, or theology-adjacent corpora
- Retrieval-augmented generation (RAG) with theological document corpora
- Embedding generation for ranking, reranking, clustering, and nearest-neighbor workflows
- Sentence-level similarity scoring on theology text
Use the encoder body, mean-pool over real tokens, and L2-normalize the result for cosine similarity. The MLM head from the prior pretraining stages is preserved in the weights but is not usable: contrastive mid-training reshapes the body's geometry for sentence-level similarity, which moves token-level hidden states out of distribution for the original head. Calling AutoModelForMaskedLM.from_pretrained(...) will load successfully but produce degraded fill-mask predictions.
Training Pipeline
This release is the output of a three-stage pipeline. The first two stages are shared with theo-bert-base; this card describes the additional stage that produces the search-shaped artifact.
Stage 1: MLM pretraining (encoder)
24 epochs of masked language modeling from scratch at 256-token context on 270,000 biblical and theological sequences. Final train loss 1.0679, train accuracy 76.42%.
Stage 2: Whole-word-masking continued pretraining (mlmcontinued)
25 additional epochs with 18% whole-word-masking. Final train loss 0.8958, train accuracy 79.66%. This is the checkpoint shipped as theo-bert-base. It achieves 94.7% on a 546-case domain MLM benchmark.
Stage 3: Contrastive mid-training (contrastive_midtraining), this release
- 21,319 hand-curated (query, positive, negative) triplets drawn from the bible-domain corpus
- InfoNCE loss with explicit hard negatives plus symmetric reverse direction
- In-batch positives serve as soft negatives; the row's own negative is forced into the denominator
- Temperature 0.05, batch size 128, sequence length 256
- Trained for 9 epochs; epoch 4 selected as the released checkpoint based on per-epoch retrieval-margin evaluation
- Training loss at epoch 4: 1.6522
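The loss described in the list above can be sketched in plain Python for a tiny batch. This is an illustration under stated assumptions, not the training code: the function name `info_nce_with_hard_negatives` is hypothetical, the reverse direction is assumed to score each positive against all in-batch queries without the hard negative, and the two directions are assumed to be averaged. Inputs are assumed to be unit-norm vectors, so dot products are cosines.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax_at(logits, target):
    """Log-probability of index `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[target] - log_z

def info_nce_with_hard_negatives(queries, positives, negatives, temperature=0.05):
    """Hypothetical sketch: symmetric InfoNCE with one hard negative per row.

    All three inputs are lists of unit-norm vectors with the same length B.
    """
    batch = len(queries)
    total = 0.0
    for i in range(batch):
        # Forward: query i against every in-batch positive (soft negatives
        # for j != i), with its own hard negative appended to the denominator.
        fwd = [dot(queries[i], p) / temperature for p in positives]
        fwd.append(dot(queries[i], negatives[i]) / temperature)
        # Reverse (assumed form): positive i against every in-batch query.
        rev = [dot(positives[i], q) / temperature for q in queries]
        total += -0.5 * (log_softmax_at(fwd, i) + log_softmax_at(rev, i))
    return total / batch

# Toy 2-D unit vectors, invented for illustration only.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
easy_negs = [[-1.0, 0.0], [0.0, -1.0]]   # point away from the query
hard_negs = [[1.0, 0.0], [0.0, 1.0]]     # indistinguishable from the positives

easy_loss = info_nce_with_hard_negatives(queries, positives, easy_negs)
hard_loss = info_nce_with_hard_negatives(queries, positives, hard_negs)
print(easy_loss, hard_loss)  # the hard-negative batch yields the larger loss
```

With a temperature as low as 0.05, a hard negative that scores close to the positive dominates the denominator, which is what pushes the encoder to separate the two.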
Why epoch 4 and not the final epoch
Per-epoch monitoring on the in-distribution sample showed that retrieval margin peaks at epoch 4 and degrades afterward as training loss collapses below ~1.0, a classic memorization pattern once the model has fingerprinted individual training rows. Epoch 4 is the optimum: the encoder body separates positives from negatives well (mean_pos / mean_neg cosines around 0.90 / 0.75) without overfitting the triplet table.
Evaluation
Evaluated on a held-out-style sample of 50 triplets drawn from the same distribution as the training data, comparing this checkpoint against the Stage 2 baseline:
| Checkpoint | Win rate | Mean margin | Mean(pos) | Mean(neg) |
|---|---|---|---|---|
| theo-bert-base (Stage 2, no contrastive) | 32 / 50 ¹ | +0.00006 | 0.9998 | 0.9998 |
| theo-bert-search (Stage 3, epoch 4) | 44 / 50 (88%) | +0.1516 | 0.8983 | 0.7467 |
¹ The Stage 2 baseline's nominal "32 / 50" wins are not meaningful: every cosine is ~0.9998, so the model has no discriminative geometry and the wins are fifth-decimal-place noise. For context, this is the well-known anisotropic-collapse pathology of mean-pooled MLM-only encoders, and it is the reason a contrastive mid-training pass exists in this pipeline.
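The table columns can be derived from per-triplet cosines roughly as follows. This is a minimal sketch, assuming a "win" means the positive scores strictly higher than the negative and the margin is their difference; the helper name `triplet_metrics` and the toy scores are hypothetical and are not the model's actual outputs.

```python
def triplet_metrics(pos_scores, neg_scores):
    """Summarize per-triplet cosines: wins, mean margin, mean(pos), mean(neg)."""
    n = len(pos_scores)
    margins = [p - q for p, q in zip(pos_scores, neg_scores)]
    return {
        "wins": sum(m > 0 for m in margins),   # positive outranks negative
        "total": n,
        "mean_margin": sum(margins) / n,
        "mean_pos": sum(pos_scores) / n,
        "mean_neg": sum(neg_scores) / n,
    }

# Invented cosines for four triplets, for illustration only.
pos = [0.92, 0.88, 0.90, 0.70]
neg = [0.75, 0.80, 0.60, 0.85]
stats = triplet_metrics(pos, neg)
print(stats)
```

Under this definition the mean margin equals Mean(pos) minus Mean(neg), which agrees with the Stage 3 row: 0.8983 - 0.7467 = 0.1516.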
Tokenizer
Same as the base: this model uses the google-bert/bert-base-uncased tokenizer. The fast-tokenizer files (tokenizer.json, tokenizer_config.json) are bundled in this repo so AutoTokenizer.from_pretrained("toranb/theo-bert-search") works without an extra download step.
Tokenizer files are redistributed unmodified from google-bert/bert-base-uncased, released by Google under the Apache License 2.0.
Architecture
Identical to theo-bert-base:
- 12 transformer blocks, hidden size 768, 8 attention heads (head dim 96)
- Training sequence length 256 (rotary cache supports up to 2,560 tokens)
- Vocabulary 30,522 via bert-base-uncased
- RoPE on Q/K, RMS norm on Q/K, ReLU-squared MLP, gated value embeddings on even layers, learned residual interpolation, MLM head: Linear → GELU → RMSNorm → Linear
Parameter count: 273,051,864 (≈273M).
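For readers unfamiliar with two of the building blocks named above, the textbook definitions are short. The sketch below assumes the standard forms (RMS normalization without a learned gain, and ReLU followed by squaring); the actual implementation lives in modeling_theo_bert_base.py and may differ in details such as the epsilon value or learned scales.

```python
import math

def rms_norm(x, eps=1e-6):
    """Rescale a vector so its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def relu_squared(x):
    """ReLU-squared activation: max(v, 0) ** 2 applied elementwise."""
    return [max(v, 0.0) ** 2 for v in x]

normed = rms_norm([3.0, 4.0])
activated = relu_squared([-2.0, 0.5, 3.0])
print(normed)
print(activated)
```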
Quick Start: Embeddings + Cosine Retrieval
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

repo = "toranb/theo-bert-search"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model.eval()

def embed(texts, max_len=256):
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # [B, T, 768]
    # Mean-pool over real tokens only; padding positions are masked out.
    mask = enc["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
    return F.normalize(pooled, dim=-1)  # unit-norm, so dot product = cosine

query = embed(["how is a sinner justified before a holy god"])
docs = embed([
    "a person is justified by faith apart from works of the law",
    "a person is justified by keeping the commandments and good behavior",
    "the kingdom of God is at hand",
])

scores = (query @ docs.T).squeeze(0)
print(scores.tolist())  # higher = more relevant
print("top doc:", scores.argmax().item())
```
The pooling recipe (mean over real tokens, then L2-normalize) matches the contrastive training-time pipeline. Cosine similarity is just the dot product of normalized vectors.
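When ranking more than a handful of documents, the argmax above generalizes to a top-k selection. A minimal sketch in plain Python, assuming the vectors have already been L2-normalized as in `embed`; the helper name `top_k` and the toy vectors are hypothetical stand-ins for real embeddings.

```python
def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k highest dot-product scores."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]

# Toy unit-norm vectors, invented for illustration only.
query_vec = [1.0, 0.0]
doc_vecs = [[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]]
ranked = top_k(query_vec, doc_vecs, k=2)
print(ranked)
```

For large corpora a vector index would normally replace the full sort, but the scoring rule stays the same.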
Repository Contents
| File | Purpose |
|---|---|
| configuration_theo_bert_base.py | Hugging Face config class (shared with base) |
| modeling_theo_bert_base.py | AutoModel / AutoModelForMaskedLM implementations (shared with base) |
| muon.py | Local Muon optimizer (retained for self-contained fine-tuning) |
| config.json | Architecture / loader config |
| model.safetensors | Released fp16 weights (Stage 3 / epoch 4) |
| checkpoint_metadata.json | Per-stage training metadata and retrieval eval |
| LICENSE | Apache-2.0 |
Sibling Release
theo-bert-base: the Stage 2 checkpoint, with the MLM head intact. Use this if you want fill-mask or a starting point for your own token-level fine-tuning. MLM-eval pass rate 94.7%.
Limitations
- The MLM head is structurally present but produces degraded fill-mask output. Use theo-bert-base for masked language modeling.
- Specialized for biblical and theological language; sentence-level similarity on broad general-domain text is untested and likely weaker than purpose-built general models.
- Retrieval evaluation is in-distribution. Out-of-distribution sentence-level tasks (e.g., doctrinal-counterfeit discrimination such as "saved by faith" vs "saved by works") are underdeveloped: the training triplets emphasize passage-level retrieval rather than fine-grained doctrinal discrimination. Adding doctrinal-counterfeit triplets in a future revision is on the roadmap.
- Trained at 256-token context. Inputs up to 2,560 tokens fit in the rotary cache, but the model has not been tuned for long context.
- Training data is private; external auditing of corpus composition is limited.
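Because the model was trained at a 256-token context, one common workaround for longer passages is to split them into overlapping windows, embed each window, and search over the windows. The sketch below is an assumption about how one might do this, not part of the release: `chunk_token_ids`, the 254-token budget (256 minus the two special tokens a BERT-style tokenizer adds), and the 32-token overlap are illustrative choices.

```python
def chunk_token_ids(token_ids, window=254, overlap=32):
    """Split a list of token ids into overlapping windows of at most `window` ids."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the input
    return chunks

# Stand-in for ids obtained with tokenizer(text, add_special_tokens=False)["input_ids"].
ids = list(range(600))
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])
```

Each window can then be decoded back to text and passed to `embed`; whether windowed retrieval works well for this model is untested here and should be verified on your own corpus.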
Release Details
- Exported from contrastive_midtraining/epoch_0004.pt (Stage 3, epoch 4, the per-epoch sweep optimum)
- Source checkpoint training loss 1.6522
- Released weights in fp16 for bandwidth efficiency (546 MB)
- Release format uses safetensors
- Loading requires trust_remote_code=True to register the custom architecture
- config.json declares torch_dtype: float32 so default loads upcast on read. Disk weights stay fp16 (small download); CPU inference is numerically safe by default. For GPU fp16 inference, pass dtype=torch.float16 to from_pretrained.
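The 546 MB figure is consistent with the parameter count, since fp16 stores each parameter in two bytes. A quick check, assuming every parameter is stored in fp16 and ignoring the small safetensors header:

```python
PARAMS = 273_051_864      # parameter count stated in the Architecture section
BYTES_PER_FP16 = 2
BYTES_PER_FP32 = 4

fp16_bytes = PARAMS * BYTES_PER_FP16   # approximate size on disk
fp32_bytes = PARAMS * BYTES_PER_FP32   # approximate size after the default float32 upcast

print(f"fp16 on disk:   {fp16_bytes / 1e6:.0f} MB")
print(f"fp32 in memory: {fp32_bytes / 1e6:.0f} MB")
```

The second number is a rough guide to the memory needed for a default load; passing a half-precision dtype keeps the in-memory footprint close to the download size.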