Validation history of the SRT program
This document distills the validation evidence behind the SRT adapter (v8a)
into a single self-contained record. It covers the three validation stages
that established the architectural claims this adapter inherits:
- Stage 1. Synthetic controlled experiments (4/4 tests passed).
- Stage 2. Natural-language validation on a real news corpus (5/5 tests passed).
- Stage 3 Phase 1. Hybrid model on a frozen pretrained backbone (4/5 tests passed at the first gate evaluation; the regime-classification test motivated the remediation campaign that produced the lightweight adapter program reported in paper.pdf).
For the canonical theoretical record, see Lancaster (2025), "The Treachery
of Signs," SSRN 5987495, and
Lancaster (2026a), "Semiotic-Reflexive Language Model Training," SSRN
6349978. The numbers below
appear in those papers in their full original context; this file is the
concise summary referenced from the model card.
Stage 1: Synthetic controlled experiments
Gate G1: PASSED (4/4). Each test isolates one architectural claim on
synthetic data with planted ground-truth signals.
1.1 Subspace specialization (linear probing)
| Task | Target subspace | Target acc | Control acc | Margin | Threshold |
| --- | --- | --- | --- | --- | --- |
| Token identity | Representamen | 99.31% | 1.31% | 0.980 | $\geq 0.15$ |
| Community membership | Interpretant | 100.00% | 64.42% | 0.356 | $\geq 0.15$ |
| Attractor basin | Attractor | 100.00% | 84.52% | 0.155 | $\geq 0.15$ |
| Position in sequence | Object | 43.22% | 12.99% | 0.302 | $\geq 0.15$ |
The four Peircean subspaces encode qualitatively different information.
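The probing protocol behind these margins can be sketched in a few lines. Everything below is a hypothetical stand-in (the data, dimensions, and the `probe_accuracy` helper), assuming a logistic-regression linear probe; the real features come from the SRT model's subspaces.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a target subspace that encodes the planted label,
# and a control subspace that carries no label information.
n, d, k = 2000, 64, 10
labels = rng.integers(0, k, size=n)
target = np.eye(k)[labels] @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))
control = rng.normal(size=(n, d))

def probe_accuracy(features, y):
    """Fit a linear probe on half the data and score it on the held-out half."""
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, test_size=0.5, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

margin = probe_accuracy(target, labels) - probe_accuracy(control, labels)
print(f"target-minus-control margin: {margin:.3f}")  # the pass bar is >= 0.15
```

The margin, not raw accuracy, is what each row of the table reports: it controls for whatever label information leaks into every subspace.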
1.2 Community differentiation
| Metric | Value |
| --- | --- |
| Mean cosine distance, contested signs (20 words) | 0.3622 |
| Mean cosine distance, neutral signs (79 words) | 0.1103 |
| Ratio | $3.28\times$ (threshold $\geq 3.0\times$) |
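The contested-versus-neutral ratio can be reproduced in miniature. The embeddings below are synthetic stand-ins (a shared base vector plus per-community drift), not the model's actual sign embeddings; only the metric itself matches the test.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_pairwise_cosine_distance(vectors):
    """Mean cosine distance over all pairs of per-community embeddings of one sign."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v.T
    iu = np.triu_indices(len(v), k=1)  # each unordered pair once
    return float(np.mean(1.0 - sims[iu]))

# One contested sign: each of 5 communities drifts in its own direction.
base = rng.normal(size=16)
contested = np.stack([base + 0.8 * rng.normal(size=16) for _ in range(5)])
# One neutral sign: all communities agree up to small noise.
neutral = np.stack([base + 0.1 * rng.normal(size=16) for _ in range(5)])

ratio = mean_pairwise_cosine_distance(contested) / mean_pairwise_cosine_distance(neutral)
print(f"contested/neutral distance ratio: {ratio:.2f}")  # the pass bar is >= 3.0
```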
1.3 Divergence tracking
| Metric | Value |
| --- | --- |
| Spearman $\rho$ between $\hat{r}$ and $r_{\text{true}}$ | 0.8220 ($p \approx 0$) |
| Samples | 64,000 |
| Threshold | $\rho \geq 0.6$ |
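The rank-correlation check itself is standard and can be sketched with `scipy.stats.spearmanr`; `r_true` and `r_hat` below are synthetic stand-ins for the planted divergence parameter and a noisy model estimate.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
r_true = rng.uniform(0.0, 1.0, size=64_000)     # planted ground-truth divergence
r_hat = r_true + 0.2 * rng.normal(size=64_000)  # hypothetical noisy estimate

rho, p = spearmanr(r_hat, r_true)
print(f"Spearman rho = {rho:.3f} (p = {p:.2e})")  # the pass bar is rho >= 0.6
```

Spearman (rather than Pearson) is the right choice here because the estimator only needs to preserve the ordering of divergence levels, not their scale.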
1.4 Bifurcation detection
| Metric | Value |
| --- | --- |
| Mean $\hat{r}$ difference, post minus pre | 0.6588 (threshold $> 0.2$) |
| Regime classification accuracy | 100.00% (threshold $> 75$%) |
| Samples | 500 |
Stage 2: Natural-language validation
Gate G2: PASSED (5/5). The full architecture re-tested on a curated
news corpus (5 communities, 19K articles, 141K Peircean sign annotations,
contested terms including freedom, justice, patriot).
2.1 Community embedding structure
| Metric | Value |
| --- | --- |
| Contested silhouette | 0.5293 (threshold $> 0.15$) |
| Neutral silhouette | 0.3653 |
| Silhouette ratio (contested over neutral) | $1.45\times$ (threshold $> 1.3\times$) |
| Samples | 5,000 contested + 5,000 neutral |
| Communities | 5 |
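The silhouette-ratio criterion can be sketched with `sklearn.metrics.silhouette_score`. The cluster geometry below is a synthetic stand-in in which contested terms get tighter per-community clusters than neutral ones by construction; the real inputs are the model's community-conditioned embeddings.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
n_comm, per_comm, d = 5, 200, 32
labels = np.repeat(np.arange(n_comm), per_comm)  # community label per point
centers = rng.normal(size=(n_comm, d))

def embeddings(spread):
    """Points labeled by community; smaller spread means tighter clusters."""
    return centers[labels] + spread * rng.normal(size=(len(labels), d))

contested = embeddings(spread=0.4)  # cleanly separated by community
neutral = embeddings(spread=1.0)    # weakly separated

ratio = silhouette_score(contested, labels) / silhouette_score(neutral, labels)
print(f"silhouette ratio (contested over neutral): {ratio:.2f}")  # pass bar is > 1.3
```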
2.2 Divergence vectors on contested terms
| Metric | Value |
| --- | --- |
| Group A (divergent connections) mean | 15.3730 |
| Group B (referential only) mean | 6.7001 |
| Ratio | $2.29\times$ (threshold $\geq 2.0\times$) |
| Cohen's $d$ | 0.378 |
| Tokens | 81,839 vs 5,047 |
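The two summary statistics in this table (mean ratio and Cohen's $d$) can be sketched as follows. The gamma-distributed samples are synthetic stand-ins chosen to have roughly the reported group means, not the actual divergence norms.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of the two groups."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

rng = np.random.default_rng(4)
group_a = rng.gamma(shape=2.0, scale=7.5, size=10_000)  # divergent connections
group_b = rng.gamma(shape=2.0, scale=3.3, size=1_000)   # referential only

print(f"mean ratio = {group_a.mean() / group_b.mean():.2f}")  # pass bar is >= 2.0
print(f"Cohen's d = {cohens_d(group_a, group_b):.2f}")
```

Note that a large mean ratio can coexist with a modest Cohen's $d$ (as in the table) when the within-group spread is wide relative to the gap between means.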
2.3 $\hat{r}$ vs external polarization
| Metric | Value |
| --- | --- |
| Pearson $r$ | 0.8843 ($p \approx 0$) |
| Threshold | $r \geq 0.3$ |
| Samples | 2,120 |
| Mean $\hat{r}$ | 0.3257 |
| Mean external divergence | 0.3759 |
2.4 Cross-topic transfer (zero-shot)
| Metric | Value |
| --- | --- |
| Held-out contested mean divergence | 19.1582 (45,601 tokens) |
| Held-out neutral mean divergence | 14.6394 (1,265 tokens) |
| Ratio | $1.31\times$ (threshold $> 1.3\times$) |
2.5 Regime classification on curated passages
| Metric | Value |
| --- | --- |
| Accuracy | 85.00% (threshold $\geq 70$%) |
| ROC AUC | 0.8988 |
| Mean $\hat{r}$, low-divergence (50 passages) | 0.2845 ± 0.0619 |
| Mean $\hat{r}$, high-divergence (50 passages) | 0.4439 ± 0.0931 |
| Cohen's $d$ | 2.016 |
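The accuracy/AUC pair above can be sketched as a threshold on $\hat{r}$ plus a threshold-free ranking score. The two Gaussian clusters of $\hat{r}$ values mimic the reported means and spreads, and the 0.35 decision threshold is a hypothetical choice for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(5)
y_true = np.array([0] * 50 + [1] * 50)  # 50 low- and 50 high-divergence passages
r_hat = np.where(y_true == 0,
                 rng.normal(0.28, 0.06, size=100),   # low-divergence estimates
                 rng.normal(0.44, 0.09, size=100))   # high-divergence estimates

acc = accuracy_score(y_true, (r_hat >= 0.35).astype(int))  # hypothetical cutoff
auc = roc_auc_score(y_true, r_hat)                         # threshold-free ranking
print(f"accuracy = {acc:.2%}, ROC AUC = {auc:.3f}")        # pass bar is acc >= 70%
```

Reporting both is informative: accuracy depends on the chosen cutoff, while AUC measures how well $\hat{r}$ separates the two regimes across all cutoffs.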
Stage 3 Phase 1: Hybrid model on a frozen backbone
Gate G3a: 3/5 at the end of the campaign (R21 through R105). The full
Stage-2 architecture was grafted onto a frozen TinyLlama-1.1B backbone.
The first gate evaluation (R21) inverted Stage 2's failure pattern: the
four geometric tests passed, but the regime-classification head collapsed
to a supercritical bias (47% accuracy, well below the 70% threshold). An
85-round remediation campaign followed (R21 through R105), spanning
gradient isolation of the bifurcation-estimation network (BEN), BEN-input
detachment, dual-checkpoint tracking, calign and dmag re-weighting, and
fresh-start retraining with all fixes applied from step 0.
By R105 the regime-classification failure was resolved (85.00%) and the
community-embedding silhouette ratio improved by roughly 3x over R21.
However, two tests that had passed at R21 plateaued during the
remediation campaign and never recovered to threshold on the
2-community Supabase corpus.
Initial gate (R21) versus end of campaign (R105)
| Test | Stage 2 | R21 baseline | R105 final | Threshold | R105 status |
| --- | --- | --- | --- | --- | --- |
| Community embedding (silhouette ratio) | $1.45\times$ | $2.18\times$ | $\sim 6.93\times$ | $> 1.3\times$ | PASS |
| Divergence ratio (contested over neutral) | $2.29\times$ | $5.91\times$ | $\sim 1.05$ to $1.10\times$ | $\geq 2.0\times$ | FAIL (plateau) |
| $\hat{r}$ vs external polarization (Pearson) | 0.88 | 0.65 | $\sim 0.66$ | $\geq 0.3$ | PASS |
| Cross-topic transfer (held-out ratio) | $1.31\times$ | $6.10\times$ | $\sim 1.03$ to $1.04\times$ | $> 1.3\times$ | FAIL (plateau) |
| Regime classification accuracy | 85.00% | 47.00% | 85.00% | $\geq 70$% | PASS |
Diagnosis and pivot
The diagnosis was that the 2-community Supabase corpus was too sparse to
support discriminative divergence-vector training at the scale needed
for a frozen-backbone integration. The two plateaued tests measure
contested-over-neutral norm ratios on the MAH divergence vectors; once
the supervised-contrastive objective reached its data ceiling, no
further architectural change moved them.
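The plateaued quantity is easy to state concretely: the ratio of mean divergence-vector norms for contested versus neutral tokens. A toy stand-in with nearly identical norm scales (all values hypothetical) reproduces the plateau value rather than a pass, which is exactly the failure mode described above.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 128
# Stand-ins for MAH divergence vectors whose norms barely differ by group.
contested = rng.normal(scale=1.05, size=(5_000, d))
neutral = rng.normal(scale=1.00, size=(1_000, d))

ratio = (np.linalg.norm(contested, axis=1).mean()
         / np.linalg.norm(neutral, axis=1).mean())
print(f"divergence-norm ratio: {ratio:.2f}")  # near 1.05, far below the 2.0x bar
```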
That diagnosis triggered the data-first pivot to a denser corpus (the
35-community Reddit Discourse Corpus, ~1M training samples) and a
larger frozen backbone (Qwen 2.5-7B). The Stage 3 Scalable line that
followed (v3 through v8a) is the production form of that pivot. The
adapter released in this package (v8a) inherits the validated geometry
(community embedding, polarization estimation, regime classification)
and re-establishes the divergence-norm contrast on the denser corpus
through a gradient-isolated adapter rather than a from-scratch hybrid
model. The inject-back arm of v8a remains under-developed and is the
central open problem identified in §5.1 and §6.3 of paper.pdf.
Cross-stage capability summary
| Capability | Stage 1 | Stage 2 | Stage 3 Phase 1 (R105) | Comment |
| --- | --- | --- | --- | --- |
| Subspace specialization | $\checkmark$ | --- | --- | Stage-1-only test |
| Community embedding | $\checkmark$ ($3.28\times$) | $\checkmark$ ($1.45\times$) | $\checkmark$ ($\sim 6.93\times$) | Improves with backbone + remediation |
| Divergence tracking | $\checkmark$ ($\rho = 0.82$) | $\checkmark$ ($2.29\times$) | plateau ($\sim 1.05$ to $1.10\times$) | Data ceiling on Supabase corpus |
| Polarization estimation | $\checkmark$ ($\rho = 0.82$) | $\checkmark$ ($r = 0.88$) | $\checkmark$ ($r \approx 0.66$) | Modest regression |
| Bifurcation detection | $\checkmark$ (100%) | $\checkmark$ (85%) | $\checkmark$ (85%, R105) | Recovered through remediation |
| Cross-topic transfer | --- | $\checkmark$ ($1.31\times$) | plateau ($\sim 1.03$ to $1.04\times$) | Data ceiling on Supabase corpus |
Provenance
The Stage 1 and Stage 2 numbers were produced by the standalone SRT
architecture (21M trainable parameters) on synthetic and curated
news-corpus data. The Stage 3 Phase 1 numbers were produced by the
hybrid configuration (frozen TinyLlama-1.1B backbone plus the same
SRT modules) across training rounds R21 through R105. The v8a
adapter released in this package takes the program forward to a frozen
Qwen 2.5-7B backbone and the 35-community Reddit Discourse Corpus with
a re-engineered, gradient-isolated adapter (14.5M trainable),
preserving the validated capabilities and improving validation
cross-entropy from 2.71 (no-adapter baseline) to 2.63 (v8a). See the
v3 through v8a results in §5 of paper.pdf and the lineage discussion
in §1.1.5 and §2.0 of paper.pdf.
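Assuming the reported cross-entropies are in nats (an assumption; the papers state the convention), the improvement can also be read as a perplexity drop:

```python
import math

# Perplexity = exp(cross-entropy) when cross-entropy is measured in nats.
ppl_baseline = math.exp(2.71)  # no-adapter baseline
ppl_v8a = math.exp(2.63)       # v8a adapter
print(f"perplexity: {ppl_baseline:.2f} -> {ppl_v8a:.2f}")  # roughly 15.03 -> 13.87
```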