Data
What's in this folder
- val_200.jsonl. 200 held-out validation samples from the SRT-Adapter Reddit corpus, with per-token reflexivity (r_true) and chain-of-interpretants labels. Sufficient for smoke-testing inference and reproducing the per-passage trace artifacts.
- archetypes.json. 33 hand-curated discourse archetypes used for the out-of-distribution probe (Section 5.8 of the paper). Each entry is a (label, prompt-set) pair.
Schema (val_200.jsonl)
One JSON object per line:
| field | type | description |
|---|---|---|
| text | string | raw passage |
| community_id | int | Reddit community index (1–35) |
| community_label | string | e.g. reddit:AskTrumpSupporters |
| r_true | list[float] | per-token reflexivity score in [0, 1] |
| chain_labels | list[int] | per-token chain-of-interpretants supervision |
| source | string | corpus source tag |
| domain | string | coarse topical domain |
| metadata | object | original Reddit metadata (subreddit, score, etc.) |
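As a quick sanity check on the schema above, the following is a minimal sketch of a loader and per-record validator. The field names and types come from the table; the helper names (`load_jsonl`, `validate_record`) are hypothetical, and the check that `r_true` and `chain_labels` have equal length is an assumption based on both being described as per-token.

```python
import json
from typing import Any, Iterator

# Field names and JSON types follow the schema table. The checks themselves
# are illustrative and not part of the released package.
EXPECTED_TYPES = {
    "text": str,
    "community_id": int,
    "community_label": str,
    "r_true": list,
    "chain_labels": list,
    "source": str,
    "domain": str,
    "metadata": dict,
}


def load_jsonl(path: str) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def validate_record(rec: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the record passed."""
    problems = []
    for name, typ in EXPECTED_TYPES.items():
        if name not in rec:
            problems.append(f"missing field: {name}")
        elif not isinstance(rec[name], typ):
            problems.append(f"{name}: expected {typ.__name__}")
    if problems:
        return problems  # skip value checks when the shape is already wrong

    if not 1 <= rec["community_id"] <= 35:
        problems.append("community_id outside 1..35")
    if not all(isinstance(x, (int, float)) and 0.0 <= x <= 1.0 for x in rec["r_true"]):
        problems.append("r_true has a value outside [0, 1]")
    if not all(isinstance(x, int) for x in rec["chain_labels"]):
        problems.append("chain_labels has a non-integer value")
    # Assumption: both lists are per-token, so their lengths should match.
    if len(rec["r_true"]) != len(rec["chain_labels"]):
        problems.append("r_true and chain_labels differ in length")
    return problems
```

Typical use would be to iterate `load_jsonl("val_200.jsonl")` and report any record for which `validate_record` returns a non-empty list.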
Full corpus (not redistributed here)
The full training corpus is 1,000,000 Reddit comments spanning the 35 listed communities; the held-out validation set is 100,000 samples in the same schema. Neither is redistributed in this release because:
- Reddit's content terms restrict bulk redistribution.
- The corpus is reproducible from the public Pushshift / arctic-shift dumps using the community list and date ranges documented in the paper (Section 4).
To reproduce the training corpus:
- Pull the 35 subreddits enumerated by the community_label field across val_200.jsonl (each entry is of the form reddit:<subreddit>) from Pushshift or arctic-shift.
- Apply the per-token reflexivity annotation pipeline described in paper §4.2.
- Apply the chain-of-interpretants labeling described in paper §4.2.
- Write JSONL with the schema above.
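The first and last steps above can be sketched as follows. This is a minimal illustration, not the reference pipeline: `subreddits_from_labels` and `write_jsonl` are hypothetical helper names, and the two annotation steps are deliberately left out because they are only described in the paper.

```python
import json
from typing import Any, Iterable


def subreddits_from_labels(records: Iterable[dict[str, Any]]) -> list[str]:
    """Collect unique subreddit names from labels of the form reddit:<subreddit>."""
    names = set()
    for rec in records:
        prefix, _, name = rec["community_label"].partition(":")
        if prefix == "reddit" and name:
            names.add(name)
    return sorted(names)


def write_jsonl(records: Iterable[dict[str, Any]], path: str) -> None:
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

If `subreddits_from_labels` returns fewer than 35 names for val_200.jsonl, the sample does not cover every community, and the remaining names would have to come from the community list in the paper.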
A reference annotation pipeline lives in the private SRT framework repository (held back during patent and publication review). Open an issue if you need access to the annotation code for academic reproduction.
Licensing
- val_200.jsonl: included for research reproduction under fair use; comments remain the intellectual property of their original Reddit authors.
- archetypes.json: released under the same Apache-2.0 license as the rest of this package.