Data

What's in this folder

val_200.jsonl. 200 held-out validation samples from the SRT-Adapter Reddit corpus, with per-token reflexivity (r_true) and chain-of-interpretants labels. Sufficient for smoke-testing inference and reproducing the per-passage trace artifacts.
archetypes.json. 33 hand-curated discourse archetypes used for the out-of-distribution probe (Section 5.8 of the paper). Each entry is a (label, prompt-set) pair.

Schema (`val_200.jsonl`)

One JSON object per line:

| field | type | description |
| --- | --- | --- |
| `text` | string | raw passage |
| `community_id` | int | Reddit community index (1–35) |
| `community_label` | string | e.g. `reddit:AskTrumpSupporters` |
| `r_true` | list[float] | per-token reflexivity score in [0, 1] |
| `chain_labels` | list[int] | per-token chain-of-interpretants supervision |
| `source` | string | corpus source tag |
| `domain` | string | coarse topical domain |
| `metadata` | object | original Reddit metadata (subreddit, score, etc.) |
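As a smoke test, the schema above can be checked record by record. The following is a minimal sketch, not part of the released package: the names `EXPECTED_TYPES`, `validate_record`, and `load_jsonl` are hypothetical, and the check that `r_true` and `chain_labels` have equal length is an assumption based on both being described as per-token.

```python
import json

# Field names and types come from the schema table; everything else here
# (function names, the specific checks) is an illustrative assumption.
EXPECTED_TYPES = {
    "text": str,
    "community_id": int,
    "community_label": str,
    "r_true": list,
    "chain_labels": list,
    "source": str,
    "domain": str,
    "metadata": dict,
}


def validate_record(record):
    """Return a list of schema problems for one record (empty if none)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    if problems:
        return problems  # skip value checks if the basic shape is wrong

    if not 1 <= record["community_id"] <= 35:
        problems.append("community_id outside 1-35")
    if any(
        not isinstance(r, (int, float)) or not 0.0 <= r <= 1.0
        for r in record["r_true"]
    ):
        problems.append("r_true value outside [0, 1]")
    # Assumption: both per-token lists share one tokenization.
    if len(record["r_true"]) != len(record["chain_labels"]):
        problems.append("r_true and chain_labels differ in length")
    return problems


def load_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

A typical use would be to iterate `load_jsonl("val_200.jsonl")` and print any record for which `validate_record` returns a non-empty list.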

Full corpus (not redistributed here)

The full training corpus is 1,000,000 Reddit comments spanning the 35 listed communities; the held-out validation set comprises 100,000 samples in the same schema. Neither is redistributed in this release because:

Reddit's content terms restrict bulk redistribution.
The corpus is reproducible from the public Pushshift / arctic-shift dumps using the community list and date ranges documented in the paper (Section 4).

To reproduce the training corpus:

1. Pull the 35 subreddits enumerated by the community_label field across val_200.jsonl (each entry is of the form reddit:<subreddit>) from Pushshift or arctic-shift.
2. Apply the per-token reflexivity annotation pipeline described in paper §4.2.
3. Apply the chain-of-interpretants labeling described in paper §4.2.
4. Write JSONL with the schema above.
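The first and last steps can be sketched in a few lines of Python. This is an illustration only: `subreddits_from_records` and `to_jsonl_line` are hypothetical helper names, and the annotation steps in between are not shown because the reference pipeline is not part of this release.

```python
import json

PREFIX = "reddit:"  # community_label values have the form reddit:<subreddit>


def subreddits_from_records(records):
    """Collect the distinct subreddit names found in community_label."""
    names = set()
    for record in records:
        label = record["community_label"]
        if label.startswith(PREFIX):
            names.add(label[len(PREFIX):])
    return sorted(names)


def to_jsonl_line(record):
    """Serialize one record as a single JSONL line (one object per line)."""
    # ensure_ascii=False keeps non-ASCII comment text readable in the file.
    return json.dumps(record, ensure_ascii=False) + "\n"
```

If the returned list has fewer than 35 entries, the 200-sample file does not cover every community, and the community list in the paper should be consulted instead.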

A reference annotation pipeline lives in the private SRT framework repository (held back during patent and publication review). Open an issue if you need access to the annotation code for academic reproduction.

Licensing

val_200.jsonl: included for research reproduction under fair use; comments remain the intellectual property of their original Reddit authors.
archetypes.json: released under the same Apache-2.0 license as the rest of this package.
