
Data

What's in this folder

  • val_200.jsonl. 200 held-out validation samples from the SRT-Adapter Reddit corpus, with per-token reflexivity (r_true) and chain-of-interpretants labels. Sufficient for smoke-testing inference and reproducing the per-passage trace artifacts.
  • archetypes.json. 33 hand-curated discourse archetypes used for the out-of-distribution probe (Section 5.8 of the paper). Each entry is a (label, prompt-set) pair.

Schema (val_200.jsonl)

One JSON object per line:

| field | type | description |
|---|---|---|
| text | string | raw passage |
| community_id | int | Reddit community index (1–35) |
| community_label | string | e.g. reddit:AskTrumpSupporters |
| r_true | list[float] | per-token reflexivity score in [0, 1] |
| chain_labels | list[int] | per-token chain-of-interpretants supervision |
| source | string | corpus source tag |
| domain | string | coarse topical domain |
| metadata | object | original Reddit metadata (subreddit, score, etc.) |
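
As a quick sanity check before running inference, the schema above can be turned into a small validator. This is a minimal sketch, not part of the release: the names SCHEMA, validate_record, and validate_file are hypothetical, and the check that r_true and chain_labels have equal length is an assumption (both are described as per-token, but the README does not state that they share a tokenization).

```python
import json

# Top-level fields and the Python types json.loads should produce for them,
# taken from the schema table above.
SCHEMA = {
    "text": str,
    "community_id": int,
    "community_label": str,
    "r_true": list,
    "chain_labels": list,
    "source": str,
    "domain": str,
    "metadata": dict,
}


def validate_record(rec):
    """Return a list of problems found in one record; an empty list means it passed."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in rec:
            problems.append(f"missing field: {field}")
        elif isinstance(rec[field], bool) or not isinstance(rec[field], expected):
            problems.append(f"wrong type for {field}: {type(rec[field]).__name__}")
    if problems:
        return problems  # skip value checks if the basic shape is already wrong

    if not 1 <= rec["community_id"] <= 35:
        problems.append(f"community_id out of range 1-35: {rec['community_id']}")
    if not rec["community_label"].startswith("reddit:"):
        problems.append(f"unexpected community_label: {rec['community_label']}")
    # JSON may encode 0 or 1 as integers, so accept int as well as float.
    if not all(
        isinstance(r, (int, float)) and not isinstance(r, bool) and 0.0 <= r <= 1.0
        for r in rec["r_true"]
    ):
        problems.append("r_true contains a value outside [0, 1]")
    if not all(isinstance(c, int) and not isinstance(c, bool) for c in rec["chain_labels"]):
        problems.append("chain_labels contains a non-integer value")
    # Assumption: both per-token lists are aligned to the same tokenization.
    if len(rec["r_true"]) != len(rec["chain_labels"]):
        problems.append("r_true and chain_labels differ in length")
    return problems


def validate_file(path):
    """Yield (line_number, problems) for every record in a JSONL file that fails."""
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if line.strip():
                problems = validate_record(json.loads(line))
                if problems:
                    yield lineno, problems
```

Calling `list(validate_file("val_200.jsonl"))` should return an empty list if every record matches the documented schema; if the length check fires on otherwise valid data, the alignment assumption is the first thing to drop.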

Full corpus (not redistributed here)

The full training corpus contains 1,000,000 Reddit comments spanning 35 communities; the full held-out validation set contains 100,000 samples in the same schema. Neither is redistributed in this release because:

  1. Reddit's content terms restrict bulk redistribution.
  2. The corpus is reproducible from the public Pushshift / arctic-shift dumps using the community list and date ranges documented in the paper (Section 4).

To reproduce the training corpus:

  1. Download comments for the 35 subreddits named in the community_label field of val_200.jsonl (each value has the form reddit:<subreddit>) from Pushshift or arctic-shift, using the date ranges given in Section 4 of the paper.
  2. Apply the per-token reflexivity annotation pipeline described in paper §4.2.
  3. Apply the chain-of-interpretants labeling described in paper §4.2.
  4. Write JSONL with the schema above.
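
The subreddit list needed for step 1 can be recovered from the sample file itself. The sketch below is illustrative only; the function name subreddits_from_jsonl_lines is hypothetical, and it assumes every community_label value carries the reddit: prefix described above. It does not download anything from Pushshift or arctic-shift.

```python
import json

PREFIX = "reddit:"


def subreddits_from_jsonl_lines(lines):
    """Collect the distinct subreddit names found in community_label, sorted."""
    names = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        label = json.loads(line)["community_label"]
        if not label.startswith(PREFIX):
            raise ValueError(f"unexpected community_label: {label!r}")
        names.add(label[len(PREFIX):])
    return sorted(names)


if __name__ == "__main__":
    with open("val_200.jsonl", encoding="utf-8") as fh:
        subs = subreddits_from_jsonl_lines(fh)
    # The paper describes 35 communities; a 200-sample file may or may not
    # cover all of them, so compare the count before relying on this list.
    print(len(subs), "subreddits found")
    for name in subs:
        print(name)
```

If fewer than 35 names come back, the remaining communities would have to be taken from the paper rather than from this file.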

A reference annotation pipeline lives in the private SRT framework repository (held back during patent and publication review). Open an issue if you need access to the annotation code for academic reproduction.

Licensing

  • val_200.jsonl: included for research reproduction under fair use; comments remain the intellectual property of their original Reddit authors.
  • archetypes.json: released under the same Apache-2.0 license as the rest of this package.