---
editor_options:
  markdown:
    wrap: 72
---

# Replication Guide: Grasping "Gooning"

An observational analysis of Reddit gooning communities via BERTopic topic modelling.


## Project summary

This study empirically analyses publicly available Reddit subreddits dedicated to gooning (prolonged, trance-like masturbation). The analysis pipeline combines:

- **Descriptive statistics** – demographics (age, gender, sexuality from user flair), word frequencies, and substance mentions across 30 subreddits
- **BERTopic topic modelling** – semantic clustering of ~2.35M documents (posts and comments) to identify the major themes and discourse patterns

Full corpus: 30 subreddits, ~22M raw records, 2019–2025


## Repository structure

```
Goon/
├── REPLICATION.md                 ← this file
├── README.md                      ← subreddit list
├── Goon.Rproj                     ← RStudio project (sets working directory)
├── data_prep.qmd                  ← Step 1: R – ingest CSVs, clean, export Parquet
├── goon_analysis.qmd              ← Step 2: R – descriptive analyses
├── goon_topic_analysis.qmd        ← Step 4: R – visualise BERTopic outputs
├── data/
│   ├── *.csv                      ← raw Reddit exports (one file per subreddit per year)
│   ├── GOONED_comments.csv        ← pre-concatenated GOONED comments (all years)
│   ├── GOONED_submissions.csv     ← pre-concatenated GOONED submissions (all years)
│   ├── comments.parquet           ← output of data_prep.qmd (18.2M rows)
│   ├── posts.parquet              ← output of data_prep.qmd (3.8M rows)
│   ├── corpus_clean.parquet       ← output of topic_analysis.qmd (modelling corpus)
│   └── corpus_deleted.parquet     ← deleted/removed rows (tracked separately)
├── topic analysis/
│   ├── topic_analysis.qmd         ← Step 3: Python – BERTopic pipeline
│   ├── topic_api_labeling.qmd     ← Step 3b: Python – optional API-based topic labelling
│   ├── build_topic_results_summary.py  ← helper: generate an HTML summary from run outputs
│   ├── run_topic_analysis.sh      ← shell wrapper that sets env vars and calls quarto
│   ├── README.md                  ← detailed pipeline documentation
│   ├── MEMORY_MANAGEMENT.md       ← strategies for large-corpus runs
│   └── runs/
│       └── full_run_v1/           ← complete output of the full corpus run
│           ├── bertopic/          ← saved BERTopic models + doc-topic assignments
│           ├── topics/            ← keyword tables, summaries, evaluation, API labels
│           └── figures/           ← generated charts
└── .venv/                         ← Python virtual environment (created per instructions below)
```

## System requirements

### Hardware

| Component | Minimum | Recommended |
|---|---|---|
| RAM | 32 GB | 64 GB (full corpus run; see MEMORY_MANAGEMENT.md) |
| Disk | 50 GB free | 100 GB free |
| CPU | Any modern multi-core | 8+ cores for embedding generation |
| GPU | Not required | Optional – speeds up embedding by ~10× |

The full corpus run was executed on a Linux VM with 96 GB RAM. A local Mac with 24 GB will work for the pilot/sample runs but may struggle with the full-corpus embedding step.

### Software

| Tool | Version tested | Notes |
|---|---|---|
| R | ≥ 4.3 | For data_prep.qmd and goon_analysis.qmd |
| RStudio / Quarto CLI | ≥ 1.4 | To render .qmd files |
| Python | ≥ 3.10 | For topic_analysis.qmd |
| Quarto | ≥ 1.4 | Installed with RStudio or standalone |

## Step-by-step replication

### Step 0: Clone / obtain the repository

The `data/` folder containing the raw CSVs is required. These are large files and are not distributed via git – they must be present locally.

Open `Goon.Rproj` in RStudio. This sets the working directory to the project root so that `here::here()` paths resolve correctly.


### Step 1: R data preparation (data_prep.qmd)

**Purpose:** Reads all raw CSVs, combines them into unified data frames, applies minimal cleaning, and exports `data/comments.parquet` and `data/posts.parquet`.

R packages required:

```r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
                   "data.table", "arrow", "here"))
```

Run: open data_prep.qmd in RStudio and click Render (or run all chunks in order).

Alternatively, from the terminal:

```sh
cd /Users/bkot7579/Desktop/Goon
quarto render data_prep.qmd
```

Expected outputs:

- `data/comments.parquet` – ~18.2M rows, ~528 MB
- `data/posts.parquet` – ~3.8M rows, ~186 MB

Time estimate: 15–45 minutes depending on RAM and I/O speed (the GOONED CSV files alone are ~7 GB).


### Step 2: R descriptive analysis (goon_analysis.qmd)

**Purpose:** Demographic analysis (r/GOONEDmeetup flair), word-frequency analysis, substance mention counts.

R packages required:

```r
install.packages(c("dplyr", "tidyr", "tibble", "purrr", "ggplot2",
                   "stringr", "here", "e1071", "tidytext",
                   "data.table", "arrow"))
```

Run: open goon_analysis.qmd in RStudio and click Render.

Prerequisites: `data/comments.parquet` and `data/posts.parquet` must exist (Step 1).

Expected outputs:

- Rendered HTML with plots embedded
- In-memory: word-count tables, substance mention counts, demographic counts


### Step 3: Python BERTopic topic modelling (topic analysis/topic_analysis.qmd)

**Purpose:** Embeds all documents with all-MiniLM-L6-v2, runs HDBSCAN topic modelling via BERTopic, reduces topics using c-TF-IDF agglomerative clustering, evaluates models, and exports all topic outputs.

#### 3a. Create the Python virtual environment

```sh
cd /Users/bkot7579/Desktop/Goon
python3 -m venv .venv
source .venv/bin/activate

pip install bertopic umap-learn hdbscan sentence-transformers \
            scikit-learn pandas pyarrow quarto
```

Key package versions (tested):

| Package | Version |
|---|---|
| bertopic | 0.16.x |
| umap-learn | 0.5.x |
| hdbscan | 0.8.x |
| sentence-transformers | 2.x |
| scikit-learn | 1.x |
| pandas | 2.x |
| pyarrow | 14.x |
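To make a rebuilt environment match the tested versions above, the ranges could be pinned in a requirements file. This is a sketch derived from the table, not a file shipped in the repository:

```
bertopic>=0.16,<0.17
umap-learn>=0.5,<0.6
hdbscan>=0.8,<0.9
sentence-transformers>=2,<3
scikit-learn>=1,<2
pandas>=2,<3
pyarrow>=14,<15
quarto
```

Install with `pip install -r requirements.txt` inside the activated `.venv`.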

#### 3b. Run the pipeline

Pilot run (200k documents – recommended first):

```sh
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --max-docs 200000 --run-tag pilot_200k
```

Outputs land in `topic analysis/runs/pilot_200k/`.

Full corpus run (~2.35M cleaned documents after filtering):

```sh
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v1
```

**Warning:** The full run requires ~64 GB RAM for the embedding + UMAP stages. See `topic analysis/MEMORY_MANAGEMENT.md` for strategies if RAM is limited.

Pipeline stages (executed automatically in order):

1. Data ingestion – loads `data/posts.parquet` + `data/comments.parquet`
2. Light cleaning – deduplication; URL, username, and subreddit anonymisation; markdown stripping; deleted/removed rows saved separately to `data/corpus_deleted.parquet`
3. Embedding – all-MiniLM-L6-v2 (384-dim), batch_size=512, saved as shards; skips already-generated shards on resume
4. UMAP – pre-computed once, reused across all HDBSCAN configs
5. BERTopic – 6 configurations: min_cluster_size ∈ {50, 100, 200} × method ∈ {eom, leaf}
6. Topic reduction – c-TF-IDF agglomerative clustering to 100, 50, and 25 topics
7. Evaluation – NPMI coherence, topic diversity, outlier rates
8. Export – keyword tables, representative docs, summary CSVs
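The light-cleaning stage can be illustrated with a minimal regex sketch. The function name, replacement tokens, and exact patterns here are illustrative assumptions, not the pipeline's actual code:

```python
import re

def light_clean(text: str) -> str:
    """Illustrative cleaning: anonymise URLs, u/usernames, and r/subreddit
    mentions, then strip basic markdown emphasis (assumed patterns)."""
    text = re.sub(r"https?://\S+", "[URL]", text)          # URLs
    text = re.sub(r"\bu/[A-Za-z0-9_-]+", "[USER]", text)   # usernames
    text = re.sub(r"\br/[A-Za-z0-9_]+", "[SUB]", text)     # subreddit mentions
    text = re.sub(r"[*_~`]+", "", text)                    # markdown emphasis
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace

print(light_clean("Check **r/GOONED** via https://redd.it/x by u/someone"))
# Check [SUB] via [URL] by [USER]
```

Order matters: URLs are replaced first so that `r/` fragments inside links are not mistaken for subreddit mentions.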

**Reproducibility:** Random seed = 42 throughout. A `reproducibility_log.json` is written to the run folder with all settings and package versions.
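A log of this kind can be produced with the standard library alone. The field names below are assumptions about what such a log might contain, not the pipeline's actual schema:

```python
import json
import platform
import tempfile
from pathlib import Path

def write_repro_log(run_dir: str, settings: dict) -> Path:
    """Write run settings plus basic environment info to reproducibility_log.json."""
    log = {"python_version": platform.python_version(), **settings}
    path = Path(run_dir) / "reproducibility_log.json"
    path.write_text(json.dumps(log, indent=2))
    return path

# Record the documented settings for a run (a temp dir stands in for the run folder).
run_dir = tempfile.mkdtemp()
p = write_repro_log(run_dir, {"random_seed": 42,
                              "embedding_model": "all-MiniLM-L6-v2"})
print(json.loads(p.read_text())["random_seed"])  # 42
```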

#### 3c. Optional API-based topic labelling (topic_api_labeling.qmd)

Sends reduced topic summaries (keywords + representative texts) to an LLM API to generate human-readable labels. It does NOT send raw corpus text.

Set your API key, then render:

```sh
export OPENAI_API_KEY="your-key-here"
quarto render "topic analysis/topic_api_labeling.qmd"
```

Outputs are saved to `topic analysis/runs/<run-tag>/topics/api/`.


### Step 4: R topic visualisations (goon_topic_analysis.qmd)

**Purpose:** Loads the CSV/Parquet outputs from the BERTopic run and produces exploratory visualisations: topic-size bar chart, subreddit × topic heatmap, topic prevalence over time, post vs comment split, representative documents.

R packages required:

```r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
                   "ggplot2", "stringr", "here", "arrow"))
```

Run:

```sh
quarto render goon_topic_analysis.qmd
```

Prerequisites: Step 3 must have completed and outputs must exist under `topic analysis/runs/full_run_v1/topics/`.


## Execution order summary

```
Step 1  →  data_prep.qmd            (R)       ~30 min
Step 2  →  goon_analysis.qmd        (R)       ~10 min
Step 3  →  topic_analysis.qmd       (Python)  ~6–48 hours (full corpus)
Step 3b →  topic_api_labeling.qmd   (Python)  ~5 min + API cost (optional)
Step 4  →  goon_topic_analysis.qmd  (R)       ~2 min
```

Steps 2 and 3 are independent of each other and can run in parallel.


## Key modelling decisions

| Decision | Choice | Rationale |
|---|---|---|
| Embedding model | all-MiniLM-L6-v2 | Fast, runs on CPU; 384 dimensions sufficient for topic structure |
| UMAP n_neighbors | 15 | BERTopic default; balances local vs global structure |
| UMAP n_components | 5 | Low enough for HDBSCAN to work well |
| HDBSCAN min_cluster_size | 50, 100, 200 | Tested all three; mcs=100 eom selected as reference |
| Topic reduction method | c-TF-IDF agglomerative | Merges semantically similar topics rather than splitting clusters |
| Reduction targets | 100, 50, 25 | 50 selected for reporting (NPMI=0.27, diversity=0.74) |
| Preprocessing | Minimal | Preserves informal language and slang; CountVectorizer handles casing/stopwords |
| Random seed | 42 | Applied to UMAP, HDBSCAN sampling, and document-cap sampling |
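Topic diversity, one of the two reported evaluation metrics (0.74 for the 50-topic model), is commonly computed as the fraction of unique words across all topics' top-n keywords. A minimal sketch, with made-up keyword lists for illustration:

```python
def topic_diversity(topic_keywords: list[list[str]]) -> float:
    """Fraction of unique words among all topics' top-n keywords.
    1.0 means no keyword is shared between topics."""
    all_words = [w for topic in topic_keywords for w in topic]
    return len(set(all_words)) / len(all_words)

topics = [
    ["edge", "session", "goon", "hours"],
    ["advice", "quit", "relapse", "hours"],  # shares "hours" with topic 0
]
print(topic_diversity(topics))  # 7 unique / 8 total = 0.875
```

Higher values indicate topics with more distinctive vocabularies; shared keywords across many topics pull the score down.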

## Known issue: r/GOONED missing from BERTopic results

r/GOONED is the dominant subreddit in the corpus by a large margin:

| | Count |
|---|---|
| r/GOONED posts | 2,765,119 |
| r/GOONED comments | 15,493,075 |
| r/GOONED total | 18,258,194 |
| Full cleaned corpus | 22,011,124 |
| r/GOONED share | 82.9% |

Despite this, r/GOONED is entirely absent from the `topic analysis/runs/full_run_v1/` outputs. The cloud VM that ran the BERTopic pipeline did not have the GOONED_comments.csv and GOONED_submissions.csv files available (most likely because their combined size of ~7 GB made transfer impractical), and the corpus_clean.parquet on the VM was generated without them.

**Consequence:** All topic modelling results represent 29 subreddits (3.75M records) rather than the full 30-subreddit corpus (22M records). Topic proportions, dominant themes, and subreddit distribution tables are therefore not representative of the full corpus.
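The counts above are internally consistent, as a quick arithmetic check shows:

```python
gooned_total = 2_765_119 + 15_493_075  # r/GOONED posts + comments
full_corpus = 22_011_124               # full cleaned corpus

print(gooned_total)                                 # 18,258,194 - matches the table
print(round(100 * gooned_total / full_corpus, 1))   # 82.9 (% share)
print(full_corpus - gooned_total)                   # 3,752,930 - the "3.75M records"
```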

**To fix:** Ensure the GOONED_*.csv files (or the pre-built comments.parquet / posts.parquet) are available on the compute environment, then re-run:

```sh
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v2
```

Note: the R-side analyses (goon_analysis.qmd) are not affected by this issue – they read directly from the local `data/posts.parquet` and `data/comments.parquet` files, which do contain r/GOONED data.


## Known limitations

1. **Outlier rate:** 37.3% of documents are assigned to the outlier topic (-1) by HDBSCAN. This is typical for short-text social-media corpora. Outliers are excluded from topic analyses but retained in the corpus Parquet files.

2. **Flair ambiguity:** Gender/sexuality classifiers rely on voluntarily set flair strings. Flair adoption is uneven across subreddits and users, introducing selection bias. Short regex patterns such as `\bt\b` may also match unintended strings (e.g. the US state abbreviation TX).

3. **Deleted content:** Posts and comments marked [deleted] or [removed] are excluded from modelling but counted separately. These may disproportionately represent controversial content.

4. **Temporal coverage:** Coverage varies by subreddit. Some communities only appear in later years (2024–2025); others span the full 2019–2025 window.

5. **Platform-specific norms:** Moderation rules, flair conventions, and posting styles differ across subreddits, which may shape topics in ways that are not generalisable.

6. **Unobserved participants:** Lurkers, banned users, and deleted accounts are not captured.
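The outlier rate in limitation 1 is simply the share of documents whose assigned topic is -1. Given a doc-topic assignment column, it can be recomputed as follows (the assignment values here are made up for illustration):

```python
def outlier_rate(topic_assignments: list[int]) -> float:
    """Fraction of documents HDBSCAN left unassigned (topic -1)."""
    return sum(t == -1 for t in topic_assignments) / len(topic_assignments)

# Toy example: 3 of 8 documents are outliers.
print(outlier_rate([-1, 0, 4, -1, 2, 0, -1, 7]))  # 0.375
```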


## Output files (full_run_v1)

| File | Description |
|---|---|
| topics/final_topic_summary.csv | 50 reduced topics with size, keywords, representative texts |
| topics/final_model_comparison.csv | Coherence, diversity, outlier rate for all 9 runs |
| topics/evaluation_table.csv | Same as above, alternate format |
| topics/bertopic_run_summary.csv | Initial topic counts across 6 HDBSCAN configurations |
| topics/topic_by_subreddit.csv | Topic × subreddit document counts |
| topics/topic_by_month.csv | Topic × month document counts |
| topics/topic_by_doctype.csv | Topic × doc type (post/comment) |
| topics/preprocessing_decisions.json | Logged cleaning decisions |
| topics/api/ | API-generated labels, summaries, category annotations |
| bertopic/ | Saved BERTopic models + doc-topic parquet files for all 6 initial runs |

## Contacts and citation

This analysis was conducted as part of a preliminary empirical study of online gooning communities. If replicating, please cite the original study and note the random seed, embedding model, and reduction target used.