+---
+license: mit
+datasets:
+- thomasrenault/us_tweet_speech_congress
+language:
+- en
+tags:
+- text-classification
+- multi-label-classification
+- topic-classification
+- political-text
+- tweets
+- distilbert
+- active-learning
+pipeline_tag: text-classification
+---
+A multi-label political topic classifier fine-tuned on US political tweets and congressional speeches.
+Built on `distilbert-base-uncased` using an **active learning** pipeline with GPT-4o-mini annotation.
+## Labels
+The model outputs **7 independent topic probabilities**, one sigmoid per label; a topic is assigned when its probability is at least 0.5.
+A document can therefore match **zero, one, or several topics** at once.
+| Label | Description |
+|---|---|
+| `abortion` | Abortion rights and reproductive policy |
+| `democracy` | Elections, voting rights, democratic institutions |
+| `gender equality` | Gender rights, feminism, LGBTQ+ issues |
+| `gun control` | Firearms regulation, Second Amendment |
+| `immigration` | Immigration policy, border control, citizenship |
+| `tax and inequality` | Tax policy, economic inequality, redistribution |
+| `trade` | Trade policy, tariffs, international commerce |
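
To make the multi-label decoding concrete, the following minimal sketch applies an independent sigmoid to each of the seven logits and keeps the topics at or above 0.5. The logit values are made up for illustration, `decode_topics` is a hypothetical helper, and the label order is assumed to follow the `TOPICS` list in the Usage section.

```python
import math

# Label order assumed to match the TOPICS list in the Usage section.
TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]


def sigmoid(z):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def decode_topics(logits, threshold=0.5):
    """Hypothetical helper: keep every topic whose probability reaches the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [topic for topic, p in zip(TOPICS, probs) if p >= threshold]


# Made-up logits, not real model output.
two_topics = decode_topics([-3.0, 2.0, -1.5, -4.0, 1.2, -2.0, -0.5])
no_topic = decode_topics([-2.0] * 7)
print(two_topics)  # ['democracy', 'immigration']
print(no_topic)    # []
```

Because each label has its own sigmoid, the probabilities do not sum to 1, which is what allows an empty result or several topics at once.
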
+## Training
+| Setting | Value |
+|---|---|
+| Base model | `distilbert-base-uncased` |
+| Architecture | `DistilBertForSequenceClassification` (multi-label) |
+| Problem type | `multi_label_classification` |
+| Training data | ~100,000 labeled documents (early checkpoint) |
+| Annotation | GPT-4o-mini (temperature=0) via OpenAI Batch API |
+| Strategy | Active learning (uncertainty sampling) |
+| Seed size | 1,000 documents (random) |
+| AL query size | 25,000 documents / round |
+| Epochs (seed) | 4 |
+| Epochs (AL) | 2 (warm-start) |
+| Learning rate | 2e-5 |
+| Batch size | 16 |
+| Max length | 512 tokens |
+| Classification threshold | 0.5 |
+| Domain | US political tweets and congressional floor speeches |
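
The table names uncertainty sampling but does not say which uncertainty measure is used. The sketch below assumes one common choice for multi-label models: score a document by how close its most ambiguous label probability is to 0.5, then query the highest-scoring documents. The function names and the toy probabilities are hypothetical, and `labeled_pool_size` further assumes that every queried document is annotated and kept.

```python
SEED_SIZE = 1_000    # from the table: random seed set
QUERY_SIZE = 25_000  # from the table: documents queried per AL round


def uncertainty_scores(prob_rows):
    """Assumed measure: 1.0 if some label sits exactly at 0.5, 0.0 if all labels are at 0 or 1."""
    return [max(1.0 - 2.0 * abs(p - 0.5) for p in row) for row in prob_rows]


def select_queries(prob_rows, query_size):
    """Return indices of the `query_size` most uncertain documents."""
    scores = uncertainty_scores(prob_rows)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:query_size]


def labeled_pool_size(rounds):
    """Labeled documents after `rounds` AL rounds, assuming all queries are kept."""
    return SEED_SIZE + rounds * QUERY_SIZE


# Toy per-label probabilities for four unlabeled documents (3 labels for brevity).
pool = [
    [0.98, 0.01, 0.02],  # confident
    [0.55, 0.10, 0.05],  # one label near the threshold
    [0.40, 0.60, 0.03],  # two labels fairly ambiguous
    [0.01, 0.02, 0.99],  # confident
]
picked = select_queries(pool, 2)
print(picked)                # [1, 2]
print(labeled_pool_size(4))  # 101000
```

Under these assumptions, four rounds on top of the seed set would yield 101,000 labeled documents, which is in line with the ~100,000 figure in the table.
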
+## Usage
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+model_id = "thomasrenault/topic"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+model.eval()  # inference mode: disables dropout
+
+# Label order must match the order of the model's output logits.
+TOPICS = ["abortion", "democracy", "gender equality", "gun control",
+          "immigration", "tax and inequality", "trade"]
+THRESHOLD = 0.5
+
+
+def predict(text):
+    """Return every topic whose sigmoid probability reaches THRESHOLD."""
+    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+    with torch.no_grad():
+        # One independent sigmoid per topic; squeeze(0) drops the batch dimension.
+        probs = torch.sigmoid(model(**enc).logits).squeeze(0).tolist()
+    matched = [t for t, p in zip(TOPICS, probs) if p >= THRESHOLD]
+    return matched or ["other topic"]  # fallback when no topic passes the threshold
+
+
+print(predict("We need stronger border security and immigration reform."))
+# ['immigration']
+print(predict("Tax cuts for the wealthy only increase inequality in America."))
+# ['tax and inequality']
+```
+## Intended Use
+- Academic research on political agenda-setting and issue salience
+- Topic trend analysis across congressional speeches and social media
+- Cross-platform comparison of elite vs. citizen political communication
+## Limitations
+- Trained on **US English political text**; may not generalise to other political systems or languages
+- Annotation by GPT-4o-mini introduces model-specific biases in topic boundaries
+- Early training checkpoint (round 0, ~1,600 documents); performance is expected to improve as active learning progresses
+- Topics reflect the specific research agenda of the parent project; other salient topics (healthcare, climate, etc.) are out of scope
+## Citation
+If you use this model, please cite:
+```
+@article{algan2026emotions,
+  title={Emotions and policy views},
+  author={Algan, Y. and Davoine, E. and Renault, T. and Stantcheva, S.},
+  year={2026}
+}
+```