---
license: mit
datasets:
- thomasrenault/us_tweet_speech_congress
language:
- en
tags:
- text-classification
- multi-label-classification
- topic-classification
- political-text
- tweets
- distilbert
- active-learning
pipeline_tag: text-classification
---

A multi-label political topic classifier fine-tuned on US political tweets and congressional speeches.
Built on `distilbert-base-uncased` using an **active learning** pipeline with GPT-4o-mini annotation.

## Labels

The model predicts **7 independent topic indicators** (one sigmoid per label, threshold 0.5).
A document can belong to **zero or multiple topics simultaneously**.

| Label | Description |
|---|---|
| `abortion` | Abortion rights and reproductive policy |
| `democracy` | Elections, voting rights, democratic institutions |
| `gender equality` | Gender rights, feminism, LGBTQ+ issues |
| `gun control` | Firearms regulation, Second Amendment |
| `immigration` | Immigration policy, border control, citizenship |
| `tax and inequality` | Tax policy, economic inequality, redistribution |
| `trade` | Trade policy, tariffs, international commerce |

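Because each label has its own sigmoid, the probabilities are not forced to sum to 1, so the decision is made independently per topic. A minimal sketch of that thresholding rule (the logit values below are illustrative, not real model outputs):

```python
import torch

TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]

def labels_from_logits(logits: torch.Tensor, threshold: float = 0.5) -> list[str]:
    # One sigmoid per label: each topic is an independent yes/no decision,
    # so a document can trigger zero, one, or several topics at once.
    probs = torch.sigmoid(logits)
    return [t for t, p in zip(TOPICS, probs.tolist()) if p >= threshold]

print(labels_from_logits(torch.tensor([2.0, -3.0, -1.0, 1.5, -2.0, -4.0, -3.0])))
# ['abortion', 'gun control']  -- two topics fire at once

print(labels_from_logits(torch.tensor([-3.0] * 7)))
# []  -- no topic clears the threshold
```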
## Training

| Setting | Value |
|---|---|
| Base model | `distilbert-base-uncased` |
| Architecture | `DistilBertForSequenceClassification` (multi-label) |
| Problem type | `multi_label_classification` |
| Training data | ~100,000 labeled documents (early checkpoint) |
| Annotation | GPT-4o-mini (temperature = 0) via the OpenAI Batch API |
| Strategy | Active learning (uncertainty sampling) |
| Seed size | 1,000 documents (random) |
| AL query size | 25,000 documents / round |
| Epochs (seed) | 4 |
| Epochs (AL) | 2 (warm start) |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Max length | 512 tokens |
| Classification threshold | 0.5 |
| Domain | US political tweets and congressional floor speeches |

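The uncertainty-sampling step in the table above can be sketched as follows. This is a simplified illustration, not the project's actual code: it uses one common multi-label scoring rule, averaging each label's closeness to 0.5 across the 7 sigmoids, and picks the documents the model is least sure about for the next annotation round.

```python
import torch

def uncertainty_scores(probs: torch.Tensor) -> torch.Tensor:
    # A sigmoid probability near 0.5 means the model is unsure about that
    # label. Distance from 0.5 measures confidence, so 0.5 minus that
    # distance, averaged over the 7 labels, scores a document's uncertainty.
    return (0.5 - (probs - 0.5).abs()).mean(dim=-1)

def select_query(probs: torch.Tensor, k: int) -> torch.Tensor:
    # Indices of the k most uncertain documents to send for annotation.
    return torch.topk(uncertainty_scores(probs), k).indices

# Toy unlabeled pool: 4 documents x 7 label probabilities.
pool = torch.tensor([
    [0.99, 0.01, 0.02, 0.98, 0.01, 0.01, 0.02],  # confident
    [0.55, 0.48, 0.51, 0.47, 0.52, 0.49, 0.50],  # very uncertain
    [0.90, 0.10, 0.60, 0.05, 0.95, 0.02, 0.08],  # mixed
    [0.01, 0.02, 0.01, 0.03, 0.02, 0.01, 0.99],  # confident
])
print(select_query(pool, k=2).tolist())
# [1, 2]  -- the two least confident documents
```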
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "thomasrenault/topic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]
THRESHOLD = 0.5

def predict(text):
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.sigmoid(model(**enc).logits).squeeze().tolist()
    matched = [t for t, p in zip(TOPICS, probs) if p >= THRESHOLD]
    return matched or ["other topic"]

print(predict("We need stronger border security and immigration reform."))
# ["immigration"]

print(predict("Tax cuts for the wealthy only increase inequality in America."))
# ["tax and inequality"]
```

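For larger corpora, scoring one document at a time is slow; padded mini-batches let the model process many tweets per forward pass. A sketch of that pattern, reusing the tokenizer and model loaded above (the helper name and batch size are illustrative choices, not part of this repository):

```python
import torch

TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]

def predict_batch(texts, tokenizer, model, threshold=0.5, batch_size=32):
    # Pad each mini-batch so variable-length documents share one tensor,
    # then threshold every label's sigmoid independently per document.
    out = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                        padding=True, truncation=True, max_length=512)
        with torch.no_grad():
            probs = torch.sigmoid(model(**enc).logits)
        out += [[t for t, p in zip(TOPICS, row.tolist()) if p >= threshold]
                for row in probs]
    return out
```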
## Intended Use

- Academic research on political agenda-setting and issue salience
- Topic trend analysis across congressional speeches and social media
- Cross-platform comparison of elite vs. citizen political communication

## Limitations

- Trained on **US English political text**; it may not generalise to other political systems or languages
- Annotation by GPT-4o-mini introduces model-specific biases in topic boundaries
- Early training checkpoint (round 0, ~1,600 documents); performance should improve as active learning progresses
- The topics reflect the research agenda of the parent project; other salient topics (healthcare, climate, etc.) are out of scope

## Citation

If you use this model, please cite:

```bibtex
@article{algan2026emotions,
  title={Emotions and policy views},
  author={Algan, Y. and Davoine, E. and Renault, T. and Stantcheva, S.},
  year={2026}
}
```