Aryan047 committed
Commit f9e8817 · verified · 1 Parent(s): 61cbfa4

Deploy meme-vs-event Streamlit app

Files changed (4):
  1. Dockerfile +18 -14
  2. README.md +30 -13
  3. app.py +250 -0
  4. requirements.txt +5 -3
Dockerfile CHANGED
@@ -1,20 +1,24 @@
- FROM python:3.13.5-slim
-
- WORKDIR /app
-
- RUN apt-get update && apt-get install -y \
-     build-essential \
-     curl \
-     git \
-     && rm -rf /var/lib/apt/lists/*
-
- COPY requirements.txt ./
- COPY src/ ./src/
-
- RUN pip3 install -r requirements.txt
-
- EXPOSE 8501
-
- HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
-
- ENTRYPOINT ["streamlit", "run", "src/streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
+ FROM python:3.11-slim
+
+ ENV PYTHONUNBUFFERED=1 \
+     PIP_NO_CACHE_DIR=1 \
+     PIP_DISABLE_PIP_VERSION_CHECK=1 \
+     HF_HOME=/home/user/.cache/huggingface
+
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV PATH="/home/user/.local/bin:$PATH"
+ WORKDIR /home/user/app
+
+ COPY --chown=user:user requirements.txt .
+ RUN pip install --user --no-cache-dir -r requirements.txt
+
+ COPY --chown=user:user app.py .
+
+ EXPOSE 7860
+
+ CMD ["streamlit", "run", "app.py", \
+      "--server.port=7860", \
+      "--server.address=0.0.0.0", \
+      "--server.headless=true", \
+      "--browser.gatherUsageStats=false"]
README.md CHANGED
@@ -1,20 +1,37 @@
  ---
- title: Dynamic Event Detector
- emoji: 🚀
- colorFrom: red
- colorTo: red
+ title: Meme vs Real Event Classifier
+ colorFrom: blue
+ colorTo: indigo
  sdk: docker
- app_port: 8501
- tags:
-   - streamlit
+ app_port: 7860
  pinned: false
- short_description: This model distinguishes between a "real event" and "meme"
- license: mit
+ license: apache-2.0
  ---

- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
+ # Meme vs Real Event Tweet Classifier
+
+ Streamlit demo for a fine-tuned `bert-base-uncased` model that classifies a
+ tweet as a **meme / low-signal post** or a **real-world event**.
+
+ The model weights live in a separate Hugging Face model repo and are loaded
+ at startup via `transformers.AutoModelForSequenceClassification.from_pretrained`.
+
+ ## Configure the model repo
+
+ The app reads the model id from the `MODEL_ID` environment variable, defaulting
+ to `Aryan047/Dynamic-event-detector`. To override it in the Space UI, go to
+ **Settings -> Variables and secrets** and set `MODEL_ID` to any other model repo.
+
+ ## Local development
+
+ ```bash
+ pip install -r requirements.txt
+ streamlit run app.py
+ ```
+
+ ## Files
+
+ - `app.py` - Streamlit application (single-tweet tab, batch-CSV tab)
+ - `requirements.txt` - runtime dependencies
+ - `upload_model.py` - one-shot helper to push `artifacts_meme_vs_event/bert_classifier/`
+   to a new Hugging Face model repo. Not used by the Space itself.
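
`upload_model.py` itself is not included in this commit, so its exact code is unknown; a plausible one-shot sketch using `huggingface_hub`, with the repo id assumed to match the default `MODEL_ID`:

```python
# Hypothetical reconstruction of upload_model.py -- the real file is not in
# this commit, so the repo id and local path are assumptions from the README.
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` or HF_TOKEN
api.create_repo("Aryan047/Dynamic-event-detector", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="artifacts_meme_vs_event/bert_classifier",
    repo_id="Aryan047/Dynamic-event-detector",
    repo_type="model",
)
```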
app.py ADDED
@@ -0,0 +1,250 @@
+ """Streamlit Space: Meme vs Real Event tweet classifier.
+
+ Loads a fine-tuned bert-base-uncased from the Hugging Face Hub and exposes:
+ - Single-tweet tab: live prediction + probability bar chart
+ - Batch CSV tab: upload a CSV with a `text` column, download predictions
+
+ Matching preprocessing (same regex as the training notebook) is reapplied
+ so results mirror what the notebook produces locally.
+ """
+
+ from __future__ import annotations
+
+ import io
+ import os
+ import re
+
+ import numpy as np
+ import pandas as pd
+ import streamlit as st
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ MODEL_ID = os.environ.get("MODEL_ID", "Aryan047/Dynamic-event-detector")
+ MAX_LENGTH = 128
+ LABELS = {0: "meme", 1: "real_event"}
+
+ _URL_RE = re.compile(r"https?://\S+|www\.\S+")
+ _MENTION_RE = re.compile(r"@\w+")
+ _HASHTAG_RE = re.compile(r"#")
+ _NON_WORD_RE = re.compile(r"[^a-z0-9\s]")
+ _WS_RE = re.compile(r"\s+")
+
+
+ def clean_tweet(text: str) -> str:
+     if not isinstance(text, str):
+         return ""
+     t = text.lower()
+     t = _URL_RE.sub(" ", t)
+     t = _MENTION_RE.sub(" ", t)
+     t = _HASHTAG_RE.sub(" ", t)
+     t = _NON_WORD_RE.sub(" ", t)
+     t = _WS_RE.sub(" ", t).strip()
+     return t
+
+
+ @st.cache_resource(show_spinner="Loading model from Hugging Face Hub...")
+ def load_model(model_id: str):
+     tokenizer = AutoTokenizer.from_pretrained(model_id)
+     model = AutoModelForSequenceClassification.from_pretrained(model_id)
+     model.eval()
+     return tokenizer, model
+
+
+ @torch.no_grad()
+ def predict_one(tokenizer, model, text: str) -> dict:
+     cleaned = clean_tweet(text)
+     if not cleaned:
+         return {
+             "label": "meme",
+             "confidence": 0.0,
+             "prob_meme": 1.0,
+             "prob_real_event": 0.0,
+             "clean_text": "",
+         }
+     enc = tokenizer(cleaned, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")
+     probs = F.softmax(model(**enc).logits[0], dim=-1).numpy()
+     pred = int(np.argmax(probs))
+     return {
+         "label": LABELS[pred],
+         "confidence": float(probs[pred]),
+         "prob_meme": float(probs[0]),
+         "prob_real_event": float(probs[1]),
+         "clean_text": cleaned,
+     }
+
+
+ @torch.no_grad()
+ def predict_many(tokenizer, model, texts: list[str], batch_size: int = 32) -> pd.DataFrame:
+     cleaned = [clean_tweet(t) for t in texts]
+     labels, confs, p0s, p1s = [], [], [], []
+
+     progress = st.progress(0.0, text="Running predictions...")
+     total = max(len(cleaned), 1)
+
+     for i in range(0, len(cleaned), batch_size):
+         chunk = cleaned[i : i + batch_size]
+         empty_mask = [len(c) == 0 for c in chunk]
+         model_inputs = [c if c else "empty" for c in chunk]
+
+         enc = tokenizer(
+             model_inputs,
+             truncation=True,
+             padding=True,
+             max_length=MAX_LENGTH,
+             return_tensors="pt",
+         )
+         probs = F.softmax(model(**enc).logits, dim=-1).numpy()
+
+         for j, p in enumerate(probs):
+             if empty_mask[j]:
+                 labels.append("meme")
+                 confs.append(0.0)
+                 p0s.append(1.0)
+                 p1s.append(0.0)
+             else:
+                 pred = int(np.argmax(p))
+                 labels.append(LABELS[pred])
+                 confs.append(float(p[pred]))
+                 p0s.append(float(p[0]))
+                 p1s.append(float(p[1]))
+
+         progress.progress(min((i + batch_size) / total, 1.0))
+
+     progress.empty()
+
+     return pd.DataFrame(
+         {
+             "text": texts,
+             "clean_text": cleaned,
+             "label": labels,
+             "confidence": confs,
+             "prob_meme": p0s,
+             "prob_real_event": p1s,
+         }
+     )
+
+
+ def render_single_tab(tokenizer, model) -> None:
+     st.subheader("Classify a single tweet")
+     st.caption("Paste any tweet-style text. Labels: `meme` or `real_event`.")
+
+     default_example = "Massive 6.5 earthquake just rocked Istanbul, buildings swaying"
+     text = st.text_area("Tweet text", value=default_example, height=120)
+
+     if st.button("Predict", type="primary"):
+         if not text.strip():
+             st.warning("Please enter some text.")
+             return
+
+         result = predict_one(tokenizer, model, text)
+
+         col1, col2 = st.columns(2)
+         col1.metric("Predicted label", result["label"])
+         col2.metric("Confidence", f"{result['confidence']:.2%}")
+
+         st.markdown("**Class probabilities**")
+         st.bar_chart(
+             pd.DataFrame(
+                 {"probability": [result["prob_meme"], result["prob_real_event"]]},
+                 index=["meme", "real_event"],
+             )
+         )
+
+         with st.expander("Details"):
+             st.write({"cleaned_text": result["clean_text"]})
+
+
+ def render_batch_tab(tokenizer, model) -> None:
+     st.subheader("Classify a CSV of tweets")
+     st.caption("Upload a CSV with a `text` column. Predictions are added as new columns.")
+
+     uploaded = st.file_uploader("CSV file", type=["csv"])
+     if uploaded is None:
+         st.info("Waiting for a CSV upload...")
+         return
+
+     try:
+         df = pd.read_csv(uploaded)
+     except Exception as exc:
+         st.error(f"Could not read CSV: {exc}")
+         return
+
+     if "text" not in df.columns:
+         st.error(f"CSV must contain a `text` column. Found: {list(df.columns)}")
+         return
+
+     max_rows = 5000
+     if len(df) > max_rows:
+         st.warning(f"CSV has {len(df)} rows. Truncating to first {max_rows} for the demo.")
+         df = df.head(max_rows).copy()
+
+     st.write(f"Loaded {len(df)} rows. Preview:")
+     st.dataframe(df.head(5))
+
+     if st.button("Run batch prediction", type="primary"):
+         out = predict_many(tokenizer, model, df["text"].tolist())
+         merged = pd.concat(
+             [df.reset_index(drop=True).drop(columns=["text"]), out.reset_index(drop=True)],
+             axis=1,
+         )
+
+         st.success(f"Classified {len(merged)} tweets.")
+         st.dataframe(merged.head(50))
+
+         counts = merged["label"].value_counts().reindex(["meme", "real_event"], fill_value=0)
+         st.markdown("**Label distribution**")
+         st.bar_chart(counts)
+
+         buf = io.StringIO()
+         merged.to_csv(buf, index=False)
+         st.download_button(
+             label="Download predictions CSV",
+             data=buf.getvalue(),
+             file_name="meme_vs_event_predictions.csv",
+             mime="text/csv",
+         )
+
+
+ def main() -> None:
+     st.set_page_config(
+         page_title="Meme vs Real Event Classifier",
+         page_icon="",
+         layout="centered",
+     )
+
+     st.title("Meme vs Real Event Tweet Classifier")
+     st.caption(
+         f"Fine-tuned `bert-base-uncased` loaded from "
+         f"[`{MODEL_ID}`](https://huggingface.co/{MODEL_ID})."
+     )
+
+     tokenizer, model = load_model(MODEL_ID)
+
+     single_tab, batch_tab, about_tab = st.tabs(["Single tweet", "Batch CSV", "About"])
+
+     with single_tab:
+         render_single_tab(tokenizer, model)
+
+     with batch_tab:
+         render_batch_tab(tokenizer, model)
+
+     with about_tab:
+         st.markdown(
+             """
+             **Pipeline**: tweets were embedded with `all-mpnet-base-v2`, clustered with
+             BERTopic, cross-checked against the GDELT DOC 2.0 API with a lifespan-aware
+             rule, and the resulting `(tweet, label)` pairs were used to fine-tune
+             `bert-base-uncased`.
+
+             - **Input**: raw tweet text
+             - **Preprocessing**: lowercase, strip URLs / mentions / hashtag chars / non-word
+             - **Max length**: 128 tokens
+             - **Labels**: `0 = meme`, `1 = real_event`
+             """
+         )
+
+
+ if __name__ == "__main__":
+     main()
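
Because the preprocessing is plain regex code, it can be sanity-checked without downloading the model. A quick local smoke test, assuming `app.py` is importable from the working directory (importing it pulls in streamlit/torch but does not load the model; this is not something the Space itself runs):

```python
# Local smoke test for clean_tweet, assuming app.py is on the import path.
from app import clean_tweet

# URLs, @mentions, and punctuation are stripped; the '#' character is removed
# but the hashtag word itself survives.
assert clean_tweet("BREAKING: fire near docks!! https://t.co/x @user #Istanbul") \
    == "breaking fire near docks istanbul"
assert clean_tweet(None) == ""  # non-strings map to the empty string
```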
requirements.txt CHANGED
@@ -1,3 +1,5 @@
- altair
- pandas
- streamlit
+ streamlit>=1.36.0
+ torch>=2.1.0
+ transformers>=4.40.0
+ pandas>=2.0.0
+ numpy>=1.24.0