Pyotr Lisov committed
Commit: 70b2ea0
Parent(s): ae1c8b3
Add article classifier app
Browse files:
- .streamlit/config.toml +7 -0
- README.md +114 -13
- app.py +204 -0
- artifacts/large_model/best_model/config.json +52 -0
- artifacts/large_model/best_model/model.safetensors +3 -0
- artifacts/large_model/best_model/tokenizer.json +0 -0
- artifacts/large_model/best_model/tokenizer_config.json +14 -0
- artifacts/large_model/best_model/training_args.bin +3 -0
- artifacts/large_model/metrics.json +22 -0
- configs/app_config.json +11 -0
- data/processed_large/label_mapping.json +12 -0
- inference.py +156 -0
- requirements.txt +5 -3
.streamlit/config.toml
ADDED
@@ -0,0 +1,7 @@
+[theme]
+base = "light"
+primaryColor = "#1D4ED8"
+backgroundColor = "#F7FAFC"
+secondaryBackgroundColor = "#E8F1F8"
+textColor = "#102A43"
+font = "sans serif"
README.md
CHANGED
@@ -1,20 +1,121 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
-sdk:
-
-
-- streamlit
+title: arXiv Topic Classifier
+emoji: 📚
+colorFrom: blue
+colorTo: green
+sdk: streamlit
+sdk_version: 1.33.0
+app_file: app.py
 pinned: false
-short_description: A small bert-based model for classifying articles from Arxiv
 license: mit
+short_description: Transformer-powered topic classification for arXiv papers
 ---
 
-#
-
-
-
+# arXiv Topic Classifier
+
+`arXiv Topic Classifier` is a Streamlit app for classifying research papers into arXiv-style topic categories from the paper title and abstract. The interface accepts the two fields separately, supports title-only inference, and returns the smallest prefix of labels whose cumulative probability exceeds 95%.
+
+The project is designed as a lightweight end-to-end ML application: collect data, fine-tune a transformer classifier, package the trained model with local inference code, and expose the result through a public web interface.
+
+## Features
+
+- topic prediction from `title` and `abstract`
+- inference from `title` only when abstract is missing
+- top-95% cumulative probability output
+- full ranked list of class probabilities
+- cached model loading for faster repeated requests
+- self-contained deployment with local model weights
+
+## Categories
+
+The current model predicts 10 categories:
+
+- `astro-ph.GA`
+- `cond-mat.mtrl-sci`
+- `cs.CL`
+- `cs.CV`
+- `cs.RO`
+- `econ.EM`
+- `math.PR`
+- `physics.optics`
+- `q-bio.BM`
+- `quant-ph`
+
+## Model
+
+The production model is based on `distilbert-base-uncased` fine-tuned for multi-class text classification.
+
+Configuration:
+
+- max sequence length: `256`
+- epochs: `3`
+- learning rate: `2e-5`
+
+The model consumes a single formatted text built from the input fields:
+
+```text
+title: <paper title> abstract: <paper abstract>
+```
+
+If the abstract is missing, inference falls back to:
+
+```text
+title: <paper title>
+```
+
+## Dataset
+
+The dataset was collected from the arXiv API and processed into train, validation, and test splits.
+
+Prepared split sizes:
+
+- train: `3120`
+- validation: `391`
+- test: `388`
+
+## Metrics
+
+Evaluation metrics from the bundled model artifact:
+
+- validation accuracy: `0.8696`
+- validation macro-F1: `0.8696`
+- test accuracy: `0.8789`
+- test macro-F1: `0.8769`
+
+## Local Run
+
+Install dependencies:
+
+```bash
+python3 -m pip install -r requirements.txt
+```
+
+Start the app:
+
+```bash
+streamlit run app.py --server.port 8080
+```
+
+## Repository Layout
+
+- `app.py` - Streamlit UI
+- `inference.py` - model loading and inference pipeline
+- `configs/app_config.json` - runtime configuration
+- `artifacts/large_model/best_model/` - trained model weights and tokenizer
+- `artifacts/large_model/metrics.json` - evaluation metrics
+- `data/processed_large/label_mapping.json` - label mapping used by inference
+
+## Deployment
+
+This repository is prepared for Hugging Face Spaces with `sdk: streamlit`. The app runs directly from local artifacts and does not require downloading model weights at runtime.
+
+## Example Use Cases
+
+- quick topic tagging for arXiv drafts
+- sanity-checking paper metadata before submission
+- exploring how transformer classifiers separate neighboring scientific fields
+
+## Notes
+
+- Predictions are limited by the training taxonomy and dataset coverage.
+- The model is intended as a lightweight demo application, not a substitute for expert annotation.
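The top-95% output rule described in the README can be sketched in a few lines. This is a standalone illustration with a hypothetical function name and made-up probabilities, not the app's exact code (the real implementation is `select_top_k_by_probability_mass` in `inference.py`):

```python
def select_by_probability_mass(predictions, threshold=0.95):
    """Take (label, probability) pairs and return the smallest prefix of the
    descending-probability ranking whose cumulative probability reaches threshold."""
    ranked = sorted(predictions, key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for label, probability in ranked:
        selected.append((label, probability))
        cumulative += probability
        if cumulative >= threshold:
            break
    return selected

# The first three labels already cover 0.97 >= 0.95, so the fourth is dropped.
print(select_by_probability_mass(
    [("cs.CV", 0.60), ("cs.RO", 0.25), ("cs.CL", 0.12), ("quant-ph", 0.03)]
))
```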
app.py
ADDED
@@ -0,0 +1,204 @@
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any
+
+import streamlit as st
+
+from inference import ArticleClassifier, ClassifierError
+
+
+PROJECT_DIR = Path(__file__).resolve().parent
+CONFIG_PATH = PROJECT_DIR / "configs" / "app_config.json"
+METRICS_PATH = PROJECT_DIR / "artifacts" / "large_model" / "metrics.json"
+DEFAULT_APP_CONFIG = {
+    "model_dir": "artifacts/large_model/best_model",
+    "labels_path": "data/processed_large/label_mapping.json",
+    "max_length": 256,
+    "coverage_threshold": 0.95,
+    "model_name": "distilbert-base-uncased",
+    "page_title": "arXiv Topic Classifier",
+    "page_icon": "📚",
+    "example_title": "Learning-based Visual Navigation for Mobile Robots",
+    "example_abstract": (
+        "We present a transformer-based navigation system that uses camera observations "
+        "and scene understanding to plan robust trajectories for indoor mobile robots."
+    ),
+}
+
+
+def load_app_config() -> dict[str, Any]:
+    if not CONFIG_PATH.exists():
+        return DEFAULT_APP_CONFIG.copy()
+
+    with CONFIG_PATH.open("r", encoding="utf-8") as fh:
+        config = json.load(fh)
+
+    merged_config = DEFAULT_APP_CONFIG.copy()
+    merged_config.update(config)
+    return merged_config
+
+
+APP_CONFIG = load_app_config()
+MODEL_DIR = PROJECT_DIR / str(APP_CONFIG["model_dir"])
+LABELS_PATH = PROJECT_DIR / str(APP_CONFIG["labels_path"])
+MAX_LENGTH = int(APP_CONFIG["max_length"])
+COVERAGE_THRESHOLD = float(APP_CONFIG["coverage_threshold"])
+
+
+st.set_page_config(
+    page_title=str(APP_CONFIG["page_title"]),
+    page_icon=str(APP_CONFIG["page_icon"]),
+    layout="centered",
+)
+
+
+@st.cache_resource
+def load_classifier() -> ArticleClassifier:
+    return ArticleClassifier(
+        model_dir=MODEL_DIR,
+        labels_path=LABELS_PATH,
+        max_length=MAX_LENGTH,
+    )
+
+
+@st.cache_data
+def load_metrics() -> dict | None:
+    if not METRICS_PATH.exists():
+        return None
+    import json
+
+    with METRICS_PATH.open("r", encoding="utf-8") as fh:
+        return json.load(fh)
+
+
+def format_probability(probability: float) -> str:
+    return f"{probability * 100:.2f}%"
+
+
+def format_threshold(threshold: float) -> str:
+    return f"{threshold * 100:.0f}%"
+
+
+def render_prediction_rows(predictions: list[dict[str, float | str]]) -> None:
+    for index, item in enumerate(predictions, start=1):
+        label = str(item["label"])
+        probability = float(item["probability"])
+        st.write(f"{index}. `{label}`")
+        st.progress(min(max(probability, 0.0), 1.0), text=format_probability(probability))
+
+
+def main() -> None:
+    coverage_label = format_threshold(COVERAGE_THRESHOLD)
+
+    st.title(str(APP_CONFIG["page_title"]))
+    st.write(
+        "This demo predicts arXiv paper topics from the title and abstract using a transformer classifier."
+    )
+    st.caption(
+        "For homework evaluation, the app returns the smallest prefix of categories whose cumulative "
+        f"probability reaches {coverage_label}."
+    )
+    st.info(
+        "How to test: paste a real or synthetic paper title, optionally add an abstract, and press "
+        "`Predict categories`. If the abstract is empty, the model will classify from the title only."
+    )
+
+    classifier: ArticleClassifier | None = None
+    classifier_load_error: str | None = None
+
+    with st.sidebar:
+        try:
+            classifier = load_classifier()
+        except Exception as exc:
+            classifier_load_error = f"Model initialization error in load_classifier: {exc}"
+
+        metrics = load_metrics()
+        st.subheader("Evaluation Summary")
+        st.write(f"Model: `{APP_CONFIG['model_name']}`")
+        if classifier is not None:
+            st.write(f"Number of classes: `{len(classifier.labels)}`")
+            st.write("Classes: " + ", ".join(f"`{label}`" for label in classifier.labels))
+        else:
+            st.error(classifier_load_error or "Model initialization error: unknown error")
+        if metrics is not None:
+            validation_accuracy = metrics.get("validation", {}).get("eval_accuracy")
+            validation_f1 = metrics.get("validation", {}).get("eval_macro_f1")
+            test_accuracy = metrics.get("test", {}).get("test_accuracy")
+            test_f1 = metrics.get("test", {}).get("test_macro_f1")
+            if validation_accuracy is not None:
+                st.write(f"Validation accuracy: `{validation_accuracy:.4f}`")
+            if validation_f1 is not None:
+                st.write(f"Validation macro-F1: `{validation_f1:.4f}`")
+            if test_accuracy is not None:
+                st.write(f"Test accuracy: `{test_accuracy:.4f}`")
+            if test_f1 is not None:
+                st.write(f"Test macro-F1: `{test_f1:.4f}`")
+        st.write(
+            "Output rule: return categories until cumulative probability reaches "
+            f"{coverage_label}"
+        )
+
+    with st.expander("Example Input For Quick Check"):
+        st.markdown(
+            f"**Title:** {APP_CONFIG['example_title']}\n\n"
+            f"**Abstract:** {APP_CONFIG['example_abstract']}"
+        )
+
+    with st.form("prediction_form"):
+        title = st.text_input(
+            "Article title",
+            placeholder="Enter the article title",
+        )
+        abstract = st.text_area(
+            "Abstract",
+            placeholder="Enter the abstract (optional, but recommended)",
+            height=220,
+        )
+        predict_button = st.form_submit_button("Predict categories", type="primary")
+
+    if predict_button:
+        if classifier is None:
+            st.error(classifier_load_error or "Model initialization error: classifier is unavailable.")
+            return
+        if not title.strip() and not abstract.strip():
+            st.error("Input validation error in app: please enter at least a title or an abstract.")
+            return
+
+        with st.spinner("Running inference..."):
+            try:
+                full_predictions = classifier.predict(title=title, abstract=abstract)
+                predictions = classifier.select_top_k_by_probability_mass(
+                    full_predictions,
+                    threshold=COVERAGE_THRESHOLD,
+                )
+            except ValueError as exc:
+                st.error(str(exc))
+                return
+            except ClassifierError as exc:
+                st.error(f"Classifier error in prediction flow: {exc}")
+                return
+            except Exception as exc:
+                st.error(f"Unexpected inference error in app.main: {exc}")
+                return
+
+        best_prediction = predictions[0]
+        covered_probability = sum(float(item["probability"]) for item in predictions)
+        col1, col2, col3 = st.columns(3)
+        col1.metric("Top class", str(best_prediction["label"]))
+        col2.metric("Top probability", format_probability(float(best_prediction["probability"])))
+        col3.metric("Top-95% coverage", format_probability(covered_probability))
+
+        st.subheader("Top categories")
+        st.caption(
+            f"These are the categories returned by the assignment top-{coverage_label} rule."
+        )
+        render_prediction_rows(predictions)
+
+        with st.expander("Show Full Ranking"):
+            render_prediction_rows(full_predictions)
+
+
+if __name__ == "__main__":
+    main()
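The defaults-plus-overrides pattern used by `load_app_config` above can be isolated as a small sketch. The `DEFAULTS` dict and file name here are hypothetical stand-ins; the app's real defaults live in `DEFAULT_APP_CONFIG`:

```python
import json
from pathlib import Path

# Hypothetical defaults, mirroring two keys from the app's DEFAULT_APP_CONFIG.
DEFAULTS = {"max_length": 256, "coverage_threshold": 0.95}


def load_config(path: Path) -> dict:
    # Start from the defaults and overlay only the keys present in the file,
    # so a partial or missing config file never drops required settings.
    merged = DEFAULTS.copy()
    if path.exists():
        merged.update(json.loads(path.read_text(encoding="utf-8")))
    return merged


# A missing file falls back to the defaults unchanged.
print(load_config(Path("missing_config.json")))
```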
artifacts/large_model/best_model/config.json
ADDED
@@ -0,0 +1,52 @@
+{
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForSequenceClassification"
+  ],
+  "attention_dropout": 0.1,
+  "bos_token_id": null,
+  "dim": 768,
+  "dropout": 0.1,
+  "dtype": "float32",
+  "eos_token_id": null,
+  "hidden_dim": 3072,
+  "id2label": {
+    "0": "astro-ph.GA",
+    "1": "cond-mat.mtrl-sci",
+    "2": "cs.CL",
+    "3": "cs.CV",
+    "4": "cs.RO",
+    "5": "econ.EM",
+    "6": "math.PR",
+    "7": "physics.optics",
+    "8": "q-bio.BM",
+    "9": "quant-ph"
+  },
+  "initializer_range": 0.02,
+  "label2id": {
+    "astro-ph.GA": 0,
+    "cond-mat.mtrl-sci": 1,
+    "cs.CL": 2,
+    "cs.CV": 3,
+    "cs.RO": 4,
+    "econ.EM": 5,
+    "math.PR": 6,
+    "physics.optics": 7,
+    "q-bio.BM": 8,
+    "quant-ph": 9
+  },
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "pad_token_id": 0,
+  "problem_type": "single_label_classification",
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "tie_word_embeddings": true,
+  "transformers_version": "5.5.0",
+  "use_cache": false,
+  "vocab_size": 30522
+}
artifacts/large_model/best_model/model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5d67b810b6d752ad27b2d7ec1d3621e75366add609dfe8ef71a32fc3157f0b36
+size 267857176
artifacts/large_model/best_model/tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff.
artifacts/large_model/best_model/tokenizer_config.json
ADDED
@@ -0,0 +1,14 @@
+{
+  "backend": "tokenizers",
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "is_local": false,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
artifacts/large_model/best_model/training_args.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1e12cd276b62258c10cf25085394da027012384e6c5100025957fde123d3c1fa
+size 5265
artifacts/large_model/metrics.json
ADDED
@@ -0,0 +1,22 @@
+{
+  "validation": {
+    "eval_loss": 0.4496897757053375,
+    "eval_accuracy": 0.8695652173913043,
+    "eval_macro_f1": 0.8695864631399942,
+    "eval_runtime": 7.4472,
+    "eval_samples_per_second": 52.503,
+    "eval_steps_per_second": 3.357,
+    "epoch": 3.0
+  },
+  "test": {
+    "test_loss": 0.4383482336997986,
+    "test_accuracy": 0.8788659793814433,
+    "test_macro_f1": 0.8768819420923114,
+    "test_runtime": 7.7828,
+    "test_samples_per_second": 49.853,
+    "test_steps_per_second": 3.212,
+    "epoch": 3.0
+  },
+  "model_name": "distilbert-base-uncased",
+  "num_classes": 10
+}
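The macro-F1 values stored in `metrics.json` are unweighted means of per-class F1 scores. A minimal pure-Python sketch of the metric, for reference only (the training pipeline presumably computes it with a library such as scikit-learn):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); define F1 = 0 for an empty class.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally, a rare class with poor F1 drags macro-F1 down even when overall accuracy stays high.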
configs/app_config.json
ADDED
@@ -0,0 +1,11 @@
+{
+  "model_dir": "artifacts/large_model/best_model",
+  "labels_path": "data/processed_large/label_mapping.json",
+  "max_length": 256,
+  "coverage_threshold": 0.95,
+  "model_name": "distilbert-base-uncased",
+  "page_title": "arXiv Topic Classifier",
+  "page_icon": "📚",
+  "example_title": "Learning-based Visual Navigation for Mobile Robots",
+  "example_abstract": "We present a transformer-based navigation system that uses camera observations and scene understanding to plan robust trajectories for indoor mobile robots."
+}
data/processed_large/label_mapping.json
ADDED
@@ -0,0 +1,12 @@
+{
+  "astro-ph.GA": 0,
+  "cond-mat.mtrl-sci": 1,
+  "cs.CL": 2,
+  "cs.CV": 3,
+  "cs.RO": 4,
+  "econ.EM": 5,
+  "math.PR": 6,
+  "physics.optics": 7,
+  "q-bio.BM": 8,
+  "quant-ph": 9
+}
inference.py
ADDED
@@ -0,0 +1,156 @@
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+
+PROJECT_DIR = Path(__file__).resolve().parent
+DEFAULT_MODEL_DIR = PROJECT_DIR / "artifacts" / "large_model" / "best_model"
+DEFAULT_LABELS_PATH = PROJECT_DIR / "data" / "processed_large" / "label_mapping.json"
+
+
+class ClassifierError(RuntimeError):
+    pass
+
+
+class ArticleClassifier:
+    def __init__(
+        self,
+        model_dir: Path = DEFAULT_MODEL_DIR,
+        labels_path: Path = DEFAULT_LABELS_PATH,
+        max_length: int = 256,
+    ) -> None:
+        self.model_dir = Path(model_dir)
+        self.labels_path = Path(labels_path)
+        self.max_length = max_length
+        self.device = torch.device(
+            "mps"
+            if torch.backends.mps.is_available()
+            else "cuda"
+            if torch.cuda.is_available()
+            else "cpu"
+        )
+
+        if not self.labels_path.exists():
+            raise ClassifierError(
+                f"Failed to initialize classifier at labels loading stage: labels file not found at {self.labels_path}"
+            )
+        if not self.model_dir.exists():
+            raise ClassifierError(
+                f"Failed to initialize classifier at model loading stage: model directory not found at {self.model_dir}"
+            )
+
+        try:
+            with self.labels_path.open("r", encoding="utf-8") as fh:
+                self.label2id = json.load(fh)
+        except Exception as exc:
+            raise ClassifierError(
+                f"Failed to initialize classifier at labels loading stage: {exc}"
+            ) from exc
+
+        if not isinstance(self.label2id, dict) or not self.label2id:
+            raise ClassifierError(
+                "Failed to initialize classifier at labels loading stage: label mapping is empty or invalid"
+            )
+        self.id2label = {idx: label for label, idx in self.label2id.items()}
+
+        try:
+            self.tokenizer = AutoTokenizer.from_pretrained(self.model_dir)
+            self.model = AutoModelForSequenceClassification.from_pretrained(self.model_dir)
+            self.model.to(self.device)
+            self.model.eval()
+        except Exception as exc:
+            raise ClassifierError(
+                f"Failed to initialize classifier at model loading stage: {exc}"
+            ) from exc
+
+    @property
+    def labels(self) -> list[str]:
+        return [self.id2label[idx] for idx in sorted(self.id2label)]
+
+    @staticmethod
+    def build_input_text(title: str, abstract: str) -> str:
+        clean_title = " ".join(title.split()).strip()
+        clean_abstract = " ".join(abstract.split()).strip()
+        if clean_abstract:
+            return f"title: {clean_title} abstract: {clean_abstract}"
+        return f"title: {clean_title}"
+
+    def predict(self, title: str, abstract: str = "") -> list[dict[str, float | str]]:
+        if not isinstance(title, str):
+            raise ValueError("Input validation error in predict: title must be a string.")
+        if not isinstance(abstract, str):
+            raise ValueError("Input validation error in predict: abstract must be a string.")
+        if not title.strip() and not abstract.strip():
+            raise ValueError(
+                "Input validation error in predict: please provide at least a title or an abstract."
+            )
+
+        text = self.build_input_text(title=title, abstract=abstract)
+        try:
+            encoded = self.tokenizer(
+                text,
+                return_tensors="pt",
+                truncation=True,
+                max_length=self.max_length,
+            )
+            encoded = {key: value.to(self.device) for key, value in encoded.items()}
+        except Exception as exc:
+            raise ClassifierError(f"Failed during tokenization stage: {exc}") from exc
+
+        try:
+            with torch.inference_mode():
+                logits = self.model(**encoded).logits
+                probabilities = torch.softmax(logits, dim=-1).squeeze(0).detach().cpu()
+        except Exception as exc:
+            raise ClassifierError(f"Failed during model inference stage: {exc}") from exc
+
+        results: list[dict[str, float | str]] = []
+        try:
+            for class_id, probability in enumerate(probabilities.tolist()):
+                results.append(
+                    {
+                        "label": self.id2label[class_id],
+                        "probability": float(probability),
+                    }
+                )
+        except Exception as exc:
+            raise ClassifierError(f"Failed during prediction formatting stage: {exc}") from exc
+
+        results.sort(key=lambda item: item["probability"], reverse=True)
+        return results
+
+    @staticmethod
+    def select_top_95(
+        predictions: list[dict[str, float | str]],
+    ) -> list[dict[str, float | str]]:
+        return ArticleClassifier.select_top_k_by_probability_mass(
+            predictions=predictions,
+            threshold=0.95,
+        )
+
+    @staticmethod
+    def select_top_k_by_probability_mass(
+        predictions: list[dict[str, float | str]],
+        threshold: float = 0.95,
+    ) -> list[dict[str, float | str]]:
+        if not 0 < threshold <= 1:
+            raise ValueError("Probability mass threshold must be in the interval (0, 1].")
+
+        cumulative_probability = 0.0
+        top_predictions: list[dict[str, float | str]] = []
+
+        for item in predictions:
+            top_predictions.append(item)
+            cumulative_probability += float(item["probability"])
+            if cumulative_probability >= threshold:
+                break
+
+        return top_predictions
+
+    def predict_top_95(self, title: str, abstract: str = "") -> list[dict[str, float | str]]:
+        predictions = self.predict(title=title, abstract=abstract)
+        return self.select_top_95(predictions)
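The input formatting contract from `inference.py` can be exercised in isolation; this standalone copy of `build_input_text` mirrors the whitespace normalization in the class above:

```python
def build_input_text(title: str, abstract: str) -> str:
    # Collapse runs of whitespace (including newlines) to single spaces,
    # then join the fields in the "title: ... abstract: ..." format the
    # model was trained on; title-only input drops the abstract segment.
    clean_title = " ".join(title.split())
    clean_abstract = " ".join(abstract.split())
    if clean_abstract:
        return f"title: {clean_title} abstract: {clean_abstract}"
    return f"title: {clean_title}"


print(build_input_text("Deep  Learning\nfor Robots", ""))
# → title: Deep Learning for Robots
```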
requirements.txt
CHANGED
@@ -1,3 +1,5 @@
-
-
-
+numpy>=1.26
+torch>=2.2,<3.0
+transformers>=4.41
+streamlit>=1.33,<2.0
+safetensors>=0.4