---
license: mit
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
---

# Newswire Classifier (AP, UPI, NEA) - BERT Transformers

## 📘 Overview

This repository contains three separately trained BERT models for identifying whether a newspaper article was produced by one of three major newswire services:

- **AP (Associated Press)**
- **UPI (United Press International)**
- **NEA (Newspaper Enterprise Association)**

The models are designed for historical news classification from public-domain newswire articles (1960–1975).

## 🧠 Model Architecture

- **Base Model:** `bert-base-uncased`
- **Task:** Binary classification (`1` if from the specific newswire, `0` otherwise)
- **Optimizer:** AdamW
- **Loss Function:** Binary Cross-Entropy with Logits
- **Batch Size:** 16
- **Epochs:** 4
- **Learning Rate:** 2e-5
- **Device:** TPU (v2-8) in Google Colab

A hedged fine-tuning sketch using these hyperparameters appears in Appendix C at the end of this card.

## 📊 Training Data

- **Source:** Historical newspapers (1960–1975, public domain)
- **Articles:** 4,000 per training round (1,000 from the target newswire, 3,000 from other sources)
- **Features Used:** Headline, author, and the first 100 characters of the article body
- **Labeling:** `1` for articles from the target newswire, `0` for all others

## 🚀 Model Performance

| Model | Accuracy | Precision | Recall | F1 Score |
|-------|----------|-----------|--------|----------|
| **AP** | 0.9925 | 0.9926 | 0.9925 | 0.9925 |
| **UPI** | 0.9999 | 0.9999 | 0.9999 | 0.9999 |
| **NEA** | 0.9875 | 0.9880 | 0.9875 | 0.9876 |

## 🛠️ Usage

### Installation

```bash
pip install transformers torch
```

### Example Inference (AP Classifier)

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Each checkpoint lives in a subfolder of the repository, so it is loaded
# with the `subfolder` argument rather than appended to the repo id.
model = AutoModelForSequenceClassification.from_pretrained("mike-mcrae/newswire_classifier", subfolder="AP")
tokenizer = AutoTokenizer.from_pretrained("mike-mcrae/newswire_classifier", subfolder="AP")

text = "(AP) President speaks at conference..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits

# A single-logit head (BCE-with-logits training) is thresholded at 0.5;
# a two-logit head is read off with argmax.
prediction = int(torch.sigmoid(logits).item() > 0.5) if logits.shape[-1] == 1 else logits.argmax(dim=-1).item()
print("AP Article" if prediction == 1 else "Not AP Article")
```

Appendix B below shows how to run all three classifiers over the same text.

## ⚙️ Recommended Usage Notes

- The models were trained on the concatenation of headline, author, and the first 100 characters of the article body, since the newswire credit usually appears in these sections. Formatting inference inputs the same way should improve accuracy (see the input-formatting sketch in Appendix A).

## 📜 Licensing & Data Source

- **Training Data:** Historical newspaper articles (1960–1975) from public-domain sources
- **License:** Public domain for the data; MIT License for the models and code

## 💬 Citation

If you use these models, please cite:

```
@misc{newswire_classifier,
  author    = {McRae, Michael},
  title     = {Newswire Classifier (AP, UPI, NEA) - BERT Transformers},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/mike-mcrae/newswire_classifier}
}
```
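
## 🧾 Appendix A: Input-Formatting Sketch

A minimal sketch of the input format described in the usage notes. The exact separator used during training is not documented in this card, so the `" | "` joins below are an assumption; match whatever format your own data uses.

```python
def format_article(headline: str, author: str, body: str) -> str:
    """Build the classifier input: headline + author + first 100 characters
    of the article body, mirroring the training features described above.
    The " | " separator is a hypothetical choice, not a documented one."""
    return f"{headline} | {author} | {body[:100]}"

text = format_article(
    "President speaks at conference",
    "By JOHN SMITH",
    "(AP) WASHINGTON - The president addressed reporters Tuesday...",
)
```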
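
## 🔎 Appendix B: Scoring All Three Classifiers

Since the three models are independent binary classifiers, an article can be scored against each of them in turn. This sketch assumes the checkpoints live in `AP/`, `UPI/`, and `NEA/` subfolders of the repository, as in the Usage example above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

TEXT = "(UPI) Senate panel opens hearings on the budget..."

for wire in ("AP", "UPI", "NEA"):
    model = AutoModelForSequenceClassification.from_pretrained(
        "mike-mcrae/newswire_classifier", subfolder=wire
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "mike-mcrae/newswire_classifier", subfolder=wire
    )
    inputs = tokenizer(TEXT, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Positive-class probability for either a one- or two-logit head.
    if logits.shape[-1] == 1:
        score = torch.sigmoid(logits).item()
    else:
        score = torch.softmax(logits, dim=-1)[0, 1].item()
    print(f"{wire}: {score:.3f}")
```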
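
## 🏋️ Appendix C: Fine-Tuning Sketch

A minimal sketch of the training setup described in the Model Architecture section (AdamW, BCE-with-logits, batch size 16, 4 epochs, learning rate 2e-5). It assumes a single-logit head and computes the BCE loss manually; `train_pairs` is a hypothetical list of `(text, label)` tuples with float labels (`1.0` for the target newswire, `0.0` otherwise). The card's actual run used a TPU v2-8 in Colab; TPU-specific setup is omitted here.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LR, BATCH_SIZE, EPOCHS = 2e-5, 16, 4  # hyperparameters stated in this card

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1  # single logit for BCE-with-logits
)

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), return_tensors="pt",
                    truncation=True, padding=True, max_length=128)
    enc["labels"] = torch.tensor(labels).unsqueeze(1)  # shape (B, 1)
    return enc

# `train_pairs` is hypothetical: [(text, 1.0), (text, 0.0), ...]
loader = DataLoader(train_pairs, batch_size=BATCH_SIZE,
                    shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
loss_fn = torch.nn.BCEWithLogitsLoss()

model.train()
for epoch in range(EPOCHS):
    for batch in loader:
        labels = batch.pop("labels")
        logits = model(**batch).logits           # shape (B, 1)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```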