---
license: mit
language:
  - en
metrics:
  - accuracy
  - bertscore
  - f1
base_model:
  - google-bert/bert-base-uncased
pipeline_tag: text-classification
---

# Newswire Classifier (AP, UPI, NEA) - BERT Transformers

## 📘 Overview

This repository contains three separately trained BERT models for identifying whether a newspaper article was produced by one of three major newswire services:

- **AP** (Associated Press)
- **UPI** (United Press International)
- **NEA** (Newspaper Enterprise Association)

The models are designed for historical news classification from public-domain newswire articles (1960–1975).

## 🧠 Model Architecture

- **Base Model:** `bert-base-uncased`
- **Task:** Binary classification (1 if from the specific newswire, 0 otherwise)
- **Optimizer:** AdamW
- **Loss Function:** Binary Cross-Entropy with Logits
- **Batch Size:** 16
- **Epochs:** 4
- **Learning Rate:** 2e-5
- **Device:** TPU (v2-8) in Google Colab

A minimal training-loop sketch using these settings is shown below.
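The sketch is device-agnostic rather than TPU-specific, the placeholder data is hypothetical, and the one-hot BCE targets are an assumption made to reconcile the BCE-with-logits objective with the two-logit head used in the inference example; this is not the author's exact training code.

```python
import torch
from torch.nn import BCEWithLogitsLoss
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-uncased", num_labels=2
).to(device)

optimizer = AdamW(model.parameters(), lr=2e-5)
loss_fn = BCEWithLogitsLoss()

# Hypothetical placeholder data; replace with your own labeled articles
train_texts = ["(AP) President speaks at conference...", "Local man wins pie contest"]
train_labels = [1, 0]  # 1 = target newswire, 0 = other

loader = DataLoader(list(zip(train_texts, train_labels)), batch_size=16, shuffle=True)

model.train()
for epoch in range(4):
    for texts, labels in loader:
        enc = tokenizer(list(texts), return_tensors="pt", padding=True,
                        truncation=True, max_length=128).to(device)
        # One-hot float targets so BCE-with-logits matches the two-logit head
        targets = torch.nn.functional.one_hot(labels, num_classes=2).float().to(device)
        loss = loss_fn(model(**enc).logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```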

## 📊 Training Data

- **Source:** Historical newspapers (1960–1975, public domain)
- **Articles:** 4000 per training round (1000 from the target newswire, 3000 from other sources)
- **Features Used:** Headline, author, and the first 100 characters of the article
- **Labeling:** 1 for articles from the target newswire, 0 for all others

A sketch of this dataset assembly is shown below.
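The `target_articles` / `other_articles` lists, the field names, and the `make_input` helper are hypothetical; the original preprocessing code is not published here.

```python
import random

def make_input(article: dict) -> str:
    # Combine headline, author, and the first 100 characters of the body,
    # since the newswire attribution usually appears in these sections
    return f"{article['headline']} {article['author']} {article['body'][:100]}"

def build_training_set(target_articles: list, other_articles: list, seed: int = 42):
    # 1000 positives from the target newswire, 3000 negatives from other sources
    rng = random.Random(seed)
    positives = [(make_input(a), 1) for a in rng.sample(target_articles, 1000)]
    negatives = [(make_input(a), 0) for a in rng.sample(other_articles, 3000)]
    dataset = positives + negatives
    rng.shuffle(dataset)
    return dataset
```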

## 🚀 Model Performance

| Model | Accuracy | Precision | Recall | F1 Score |
|-------|----------|-----------|--------|----------|
| AP    | 0.9925   | 0.9926    | 0.9925 | 0.9925   |
| UPI   | 0.9999   | 0.9999    | 0.9999 | 0.9999   |
| NEA   | 0.9875   | 0.9880    | 0.9875 | 0.9876   |
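These metrics were presumably computed on a held-out split. As an illustration only, they could be reproduced from predictions with scikit-learn; the weighted averaging and the placeholder arrays are assumptions, not the author's evaluation code.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 0]  # held-out labels (placeholder)
y_pred = [1, 0, 1, 0]  # model predictions (placeholder)

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
print(f"Accuracy={accuracy:.4f} Precision={precision:.4f} "
      f"Recall={recall:.4f} F1={f1:.4f}")
```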

πŸ› οΈ Usage

### Installation

```bash
pip install transformers torch
```

### Example Inference (AP Classifier)

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Each newswire model appears to live in its own subfolder (AP, UPI, NEA)
# of the repository, so pass the subfolder explicitly
model = AutoModelForSequenceClassification.from_pretrained(
    "mike-mcrae/newswire_classifier", subfolder="AP"
)
tokenizer = AutoTokenizer.from_pretrained(
    "mike-mcrae/newswire_classifier", subfolder="AP"
)

text = "(AP) President speaks at conference..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax().item()
print("AP Article" if prediction == 1 else "Not AP Article")
```

βš™οΈ Recommended Usage Notes

- The models were trained on a concatenation of the headline, the author line, and the first 100 characters of the article, since the newswire attribution often appears in these sections. Formatting inference inputs the same way may improve accuracy (see the sketch below).
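For illustration, inference inputs could be built with the same hypothetical `make_input` helper sketched in the Training Data section:

```python
article = {
    "headline": "President speaks at conference",
    "author": "(AP)",
    "body": "WASHINGTON (AP) - The president addressed the nation on...",
}
# Same headline + author + first-100-characters format used during training
text = make_input(article)
```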

πŸ“œ Licensing & Data Source

- **Training Data:** Historical newspaper articles (1960–1975) from public-domain sources.
- **License:** Public domain (for the data) and MIT License (for the model and code).

## 💬 Citation

If you use these models, please cite:

```bibtex
@misc{newswire_classifier,
  author = {McRae, Michael},
  title = {Newswire Classifier (AP, UPI, NEA) - BERT Transformers},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/username/newswire_classifier}
}
```