News Scraper Model — Section-Aware News Summarizer

A fine-tuned facebook/bart-large-cnn that turns full news articles into ~60-word, Inshorts-style summaries with section-specific structure: a Crime story leads with who was charged, a Business story leads with the key number, a Sport story leads with the result.

How to use

The model was trained on prompt-prefixed inputs, so wrap each article in the same instruction format used during training. The example below uses the Tech prompt:

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("aniket23/news_scraper_model")
model = BartForConditionalGeneration.from_pretrained("aniket23/news_scraper_model")

title = "OpenAI launches GPT-5"
text  = "OpenAI today unveiled GPT-5, claiming major reasoning improvements ..."

prompt = (
    "Summarise as Tech news: Start with the company or product name. Include "
    "what was launched, announced, or discovered, key specs or numbers that "
    "matter, who is affected, and why it changes the industry or everyday users."
    f"\n\nArticle: {title}. {text}"
)

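# Tokenize the prompt; anything beyond 512 tokens is truncated.
inputs = tokenizer(prompt, max_length=512, truncation=True, return_tensors="pt")

# Beam search with 4 beams; the min/max settings bound the summary length in tokens.
ids = model.generate(
    **inputs,
    max_new_tokens=110, min_new_tokens=50,
    num_beams=4, length_penalty=2.0, early_stopping=True,
)
print(tokenizer.decode(ids[0], skip_special_tokens=True))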

The full pipeline — scraping, section classification, and the exact prompt for each of the 20 sections — is on GitHub: AniketMishra23/news_scraper_model
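
If you summarise more than one section, the prompt wrapping can be factored into a small helper. The following is a minimal sketch, not code from the repository: the names `SECTION_INSTRUCTIONS` and `build_prompt` are hypothetical, and only the Tech instruction (copied from the example above) is filled in. The remaining section instructions should be taken from the GitHub repository rather than guessed.

```python
# Hypothetical helper; names are illustrative, not taken from the repository.
SECTION_INSTRUCTIONS = {
    # Copied verbatim from the Tech example in this card.
    "Tech": (
        "Summarise as Tech news: Start with the company or product name. Include "
        "what was launched, announced, or discovered, key specs or numbers that "
        "matter, who is affected, and why it changes the industry or everyday users."
    ),
    # The other section instructions are defined in the GitHub repository
    # and are deliberately not guessed here.
}


def build_prompt(section: str, title: str, text: str) -> str:
    """Wrap an article in the instruction format shown in the Tech example."""
    try:
        instruction = SECTION_INSTRUCTIONS[section]
    except KeyError:
        raise ValueError(f"No instruction registered for section {section!r}") from None
    return f"{instruction}\n\nArticle: {title}. {text}"
```

Calling `build_prompt("Tech", title, text)` yields the same string as the inline `prompt` in the example above. Unknown sections raise an error instead of silently falling back to a prompt the model was not trained on.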

Sections

Crime, Tech, Politics, Business, Science, Sport, Entertainment, Lifestyle, World, Health, Education, Property, Environment, Defence, Travel, Immigration, Law, Economy, Arts, Personal Finance — each with its own summary structure.

Training

Base model: facebook/bart-large-cnn
Data: ~700 news articles scraped from 25 RSS feeds, labelled via knowledge-distillation bootstrapping (base BART generates the target summary for each section-prefixed input)
Hardware: RTX 4060 Laptop GPU (8 GB), fp16, ~10 min of training
Selection: best checkpoint by validation ROUGE-L, with early stopping
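
The bootstrapping step described under Data can be pictured roughly as follows. This is a sketch under stated assumptions, not the repository's code: `bootstrap_labels` and the `teacher` callable are hypothetical names, the teacher stands in for base BART run on each section-prefixed input, and dropping empty teacher outputs is an assumption.

```python
from typing import Callable, Dict, Iterable, List


def bootstrap_labels(
    prompts: Iterable[str],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each section-prefixed input with the teacher's summary of it."""
    pairs = []
    for prompt in prompts:
        target = teacher(prompt).strip()
        if target:  # assumption: skip inputs where the teacher produced nothing
            pairs.append({"input": prompt, "target": target})
    return pairs
```

In practice `teacher` would wrap a `generate` call on facebook/bart-large-cnn. Passing it in as a callable keeps the sketch independent of how the teacher is loaded.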

Metric (validation)	Value
ROUGE-1 / ROUGE-2 / ROUGE-L	0.61 / 0.50 / 0.56
Average summary length	~57 words

ROUGE is measured against the bootstrap labels, so it reflects consistency with the teacher model rather than human-judged quality. Practical strengths: consistent length, section-appropriate structure, complete sentences.
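
To make the ROUGE-L figure concrete, here is a simplified sketch of the metric: the F1 score over the longest common subsequence of whitespace tokens. The function names are hypothetical, and this card does not state which ROUGE implementation was used, so treat it as an illustration of what the score measures rather than a reproduction of the evaluation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 on lowercased whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Here `reference` is the bootstrap label produced by the teacher, not a human-written summary. A high score therefore means the fine-tuned model agrees with the teacher's wording.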

Limitations

Trained on bootstrap (not human-written) labels — quality ceiling is roughly "base BART, but length-controlled and section-aware".
Section classification is keyword-based and can misfire on ambiguous articles.
English news only.
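
The keyword-classification limitation is easy to see in a toy version. The sketch below is illustrative only: the keyword lists, the `classify_section` name, and the "World" fallback are all assumptions, covering three of the sections; the actual rules live in the GitHub repository.

```python
import re

# Hypothetical keyword lists for three sections; not the repository's lists.
SECTION_KEYWORDS = {
    "Crime": {"arrested", "charged", "police", "court"},
    "Business": {"shares", "profit", "revenue", "merger"},
    "Sport": {"match", "goal", "league", "tournament"},
}


def classify_section(text: str, default: str = "World") -> str:
    """Pick the section whose keywords occur most often in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {
        section: sum(token in keywords for token in tokens)
        for section, keywords in SECTION_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Assumed fallback when no keyword matches at all.
    return best if scores[best] > 0 else default


# An ambiguous headline: one Crime keyword, two Sport keywords.
print(classify_section("Club chairman charged with fraud after league match"))
# prints "Sport", although the story is arguably Crime
```

Because the chosen section determines which instruction prefix the article receives, a misfire like this changes the structure of the resulting summary.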

License

MIT (inherited from facebook/bart-large-cnn).


Model size: 0.4B params (Safetensors, F32)
