News Scraper Model: Section-Aware News Summarizer

A fine-tuned facebook/bart-large-cnn that turns full news articles into ~60-word, Inshorts-style summaries with section-specific structure: a Crime story leads with who was charged, a Business story leads with the key number, a Sport story leads with the result.

How to use

This model was trained on prompt-prefixed inputs, so you must wrap the article in the same instruction format used during training. The example below uses the Tech prompt:

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("aniket23/news_scraper_model")
model = BartForConditionalGeneration.from_pretrained("aniket23/news_scraper_model")

title = "OpenAI launches GPT-5"
text  = "OpenAI today unveiled GPT-5, claiming major reasoning improvements ..."

prompt = (
    "Summarise as Tech news: Start with the company or product name. Include "
    "what was launched, announced, or discovered, key specs or numbers that "
    "matter, who is affected, and why it changes the industry or everyday users."
    f"\n\nArticle: {title}. {text}"
)

# Truncate the prompt to 512 tokens; longer articles are cut off.
inputs = tokenizer(prompt, max_length=512, truncation=True, return_tensors="pt")
# Beam search, with the summary bounded to 50-110 new tokens.
ids = model.generate(
    **inputs,
    max_new_tokens=110, min_new_tokens=50,
    num_beams=4, length_penalty=2.0, early_stopping=True,
)
print(tokenizer.decode(ids[0], skip_special_tokens=True))

The full pipeline (scraping, section classification, and the exact prompt for each of the 20 sections) is on GitHub: AniketMishra23/news_scraper_model

Sections

Crime, Tech, Politics, Business, Science, Sport, Entertainment, Lifestyle, World, Health, Education, Property, Environment, Defence, Travel, Immigration, Law, Economy, Arts, Personal Finance. Each section has its own summary structure.
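
Because each section has its own instruction, the prompt can be assembled by a small helper. The sketch below is one assumed way to do that, not the repository's code: `SECTION_INSTRUCTIONS` and `build_prompt` are hypothetical names, and only the Tech instruction is taken from this card. The other 19 instructions live in the GitHub repository and are not reproduced here.

```python
# Hypothetical helper; only the Tech instruction comes from this model card.
SECTION_INSTRUCTIONS = {
    "Tech": (
        "Start with the company or product name. Include "
        "what was launched, announced, or discovered, key specs or numbers that "
        "matter, who is affected, and why it changes the industry or everyday users."
    ),
}


def build_prompt(section: str, title: str, text: str) -> str:
    """Wrap an article in the 'Summarise as <section> news: ...' format."""
    if section not in SECTION_INSTRUCTIONS:
        raise KeyError(f"No instruction registered for section {section!r}")
    instruction = SECTION_INSTRUCTIONS[section]
    return f"Summarise as {section} news: {instruction}\n\nArticle: {title}. {text}"


prompt = build_prompt("Tech", "OpenAI launches GPT-5", "OpenAI today unveiled GPT-5 ...")
```

Raising on an unknown section, rather than silently falling back to a generic prompt, keeps inference inputs in the format the model saw during training.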

Training

  • Base model: facebook/bart-large-cnn
  • Data: ~700 news articles scraped from 25 RSS feeds, labelled via knowledge-distillation bootstrapping (base BART generates the target summary for each section-prefixed input)
  • Hardware: RTX 4060 Laptop GPU (8 GB), fp16, ~10 min
  • Selection: best checkpoint by validation ROUGE-L, with early stopping
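
The bootstrapping step in the Data bullet can be pictured as follows. This is a rough sketch under stated assumptions, not the training script: `make_bootstrap_pairs` is a hypothetical name, and `teacher` stands in for any callable that maps a prompt to a summary (in the setup described here, that role is played by base facebook/bart-large-cnn). Dropping empty teacher outputs is also an assumption.

```python
from typing import Callable, Dict, Iterable, List


def make_bootstrap_pairs(
    prompts: Iterable[str],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Label each section-prefixed prompt with the teacher's summary."""
    pairs = []
    for prompt in prompts:
        target = teacher(prompt).strip()
        if target:  # assumption: drop inputs the teacher returned nothing for
            pairs.append({"input": prompt, "target": target})
    return pairs


# Toy teacher for illustration only: returns the first sentence of the article.
def toy_teacher(prompt: str) -> str:
    article = prompt.split("Article: ", 1)[-1]
    return article.split(". ")[0] + "."


pairs = make_bootstrap_pairs(
    ["Summarise as Tech news: ...\n\nArticle: Acme ships a new phone. It costs $500."],
    toy_teacher,
)
```

The student is then fine-tuned on these (input, target) pairs, which is why its scores measure agreement with the teacher.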

Metric (validation)            Value
ROUGE-1 / ROUGE-2 / ROUGE-L    0.61 / 0.50 / 0.56
Average summary length         ~57 words

ROUGE is measured against the bootstrap labels, so it reflects consistency with the teacher model rather than human-judged quality. Practical strengths: consistent length, section-appropriate structure, complete sentences.
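
To make the metric concrete, here is a simplified ROUGE-L F1 computed from the longest common subsequence (LCS) of whitespace tokens. It is a sketch for intuition only: `lcs_length` and `rouge_l_f1` are hypothetical names, and standard ROUGE packages tokenize and normalize differently, so this will not reproduce the reported numbers exactly.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Here the reference plays the role of a bootstrap (teacher) label.
score = rouge_l_f1(
    "the court charged the suspect",
    "the suspect was charged by the court",
)
```

In this toy pair the LCS is three tokens ("the charged the"), giving precision 3/5, recall 3/7, and an F1 of 0.5.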

Limitations

  • Trained on bootstrap (not human-written) labels, so the quality ceiling is roughly "base BART, but length-controlled and section-aware".
  • Section classification is keyword-based and can misfire on ambiguous articles.
  • English news only.
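
The keyword-based classification noted in the list above, and the way it can misfire, can be illustrated with a toy classifier. The keyword lists, the `classify_section` name, and the fallback to World are all assumptions made for illustration; the actual rules are in the GitHub repository.

```python
import re

# Hypothetical keyword lists; the real pipeline's keywords are not shown here.
SECTION_KEYWORDS = {
    "Crime": {"charged", "arrested", "police", "court"},
    "Business": {"profit", "revenue", "shares", "merger"},
    "Sport": {"match", "goal", "tournament", "coach"},
}


def classify_section(text: str, default: str = "World") -> str:
    """Pick the section whose keywords appear most often in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {
        section: sum(tok in keywords for tok in tokens)
        for section, keywords in SECTION_KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else default


clear = classify_section("Police arrested a man who was later charged in court.")
# An ambiguous story: business vocabulary outnumbers the single crime keyword.
ambiguous = classify_section("The coach was charged after the club hid revenue and profit.")
```

With these lists, `clear` is "Crime" while `ambiguous` comes out as "Business", even though a reader might file the second story under Crime or Sport. A misclassified article then gets the wrong section prompt.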

License

MIT (inherited from facebook/bart-large-cnn).
