metadata
title: TopicDig
emoji: 🕵🏻
colorFrom: gray
colorTo: yellow
sdk: streamlit
app_file: app.py
pinned: false

Configuration

title: string
Display title for the Space

emoji: string
Space emoji (emoji-only character allowed)

colorFrom: string
Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)

colorTo: string
Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)

sdk: string
Can be either gradio or streamlit

sdk_version: string
Only applicable for streamlit SDK.
See doc for more info on supported versions.

app_file: string
Path to your main application file (which contains either gradio or streamlit Python code).
Path is relative to the root of the repository.

pinned: boolean
Whether the Space stays on top of your list.

TopicDig

TopicDig uses whole-article summarization to create topical digests from current news headlines.

The app displays topics, the user chooses up to three, and the app builds a topical digest from the articles behind those headlines. This project makes heavy use of HuggingFace for NLP and Gazpacho for web scraping.

The method of article selection here is arbitrary. Pre-assigned article tags could be used to select groups of articles, or semantic-similarity methods could be used to evaluate the article text. In practice, an enterprise running such a system would keep its articles in a database it owns and could run background processing so that summaries are ready on demand.

The pipeline:

  • Current headlines are scraped from two news sites.
  • NER is performed on each headline to extract topics. Some headlines yield no topics.
  • Article links are clustered according to the entities in their headlines.
  • The user selects up to three clusters.
  • Articles from those clusters are scraped, each article is summarized in chunks, and the summaries are concatenated to create a digest.
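The clustering step in the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function names, the `(headline, url)` input shape, and the `toy_entities` stand-in for an NER model are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def cluster_links_by_entity(
    headlines: Iterable[Tuple[str, str]],
    extract_entities: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Group article links under every entity found in their headline.

    `headlines` is an iterable of (headline_text, article_url) pairs.
    Headlines that yield no entities end up in no cluster at all.
    """
    clusters: Dict[str, List[str]] = defaultdict(list)
    for text, url in headlines:
        # Normalize and de-duplicate the entities found in this headline.
        entities = {e.strip().lower() for e in extract_entities(text) if e.strip()}
        for entity in entities:
            if url not in clusters[entity]:
                clusters[entity].append(url)
    return dict(clusters)


def toy_entities(headline: str) -> List[str]:
    """Hypothetical stand-in for an NER model (assumption for this sketch):
    treat every capitalized word after the first word as an entity."""
    words = headline.split()
    return [w.strip(".,:;!?") for w in words[1:] if w[:1].isupper()]


sample = [
    ("Talks resume in Geneva over trade", "https://example.com/a"),
    ("Storm closes roads in Geneva", "https://example.com/b"),
    ("markets steady after a quiet week", "https://example.com/c"),
]
clusters = cluster_links_by_entity(sample, toy_entities)
```

Here the two Geneva headlines land in one cluster and the third headline, which yields no entities, is dropped. In a real setup `extract_entities` would wrap an NER model, for example a Hugging Face `pipeline("ner", aggregation_strategy="simple")` whose results carry a `word` field; how the original app wires this up is not specified here.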

This app explores a few ideas:

  • IR for QA and comprehension
    • A cheap and quick way to explore an area of research dominated by large, end-to-end trained models like RAG and NewsSum.
  • News delivery and access
    • CNN provides summaries, but there's a huge difference between being served something and being able to "create" my own news.
    • Sneaks around headlines to show what is actually in the article. Headlines can push and pull.
    • Takes back some control over our attention and enables empowered consumption, while keeping news production in the hands of professionals.
  • Editorial ideation
    • Can be used to find implied but uncovered stories by creating news assemblages without knowing exactly what you'll get. Even though an editor knows what they're currently covering, imagine them writing a sentence describing each article on a piece of paper. That is not the same as seeing the information from the final articles assembled and juxtaposed like this.
  • Cross-article information access
    • Related information that paints a picture can be spread across multiple articles from different times. There are more stories lying latent in the stories already told.
  • Whole news article summarization pitfalls and windfalls.
    • Doing whole articles: technique and results.
  • Community pantry principle
    • No free lunch but there is a community pantry. It only gets you so close.
  • Evaluating summarization
    • It is difficult to objectively evaluate summarization capability beyond a general level.
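The whole-article technique mentioned above (summarize each article in chunks, then concatenate the summaries into a digest) can be sketched like this. It is a rough illustration under stated assumptions: chunking by word count, the default chunk size, and all function names are hypothetical, and `first_sentence` is a toy stand-in for a real summarization model.

```python
from typing import Callable, Iterable, List


def chunk_words(text: str, max_words: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words.

    Word-based chunking is an assumption of this sketch; a real model
    limit is usually expressed in tokens rather than words.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_article(
    text: str, summarize: Callable[[str], str], max_words: int = 300
) -> str:
    """Summarize each chunk independently and join the partial summaries."""
    parts = [summarize(chunk).strip() for chunk in chunk_words(text, max_words)]
    return " ".join(p for p in parts if p)


def build_digest(
    articles: Iterable[str], summarize: Callable[[str], str], max_words: int = 300
) -> str:
    """Concatenate per-article summaries into one digest, one paragraph each."""
    return "\n\n".join(summarize_article(a, summarize, max_words) for a in articles)


def first_sentence(chunk: str) -> str:
    """Hypothetical stand-in summarizer: keep only the chunk's first sentence."""
    head, sep, _ = chunk.partition(". ")
    return head + "." if sep else chunk


article = "One two three. Four five six. Seven eight nine."
digest = build_digest([article], first_sentence, max_words=6)
```

Because each chunk is summarized in isolation, context that spans a chunk boundary is lost, which is one of the pitfalls of this approach. Swapping `first_sentence` for a callable that wraps an actual summarization model leaves the rest of the sketch unchanged.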

This application was created as the culmination of a semester of independent graduate research into NLP and transformers.

The original repo for the earlier version of this app is located at https://github.com/mpolinsky/sju_final_project/