Spaces: abdullahmubeen10 / sparknlp-text-summarization

abdullahmubeen10 committed 20 days ago

Commit d6e48fd • 1 Parent(s): 3473964

Upload 6 files

Files changed (7)

.gitattributes +1 -0
.streamlit/config.toml +3 -0
Demo.py +130 -0
Dockerfile +76 -0
images/T5_model_diagram.jpg +3 -0
pages/Workflow & Model Overview.py +173 -0
requirements.txt +5 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+images/T5_model_diagram.jpg filter=lfs diff=lfs merge=lfs -text

.streamlit/config.toml ADDED

	@@ -0,0 +1,3 @@

+[theme]
+base="light"
+primaryColor="#29B4E8"

Demo.py ADDED

	@@ -0,0 +1,130 @@

+import streamlit as st
+import sparknlp
+import os
+from sparknlp.base import *
+from sparknlp.common import *
+from sparknlp.annotator import *
+from pyspark.ml import Pipeline
+from sparknlp.pretrained import PretrainedPipeline
+# Configure Streamlit page
+st.set_page_config(
+    layout="wide",
+    page_title="Spark NLP Demos App",
+    initial_sidebar_state="auto"
+)
+# Custom CSS for better styling
+st.markdown("""
+    <style>
+        .main-title {
+            font-size: 36px;
+            color: #4A90E2;
+            font-weight: bold;
+            text-align: center;
+        }
+        .sub-title {
+            font-size: 24px;
+            color: #333333;
+            margin-top: 20px;
+        }
+        .section {
+            background-color: #f9f9f9;
+            padding: 15px;
+            border-radius: 10px;
+            margin-top: 20px;
+        }
+        .section h2 {
+            font-size: 22px;
+            color: #4A90E2;
+        }
+        .section p, .section ul {
+            color: #666666;
+        }
+    </style>
+""", unsafe_allow_html=True)
+@st.cache_resource
+def init_spark():
+    spark = sparknlp.start()
+    return spark
+@st.cache_resource
+def create_pipeline(model):
+    # Wrap the raw "text" column into document annotations
+    document_assembler = DocumentAssembler()\
+        .setInputCol("text")\
+        .setOutputCol("documents")
+    # pretrained() is called on the class; it downloads and returns the model
+    t5 = T5Transformer.pretrained(model, 'en') \
+        .setTask("summarize:")\
+        .setMaxOutputLength(200)\
+        .setInputCols(["documents"]) \
+        .setOutputCol("summaries")
+    pipeline = Pipeline().setStages([document_assembler, t5])
+    return pipeline
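+def fit_data(pipeline, data):
+    # Fetch the cached session explicitly instead of relying on a module-level global
+    spark = init_spark()
+    # Every stage is pretrained, so fitting on an empty DataFrame only builds the PipelineModel
+    empty_df = spark.createDataFrame([['']]).toDF('text')
+    pipeline_model = pipeline.fit(empty_df)
+    # LightPipeline annotates plain strings without creating a DataFrame per request
+    model = LightPipeline(pipeline_model)
+    results = model.fullAnnotate(data)[0]
+    return results['summaries'][0].result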
+############ SETTING UP THE PAGE LAYOUT ############
+### SIDEBAR CONTENT ###
+# Model selection in sidebar
+model = st.sidebar.selectbox(
+    "Choose the pretrained model",
+    ['t5_base', 't5_small'],
+    help="For more info about the models visit: https://sparknlp.org/models"
+)
+# Colab link for the notebook
+link = """<a href="https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5TRANSFORMER.ipynb">
+<img src="https://colab.research.google.com/assets/colab-badge.svg" style="zoom: 1.3" alt="Open In Colab"/>
+</a>"""
+st.sidebar.title('')
+st.sidebar.markdown('Reference notebook:')
+st.sidebar.markdown(link, unsafe_allow_html=True)
+### MAIN CONTENT ###
+# st.title("Summarize Text")
+st.markdown('<div class="main-title">State-of-the-Art Text Summarization with Spark NLP</div>', unsafe_allow_html=True)
+st.write("")
+st.write("")
+st.markdown("""<p>This demo uses the <strong>Text-to-Text Transfer Transformer (T5)</strong>, introduced by Google researchers in 2019. T5 casts every NLP task as a <strong>text-to-text problem</strong>, so a single model can perform multiple tasks, each selected by a simple prefix. For text summarization, the input text is prefixed with <strong>"summarize:"</strong>.</p>""", unsafe_allow_html=True)
+# Sample text options
+options = [
+    "Mount Tai is a mountain of historical and cultural significance located north of the city of Tai'an, in Shandong province, China. The tallest peak is the Jade Emperor Peak, which is commonly reported as being 1,545 meters tall, but is officially described by the PRC government as 1,532.7 meters tall. It is associated with sunrise, birth, and renewal, and is often regarded the foremost of the five. Mount Tai has been a place of worship for at least 3,000 years and served as one of the most important ceremonial centers of China during large portions of this period.",
+    "The Guadeloupe amazon (Amazona violacea) is a hypothetical extinct species of parrot that is thought to have been endemic to the Lesser Antillean island region of Guadeloupe. Described by 17th- and 18th-century writers, it is thought to have been related to, or possibly the same as, the extant imperial amazon. A tibiotarsus and an ulna bone from the island of Marie-Galante may belong to the Guadeloupe amazon. According to contemporary descriptions, its head, neck and underparts were mainly violet or slate, mixed with green and black; the back was brownish green; and the wings were green, yellow and red. It had iridescent feathers, and was able to raise a \"ruff\" of feathers around its neck. It fed on fruits and nuts, and the male and female took turns sitting on the nest. French settlers ate the birds and destroyed their habitat. Rare by 1779, the species appears to have become extinct by the end of the 18th century.",
+    "Pierre-Simon, marquis de Laplace (23 March 1749 – 5 March 1827) was a French scholar and polymath whose work was important to the development of engineering, mathematics, statistics, physics, astronomy, and philosophy. He summarized and extended the work of his predecessors in his five-volume Mécanique Céleste (Celestial Mechanics) (1799–1825). This work translated the geometric study of classical mechanics to one based on calculus, opening up a broader range of problems. In statistics, the Bayesian interpretation of probability was developed mainly by Laplace.",
+    "John Snow (15 March 1813 – 16 June 1858) was an English physician and a leader in the development of anaesthesia and medical hygiene. He is considered one of the founders of modern epidemiology, in part because of his work in tracing the source of a cholera outbreak in Soho, London, in 1854, which he curtailed by removing the handle of a water pump. Snow's findings inspired the adoption of anaesthesia as well as fundamental changes in the water and waste systems of London, which led to similar changes in other cities, and a significant improvement in general public health around the world.",
+    "The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. Considered an archetypal masterpiece of the Italian Renaissance, it has been described as \"the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world\". The painting's novel qualities include the subject's enigmatic expression, the monumentality of the composition, the subtle modelling of forms, and the atmospheric illusionism.",
+    """Calculus, originally called infinitesimal calculus or "the calculus of infinitesimals", is the mathematical study of continuous change, in the same way that geometry is the study of shape and algebra is the study of generalizations of arithmetic operations. It has two major branches, differential calculus and integral calculus; the former concerns instantaneous rates of change, and the slopes of curves, while integral calculus concerns accumulation of quantities, and areas under or between curves. These two branches are related to each other by the fundamental theorem of calculus, and they make use of the fundamental notions of convergence of infinite sequences and infinite series to a well-defined limit.[1] Infinitesimal calculus was developed independently in the late 17th century by Isaac Newton and Gottfried Wilhelm Leibniz.[2][3] Today, calculus has widespread uses in science, engineering, and economics.[4] In mathematics education, calculus denotes courses of elementary mathematical analysis, which are mainly devoted to the study of functions and limits. The word calculus (plural calculi) is a Latin word, meaning originally "small pebble" (this meaning is kept in medicine – see Calculus (medicine)). Because such pebbles were used for calculation, the meaning of the word has evolved and today usually means a method of computation. It is therefore used for naming specific methods of calculation and related theories, such as propositional calculus, Ricci calculus, calculus of variations, lambda calculus, and process calculus.""",
+]
+st.subheader("Summarize text to make it shorter while retaining meaning.")
+# Text input options
+selected_text = st.selectbox("Select an example", options)
+custom_input = st.text_input("Try it for yourself!")
+if custom_input:
+    selected_text = custom_input
+st.subheader('Text')
+st.write(selected_text)
+st.subheader("Summary")
+# Generate summary
+spark = init_spark()
+pipeline = create_pipeline(model)
+output = fit_data(pipeline, selected_text)
+st.write(output)
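
Note on the caching in Demo.py: Streamlit reruns the whole script on every widget interaction, so `create_pipeline` is wrapped in `@st.cache_resource` to avoid rebuilding the pipeline each time. The sketch below illustrates the idea only. It uses `functools.lru_cache` as a stand-in for `st.cache_resource` (an assumption made so the example runs without Streamlit or Spark), and the dictionary returned is a hypothetical placeholder for the real Spark ML pipeline.

```python
from functools import lru_cache

# Records every real build so the effect of caching is visible.
build_calls = []


@lru_cache(maxsize=None)  # stand-in for @st.cache_resource (assumption)
def create_pipeline(model_name: str) -> dict:
    """Hypothetical placeholder for the expensive pipeline build in Demo.py."""
    build_calls.append(model_name)
    return {"model": model_name, "task": "summarize:"}


# Simulate three script reruns: two with t5_base selected, one with t5_small.
for choice in ["t5_base", "t5_base", "t5_small"]:
    pipeline = create_pipeline(choice)

print(build_calls)  # ['t5_base', 't5_small']
```

Each distinct model name triggers one build, and repeated selections reuse the cached object. In Demo.py, `fit_data` is not cached, so `pipeline.fit` and the `LightPipeline` wrapper are still rebuilt on every rerun.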

Dockerfile ADDED

	@@ -0,0 +1,76 @@

+#Download base image ubuntu 18.04
+FROM ubuntu:18.04
+ENV NB_USER jovyan
+ENV NB_UID 1000
+ENV HOME /home/${NB_USER}
+ENV PYSPARK_PYTHON=python3
+ENV PYSPARK_DRIVER_PYTHON=python3
+RUN apt-get update && apt-get install -y \
+    tar \
+    wget \
+    bash \
+    rsync \
+    gcc \
+    libfreetype6-dev \
+    libhdf5-serial-dev \
+    libpng-dev \
+    libzmq3-dev \
+    python3 \
+    python3-dev \
+    python3-pip \
+    unzip \
+    pkg-config \
+    software-properties-common \
+    graphviz
+RUN adduser --disabled-password \
+    --gecos "Default user" \
+    --uid ${NB_UID} \
+    ${NB_USER}
+# Install OpenJDK-8
+RUN apt-get update && \
+    apt-get install -y openjdk-8-jdk && \
+    apt-get install -y ant && \
+    apt-get clean;
+# Fix certificate issues
+RUN apt-get update && \
+    apt-get install -y ca-certificates-java && \
+    apt-get clean && \
+    update-ca-certificates -f;
+# Setup JAVA_HOME -- useful for docker commandline
+ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
+RUN export JAVA_HOME
+RUN echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/" >> ~/.bashrc
+RUN apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
+RUN apt-get update
+RUN apt-get install -y software-properties-common
+RUN add-apt-repository ppa:deadsnakes/ppa
+RUN apt-get install -y python3.8 python3-pip
+ENV PYSPARK_PYTHON=python3.8
+ENV PYSPARK_DRIVER_PYTHON=python3.8
+COPY . .
+RUN python3.8 -m pip install --upgrade pip
+RUN python3.8 -m pip install -r requirements.txt
+USER root
+RUN chown -R ${NB_UID} ${HOME}
+USER ${NB_USER}
+WORKDIR ${HOME}
+COPY . .
+EXPOSE 7860
+ENTRYPOINT ["streamlit", "run", "Demo.py", "--server.port=7860", "--server.address=0.0.0.0"]

images/T5_model_diagram.jpg ADDED

Git LFS Details

SHA256: ecdc448c0c71610fa26d4063fd82edb1b6e879d3cb0e17fd2e8d29565a1ccbc4
Pointer size: 132 Bytes
Size of remote file: 3.15 MB
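
The 132-byte pointer size reported above can be reproduced from the Git LFS pointer layout (a `version` line, an `oid sha256:` line, and a `size` line). This is a minimal sketch: the SHA256 is the one shown on this page, but the exact byte count of the image is not given, so `ASSUMED_SIZE` is a hypothetical value chosen to match the reported 3.15 MB, and `lfs_pointer` is an illustrative helper rather than part of any Git LFS tooling.

```python
def lfs_pointer(sha256_hex: str, size_bytes: int) -> str:
    """Build the text of a Git LFS pointer file (spec v1 layout)."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha256_hex}\n"
        f"size {size_bytes}\n"
    )


SHA256 = "ecdc448c0c71610fa26d4063fd82edb1b6e879d3cb0e17fd2e8d29565a1ccbc4"
ASSUMED_SIZE = 3_150_000  # hypothetical: the page only reports "3.15 MB"

pointer = lfs_pointer(SHA256, ASSUMED_SIZE)
pointer_size = len(pointer.encode("utf-8"))
print(pointer_size)  # 132
```

Any seven-digit file size gives the same 132 bytes, which is why the pointer size matches even though the exact size is assumed.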

pages/Workflow & Model Overview.py ADDED

	@@ -0,0 +1,173 @@

+import streamlit as st
+# Custom CSS for better styling
+st.markdown("""
+    <style>
+        .main-title {
+            font-size: 36px;
+            color: #4A90E2;
+            font-weight: bold;
+            text-align: center;
+        }
+        .sub-title {
+            font-size: 24px;
+            color: #333333;
+            margin-top: 20px;
+        }
+        .section {
+            background-color: #f9f9f9;
+            padding: 15px;
+            border-radius: 10px;
+            margin-top: 20px;
+        }
+        .section h2 {
+            font-size: 22px;
+            color: #4A90E2;
+        }
+        .section p, .section ul {
+            color: #666666;
+        }
+        .link {
+            color: #4A90E2;
+            text-decoration: none;
+        }
+    </style>
+""", unsafe_allow_html=True)
+# Introduction
+st.markdown('<div class="main-title">State-of-the-Art Text Summarization with Spark NLP</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>Welcome to the Spark NLP Demos App! Text summarization is a resource-intensive Natural Language Processing (NLP) task, so it benefits from running machine learning models on a distributed system such as Apache Spark.</p>
+    <p>Spark NLP is an open-source library, built in Scala with a Python wrapper, that is widely used for building NLP solutions in enterprises. It offers state-of-the-art machine learning models within an easy-to-use pipeline design compatible with Spark ML.</p>
+</div>
+""", unsafe_allow_html=True)
+# About the T5 Model
+st.markdown('<div class="sub-title">About the T5 Model</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>A standout model for text summarization is the Text-to-Text Transfer Transformer (T5), introduced by Google researchers in 2019. T5 casts every NLP task as a text-to-text problem, so a single model can perform multiple tasks, each selected by a simple prefix. For text summarization, the input text is prefixed with "summarize:".</p>
+    <p>In Spark NLP, the T5 model is available through the T5Transformer annotator. We'll show you how to use Spark NLP in Python to perform text summarization using the T5 model.</p>
+</div>
+""", unsafe_allow_html=True)
+st.image('https://www.johnsnowlabs.com/wp-content/uploads/2023/09/img_blog_2.jpg', caption='Diagram of the T5 model, from the original paper', use_column_width='auto')
+# How to Use the Model
+st.markdown('<div class="sub-title">How to Use the T5 Model with Spark NLP</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>To use the T5Transformer annotator in Spark NLP to perform text summarization, we need to create a pipeline with two stages: the first transforms the input text into an annotation object, and the second stage contains the T5 model.</p>
+</div>
+""", unsafe_allow_html=True)
+st.markdown('### Installation')
+st.code('pip install spark-nlp pyspark', language='bash')
+st.markdown('### Import Libraries and Start Spark Session')
+st.code("""
+import sparknlp
+from pyspark.ml import PipelineModel
+from sparknlp.base import DocumentAssembler
+from sparknlp.annotator import T5Transformer
+# Start the Spark Session
+spark = sparknlp.start()
+""", language='python')
+st.markdown("""
+<div class="section">
+    <p>Now we can define the pipeline to use the T5 model. We'll use the PipelineModel object since we are using the pretrained model and don’t need to train any stage of the pipeline.</p>
+</div>
+""", unsafe_allow_html=True)
+st.markdown('### Define the Pipeline')
+st.code("""
+# Transforms raw texts into `document` annotation
+document_assembler = (
+    DocumentAssembler().setInputCol("text").setOutputCol("documents")
+)
+# The T5 model
+t5 = (
+    T5Transformer.pretrained("t5_small")
+    .setTask("summarize:")
+    .setInputCols(["documents"])
+    .setMaxOutputLength(200)
+    .setOutputCol("t5")
+)
+# Define the Spark pipeline
+pipeline = PipelineModel(stages = [document_assembler, t5])
+""", language='python')
+st.markdown("""
+<div class="section">
+    <p>To use the model, create a Spark DataFrame containing the input data. In this example, we'll work with a single text, but the framework can process many texts in one batch. The DocumentAssembler reads its input from the column set with setInputCol, so the DataFrame needs a column named "text".</p>
+</div>
+""", unsafe_allow_html=True)
+st.markdown('### Create Example DataFrame')
+st.code("""
+example = \"""
+Transfer learning, where a model is first pre-trained on a data-rich task
+before being fine-tuned on a downstream task, has emerged as a powerful
+technique in natural language processing (NLP). The effectiveness of transfer
+learning has given rise to a diversity of approaches, methodology, and
+practice. In this paper, we explore the landscape of transfer learning
+techniques for NLP by introducing a unified framework that converts all
+text-based language problems into a text-to-text format.
+Our systematic study compares pre-training objectives, architectures,
+unlabeled data sets, transfer approaches, and other factors on dozens of
+language understanding tasks. By combining the insights from our exploration
+with scale and our new Colossal Clean Crawled Corpus, we achieve
+state-of-the-art results on many benchmarks covering summarization,
+question answering, text classification, and more. To facilitate future
+work on transfer learning for NLP, we release our data set, pre-trained
+models, and code.
+\"""
+# Name the column "text" so it matches the DocumentAssembler input column
+spark_df = spark.createDataFrame([[example]]).toDF("text")
+""", language='python')
+st.markdown('### Apply the Pipeline')
+st.code("""
+result = pipeline.transform(spark_df)
+result.select("t5.result").show(truncate=False)
+""", language='python')
+st.markdown('<div class="sub-title">Output</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>The summarization output will look something like this:</p>
+    <pre>transfer learning has emerged as a powerful technique in natural language processing (NLP) the effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practice.</pre>
+    <p>Note: We set the maximum output length to 200. Adjust this parameter to suit the length of the original text.</p>
+</div>
+""", unsafe_allow_html=True)
+# Additional Resources and References
+st.markdown('<div class="sub-title">Additional Resources and References</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <ul>
+        <li><a class="link" href="https://sparknlp.org/docs/en/transformers#t5transformer" target="_blank">T5Transformer documentation page</a></li>
+        <li><a class="link" href="https://arxiv.org/abs/1910.10683" target="_blank">T5 paper</a></li>
+        <li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started with Spark NLP</a></li>
+        <li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>
+        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>
+        <li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>
+    </ul>
+</div>
+""", unsafe_allow_html=True)
+st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <ul>
+        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
+        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
+        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
+        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
+        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
+    </ul>
+</div>
+""", unsafe_allow_html=True)
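
Note on the task prefix described in the "About the T5 Model" section: conceptually, the prefix is plain text placed in front of the input, and it tells the model which task to perform. The sketch below shows that idea in isolation. `with_task_prefix` is a hypothetical helper, not a Spark NLP function, and the exact way `T5Transformer.setTask` joins prefix and text internally is an assumption here. The prefixes themselves follow the T5 paper.

```python
def with_task_prefix(task: str, text: str) -> str:
    """Prepend a T5-style task prefix to the input text (illustrative only)."""
    return f"{task.strip()} {text.strip()}"


text = "Transfer learning has emerged as a powerful technique in NLP."

# The same model input format serves different tasks; only the prefix changes.
summarize_input = with_task_prefix("summarize:", text)
translate_input = with_task_prefix("translate English to German:", text)

print(summarize_input)
print(translate_input)
```

In the pipeline shown on this page, `setTask("summarize:")` plays this role, so callers pass only the raw text.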

requirements.txt ADDED

	@@ -0,0 +1,5 @@

+streamlit
+pandas
+numpy
+spark-nlp
+pyspark