abdullahmubeen10 committed
Commit 4f52f17
1 Parent(s): d8be452
Upload 12 files
- .streamlit/config.toml +3 -0
- Demo.py +148 -0
- Dockerfile +70 -0
- images/ngram-generation.jpg +0 -0
- images/ngram-visual.png +0 -0
- inputs/date_matcher/Example1.txt +5 -0
- inputs/date_matcher/Example2.txt +5 -0
- inputs/date_matcher/Example3.txt +6 -0
- inputs/date_matcher/Example4.txt +4 -0
- inputs/date_matcher/Example5.txt +3 -0
- pages/Workflow & Model Overview.py +191 -0
- requirements.txt +5 -0
.streamlit/config.toml
ADDED
@@ -0,0 +1,3 @@
+[theme]
+base="light"
+primaryColor="#29B4E8"
Demo.py
ADDED
@@ -0,0 +1,148 @@
+import streamlit as st
+import sparknlp
+from sparknlp.base import DocumentAssembler, LightPipeline
+from sparknlp.annotator import Tokenizer, NGramGenerator
+from pyspark.ml import Pipeline
+import pandas as pd
+
+# Page configuration
+st.set_page_config(
+    layout="wide",
+    page_title="Spark NLP Demos App",
+    initial_sidebar_state="auto"
+)
+
+# CSS for styling
+st.markdown("""
+    <style>
+        .main-title {
+            font-size: 36px;
+            color: #4A90E2;
+            font-weight: bold;
+            text-align: center;
+        }
+        .section p, .section ul {
+            color: #666666;
+        }
+        .box {
+            text-align: left;
+            font-family: "IBM Plex Sans", sans-serif;
+            font-weight: normal;
+            width: 100%;
+            box-sizing: border-box;
+            position: relative;
+            font-size: 14px !important;
+            line-height: 26px !important;
+            color: #536B76 !important;
+        }
+        h3 {
+            text-align: center;
+            box-sizing: border-box;
+            padding: 0;
+            margin: 25px 0 0 !important;
+            font-family: 'Montserrat', sans-serif !important;
+            font-weight: 500 !important;
+            font-size: 18px !important;
+            line-height: 22px;
+            color: #1E77B7 !important;
+        }
+    </style>
+""", unsafe_allow_html=True)
+
+# Initialize Spark NLP
+@st.cache_resource
+def init_spark():
+    return sparknlp.start()
+
+# Create the Spark NLP pipeline
+@st.cache_resource
+def create_pipeline(n):
+    document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
+    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
+    ngram = NGramGenerator().setN(n).setInputCols(["token"]).setOutputCol("ngrams")
+    pipeline = Pipeline(stages=[document_assembler, tokenizer, ngram])
+    return pipeline
+
+# Fit data to the pipeline and return the annotations
+def fit_data(pipeline, data):
+    df = spark.createDataFrame([[data]]).toDF("text")
+    model = pipeline.fit(df)
+    light_pipeline = LightPipeline(model)
+    results = light_pipeline.fullAnnotate(data)
+    return results
+
+# Set up the page layout
+st.markdown('<div class="main-title">State-of-the-Art NGram Generation with Spark NLP</div>', unsafe_allow_html=True)
+st.markdown("<h3>Generate meaningful n-grams from text data using Spark NLP's efficient and scalable NGramGenerator, capturing context and identifying key phrases even in large-scale, noisy datasets.</h3>", unsafe_allow_html=True)
+st.write("")
+
+# Sidebar configuration
+NGram_selection_list = {"Unigram": 1, "Bigram": 2, "Trigram": 3}
+NGram = st.sidebar.selectbox(
+    "Choose an NGram specification",
+    list(NGram_selection_list.keys()),
+    help="For more info about the models visit: https://sparknlp.org/models"
+)
+
+# Add the Colab link for the notebook
+colab_link = """
+<a href="https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb">
+    <img src="https://colab.research.google.com/assets/colab-badge.svg" style="zoom: 1.3" alt="Open In Colab"/>
+</a>
+"""
+st.sidebar.title('Reference notebook:')
+st.sidebar.markdown(colab_link, unsafe_allow_html=True)
+
+# Sample texts for n-gram generation
+examples = [
+    "Brexit: U.K to ban more EU citizens with criminal records. In a recent Home Office meeting a consensus has been made to tighten border policies. However a no-deal Brexit could make it harder to identify foreign criminals; With the U.K in a transition period since it formally left the EU in January, an EU citizen can currently only be refused entry if they present a genuine, present and serious threat.",
+    "Harry Harding on the US, China, and a ‘Cold War 2.0’. “Calling it a second Cold War is misleading, but to deny that it’s a Cold War is also disingenuous.”, Harding is a specialist on Asia and U.S.-Asian relations. His major publications include Organizing China: The Problem of Bureaucracy, 1949-1966.The phrase “new Cold War” is an example of the use of analogies in understanding the world.The world is a very complicated place.People like to find ways of coming to a clearer and simpler understanding.",
+    "Tesla’s latest quarterly numbers beat analyst expectations on both revenue and earnings per share, bringing in $8.77 billion in revenues for the third quarter.That’s up 39% from the year-ago period.Wall Street had expected $8.36 billion in revenue for the quarter, according to estimates published by CNBC. Revenue grew 30% year-on-year, something the company attributed to substantial growth in vehicle deliveries, and operating income also grew to $809 million, showing improving operating margins to 9.2%.",
+    "2020 is another year that is consistent with a rapidly changing Arctic.Without a systematic reduction in greenhouse gases, the likelihood of our first ‘ice-free’ summer will continue to increase by the mid-21st century;It is already well known that a smaller ice sheet means less of a white area to reflect the sun’s heat back into space. But this is not the only reason the Arctic is warming more than twice as fast as the global average",
+    "HBR: The world is changing in rapid, unprecedented ways, but one thing remains certain: as businesses look to embed lessons learned in recent months and to build enterprise resilience for the future, they are due for even more transformation.As such, most organizations are voraciously evaluating existing and future technologies to see if they’ll be able to deliver the innovation at scale that they’ll need to survive and thrive.However, technology should not be central to these transformation efforts; people should."
+]
+
+# User input for text selection
+selected_text = st.selectbox("Select a sample text", examples)
+custom_input = st.text_input("Try it for yourself!")
+
+if custom_input:
+    selected_text = custom_input
+
+st.subheader('Selected Text')
+st.write(selected_text)
+
+# Run the pipeline and display results
+spark = init_spark()
+pipeline = create_pipeline(NGram_selection_list[NGram])
+output = fit_data(pipeline, selected_text)
+
+# Display the generated n-grams
+st.subheader('Generated NGrams')
+
+data = [ngram.result for ngram in output[0]['ngrams']]
+df = pd.DataFrame(data, columns=["ngrams"])
+st.dataframe(df)
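Note: create_pipeline builds the pipeline for a single n-gram size. NGramGenerator also exposes an enableCumulative parameter that emits every size up to N in one pass; the sketch below is my own illustration of that option using the same stages, not one of the committed files:

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, NGramGenerator
from pyspark.ml import Pipeline

# Sketch only (not part of this commit): cumulative n-gram generation.
spark = sparknlp.start()

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# With enableCumulative=True the annotator returns unigrams, bigrams, and
# trigrams together, rather than only the single size set by setN.
ngram = (NGramGenerator()
         .setN(3)
         .setEnableCumulative(True)
         .setInputCols(["token"])
         .setOutputCol("ngrams"))

pipeline = Pipeline(stages=[document_assembler, tokenizer, ngram])
df = spark.createDataFrame([["Spark NLP scales text analysis."]]).toDF("text")
pipeline.fit(df).transform(df).select("ngrams.result").show(truncate=False)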
Dockerfile
ADDED
@@ -0,0 +1,70 @@
+# Download base image ubuntu 18.04
+FROM ubuntu:18.04
+
+# Set environment variables
+ENV NB_USER jovyan
+ENV NB_UID 1000
+ENV HOME /home/${NB_USER}
+
+# Install required packages
+RUN apt-get update && apt-get install -y \
+    tar \
+    wget \
+    bash \
+    rsync \
+    gcc \
+    libfreetype6-dev \
+    libhdf5-serial-dev \
+    libpng-dev \
+    libzmq3-dev \
+    python3 \
+    python3-dev \
+    python3-pip \
+    unzip \
+    pkg-config \
+    software-properties-common \
+    graphviz \
+    openjdk-8-jdk \
+    ant \
+    ca-certificates-java \
+    && apt-get clean \
+    && update-ca-certificates -f;
+
+# Install Python 3.8 and pip
+RUN add-apt-repository ppa:deadsnakes/ppa \
+    && apt-get update \
+    && apt-get install -y python3.8 python3-pip \
+    && apt-get clean;
+
+# Set up JAVA_HOME
+ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
+RUN mkdir -p ${HOME} \
+    && echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/" >> ${HOME}/.bashrc \
+    && chown -R ${NB_UID}:${NB_UID} ${HOME}
+
+# Create a new user named "jovyan" with user ID 1000
+RUN useradd -m -u ${NB_UID} ${NB_USER}
+
+# Switch to the "jovyan" user
+USER ${NB_USER}
+
+# Set home and path variables for the user
+ENV HOME=/home/${NB_USER} \
+    PATH=/home/${NB_USER}/.local/bin:$PATH
+
+# Set the working directory to the user's home directory
+WORKDIR ${HOME}
+
+# Upgrade pip and install Python dependencies
+RUN python3.8 -m pip install --upgrade pip
+COPY requirements.txt /tmp/requirements.txt
+RUN python3.8 -m pip install -r /tmp/requirements.txt
+
+# Copy the application code into the container at /home/jovyan
+COPY --chown=${NB_USER}:${NB_USER} . ${HOME}
+
+# Expose port for Streamlit
+EXPOSE 7860
+
+# Define the entry point for the container
+ENTRYPOINT ["streamlit", "run", "Demo.py", "--server.port=7860", "--server.address=0.0.0.0"]
images/ngram-generation.jpg
ADDED
images/ngram-visual.png
ADDED
inputs/date_matcher/Example1.txt
ADDED
@@ -0,0 +1,5 @@
+David visited the restaurant yesterday with his family. He also visited and the day before, but at ...
+David visited the restaurant yesterday with his family.
+He also visited and the day before, but at that time he was alone.
+David again visited today with his colleagues.
+He and his friends really liked the food and hoped to visit again tomorrow.
inputs/date_matcher/Example2.txt
ADDED
@@ -0,0 +1,5 @@
+In March 2003 she was seen in the office and appeared to be extremely disturbed emotionally. On 2003...
+In March 2003 she was seen in the office and appeared to be extremely disturbed emotionally.
+On 2003-04-04 she again visited and talked about the effects of the medication she has been taking, and seemed positive and in much better shape.
+She again visited on Fri, 22/4/2003 and looked better.
+She has been working out and taking her medicines since April 1st 2003.
inputs/date_matcher/Example3.txt
ADDED
@@ -0,0 +1,6 @@
+I have a very busy schedule these days. I have meetings from 7pm. till 11pm. I have 3 meetings the d...
+I have a very busy schedule these days. I have meetings from 7pm. till 11pm.
+I have 3 meetings the day after, and have submission deadlines approaching as well.
+By next mon I have to finalise the architecture, for which i'll have to hold multiple meetings with ARM.
+Then i'll have to discuss dev plans by next tuesday and develop a thorough plan.
+The plan should be ready by Nov 30th.
inputs/date_matcher/Example4.txt
ADDED
@@ -0,0 +1,4 @@
+When Tom visited the Bahamas last year, it was his first time travelling. Since then he was travelle...
+When Tom visited the Bahamas last year, it was his first time travelling.
+Since then he was travelled a lot. For example, he visited Hawaii last week.
+The last time we talked, he was planning to travel to Alaska next month.
inputs/date_matcher/Example5.txt
ADDED
@@ -0,0 +1,3 @@
+Isn't it weird that all my family members have the same birth day and month? All of us were born on ...
+Isn't it weird that all my family members have the same birth day and month? All of us were born on 1st Jan
+Dad was born on 01/01/1900. Mom has a birth date of 1st Jan 1902. And I was born on 2000/01/01
pages/Workflow & Model Overview.py
ADDED
@@ -0,0 +1,191 @@
+import streamlit as st
+
+# Custom CSS for better styling
+st.markdown("""
+    <style>
+        .main-title {
+            font-size: 36px;
+            color: #4A90E2;
+            font-weight: bold;
+            text-align: center;
+        }
+        .sub-title {
+            font-size: 24px;
+            color: #4A90E2;
+            margin-top: 20px;
+        }
+        .section {
+            background-color: #f9f9f9;
+            padding: 15px;
+            border-radius: 10px;
+            margin-top: 20px;
+        }
+        .section h2 {
+            font-size: 22px;
+            color: #4A90E2;
+        }
+        .section p, .section ul {
+            color: #666666;
+        }
+        .link {
+            color: #4A90E2;
+            text-decoration: none;
+        }
+    </style>
+""", unsafe_allow_html=True)
+
+# Introduction
+st.markdown('<div class="main-title">Scaling Up Text Analysis: Best Practices with Spark NLP n-gram Generation</div>', unsafe_allow_html=True)
+
+st.markdown("""
+<div class="section">
+    <p>Welcome to the Spark NLP n-gram Generation Demo App! N-gram generation is a crucial task in Natural Language Processing (NLP) that involves extracting contiguous sequences of n words from text. This is essential for capturing context and identifying meaningful phrases in natural language.</p>
+    <p>Using Spark NLP, it is possible to efficiently generate n-grams from large-scale text data. This app demonstrates how to use the NGramGenerator annotator to generate n-grams and provides best practices for scaling up text analysis tasks with Spark NLP.</p>
+</div>
+""", unsafe_allow_html=True)
+
+# About N-gram Generation
+st.markdown('<div class="sub-title">About N-gram Generation</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>N-gram generation involves extracting contiguous sequences of n words from text. It is a valuable technique for capturing context and identifying meaningful phrases in natural language. With Spark NLP’s ability to handle distributed computing, researchers and practitioners can scale up their text analysis tasks and unlock valuable insights from large volumes of text data.</p>
+    <p>The NGramGenerator annotator in Spark NLP simplifies the process of generating n-grams by seamlessly integrating with Apache Spark’s distributed computing capabilities. This allows for efficient, accurate, and scalable text analysis.</p>
+</div>
+""", unsafe_allow_html=True)
+st.image('images/ngram-visual.png', use_column_width='auto')
+
+# Using NGramGenerator in Spark NLP
+st.markdown('<div class="sub-title">Using NGramGenerator in Spark NLP</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>The NGramGenerator annotator in Spark NLP allows users to generate n-grams from text data. This annotator supports various configurations and can be easily integrated into NLP pipelines for comprehensive text analysis.</p>
+    <p>The NGramGenerator annotator in Spark NLP offers:</p>
+    <ul>
+        <li>Efficient n-gram generation for large-scale text data</li>
+        <li>Support for various n-gram configurations (unigrams, bigrams, trigrams, etc.)</li>
+        <li>Seamless integration with other Spark NLP components for comprehensive NLP pipelines</li>
+    </ul>
+</div>
+""", unsafe_allow_html=True)
+
+st.markdown('<h2 class="sub-title">Example Usage in Python</h2>', unsafe_allow_html=True)
+st.markdown('<p>Here’s how you can implement the NGramGenerator annotator in Spark NLP:</p>', unsafe_allow_html=True)
+
+# Setup Instructions
+st.markdown('<div class="sub-title">Setup</div>', unsafe_allow_html=True)
+st.markdown('<p>To install Spark NLP in Python, use your favorite package manager (conda, pip, etc.). For example:</p>', unsafe_allow_html=True)
+st.code("""
+pip install spark-nlp
+pip install pyspark
+""", language="bash")
+
+st.markdown("<p>Then, import Spark NLP and start a Spark session:</p>", unsafe_allow_html=True)
+st.code("""
+import sparknlp
+
+# Start Spark Session
+spark = sparknlp.start()
+""", language='python')
+
+# Single N-gram Generation Example
+st.markdown('<div class="sub-title">Example Usage: Single N-gram Generation with NGramGenerator</div>', unsafe_allow_html=True)
+st.code('''
+import sparknlp
+from sparknlp.base import DocumentAssembler
+from sparknlp.annotator import NGramGenerator, Tokenizer
+from pyspark.ml import Pipeline
+
+# Start Spark NLP Session
+spark = sparknlp.start()
+
+# Sample Data
+data = [("1", "This is an example sentence."),
+        ("2", "Spark NLP provides powerful text analysis tools.")]
+
+df = spark.createDataFrame(data, ["id", "text"])
+
+# Document Assembler
+document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
+
+# Tokenizer
+tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
+
+# NGramGenerator
+ngram = NGramGenerator().setN(2).setInputCols(["token"]).setOutputCol("ngrams")
+
+# Build the Pipeline
+pipeline = Pipeline(stages=[document_assembler, tokenizer, ngram])
+
+# Fit and Transform
+model = pipeline.fit(df)
+result = model.transform(df)
+
+# Display Results
+result.select("ngrams.result").show(truncate=False)
+''', language='python')
+
+st.text("""
++---------------------------------------------------------------------------------------------------+
+|result                                                                                             |
++---------------------------------------------------------------------------------------------------+
+|[This is, is an, an example, example sentence, sentence .]                                         |
+|[Spark NLP, NLP provides, provides powerful, powerful text, text analysis, analysis tools, tools .]|
++---------------------------------------------------------------------------------------------------+
+""")
+
+st.markdown("""
+<p>The code snippet demonstrates how to set up a pipeline in Spark NLP to generate n-grams using the NGramGenerator annotator. The resulting output shows the generated bigrams from the input text.</p>
+""", unsafe_allow_html=True)
+
+# Scaling Up Text Analysis
+st.markdown('<div class="sub-title">Scaling Up Text Analysis</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>In the era of big data, scaling up text analysis tasks is paramount for deriving meaningful insights from vast amounts of textual data. Spark NLP, with its integration with Apache Spark, offers a powerful solution for efficiently processing large-scale text data.</p>
+    <p>The NGramGenerator annotator in Spark NLP provides an essential tool for generating n-grams from text, enabling the extraction of contextual information and the identification of meaningful phrases.</p>
+</div>
+""", unsafe_allow_html=True)
+
+# Summary
+st.markdown('<div class="sub-title">Summary</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>In this demo app, we explored how to generate n-grams using the NGramGenerator annotator in Spark NLP. This is a crucial step in text analysis, allowing us to capture the context and identify meaningful phrases from text data.</p>
+    <p>Spark NLP, with its integration with Apache Spark, provides a powerful and scalable solution for processing large-scale text data efficiently and accurately.</p>
+</div>
+""", unsafe_allow_html=True)
+
+st.markdown("""
+<div class="section">
+    <p>Thank you for using the Spark NLP n-gram Generation Demo App. We hope you found it useful and informative!</p>
+</div>
+""", unsafe_allow_html=True)
+
+# References and Additional Information
+st.markdown('<div class="sub-title">For additional information, please check the following references.</div>', unsafe_allow_html=True)
+
+st.markdown("""
+<div class="section">
+    <ul>
+        <li>Documentation: <a href="https://nlp.johnsnowlabs.com/docs/en/annotators#ngramgenerator" target="_blank" rel="noopener">NGramGenerator</a></li>
+        <li>Python Docs: <a href="https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/ngramgenerator/index.html" target="_blank" rel="noopener">NGramGenerator</a></li>
+        <li>Scala Docs: <a href="https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/NGramGenerator.html" target="_blank" rel="noopener">NGramGenerator</a></li>
+        <li>For extended examples of usage, see the <a href="https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb" target="_blank" rel="noopener">Spark NLP Workshop repository</a>.</li>
+    </ul>
+</div>
+""", unsafe_allow_html=True)
+
+st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <ul>
+        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
+        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
+        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
+        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
+        <li><a class="link" href="https://twitter.com/spark_nlp" target="_blank">Twitter</a>: Announcements and updates</li>
+    </ul>
+</div>
+""", unsafe_allow_html=True)
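Note: Demo.py annotates one text per request, which is the case LightPipeline is designed for: it runs a fitted PipelineModel on plain strings on the driver, skipping the overhead of a distributed job. A minimal sketch of that pattern, assuming the same stages used in this commit (it is not itself a committed file):

import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import Tokenizer, NGramGenerator
from pyspark.ml import Pipeline

# Sketch only (not part of this commit): single-string annotation.
spark = sparknlp.start()

pipeline = Pipeline(stages=[
    DocumentAssembler().setInputCol("text").setOutputCol("document"),
    Tokenizer().setInputCols(["document"]).setOutputCol("token"),
    NGramGenerator().setN(2).setInputCols(["token"]).setOutputCol("ngrams"),
])

# Fit once on a trivial DataFrame to obtain a PipelineModel, then reuse it.
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light = LightPipeline(model)

print(light.annotate("Spark NLP provides powerful text analysis tools.")["ngrams"])
# e.g. ['Spark NLP', 'NLP provides', 'provides powerful', 'powerful text', ...]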
requirements.txt
ADDED
@@ -0,0 +1,5 @@
+streamlit
+pandas
+numpy
+spark-nlp
+pyspark
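Note: requirements.txt leaves all five dependencies unpinned. A quick smoke test after installing them (a sketch, not part of the commit) is to start a session and print the resolved versions:

import sparknlp

# Sketch only: confirm spark-nlp and pyspark import and a session starts.
spark = sparknlp.start()
print("Spark NLP:", sparknlp.version())
print("Apache Spark:", spark.version)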