import streamlit as st

# Custom CSS for better styling
st.markdown("""
""", unsafe_allow_html=True)

# Introduction
st.markdown('Scaling Up Text Analysis: Best Practices with Spark NLP n-gram Generation', unsafe_allow_html=True)

st.markdown("""

Welcome to the Spark NLP n-gram Generation Demo App! N-gram generation is a crucial task in Natural Language Processing (NLP) that involves extracting contiguous sequences of n words from text. This is essential for capturing context and identifying meaningful phrases in natural language.

Using Spark NLP, it is possible to efficiently generate n-grams from large-scale text data. This app demonstrates how to use the NGramGenerator annotator to generate n-grams and provides best practices for scaling up text analysis tasks with Spark NLP.

""", unsafe_allow_html=True) # About N-gram Generation st.markdown('
About N-gram Generation
', unsafe_allow_html=True) st.markdown("""

An n-gram is a contiguous sequence of n words drawn from text; generating n-grams is a valuable technique for capturing context and identifying meaningful phrases in natural language. Because Spark NLP runs on Apache Spark's distributed computing engine, researchers and practitioners can scale up their text analysis tasks and unlock valuable insights from large volumes of text data.

The NGramGenerator annotator in Spark NLP simplifies the process of generating n-grams by seamlessly integrating with Apache Spark’s distributed computing capabilities. This allows for efficient, accurate, and scalable text analysis.

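For intuition, word-level n-gram extraction can be sketched in a few lines of plain Python (an illustrative toy, not Spark NLP's implementation):

```python
def word_ngrams(text, n=2):
    # All contiguous n-word sequences from a whitespace-tokenized string
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("Spark NLP provides powerful text analysis tools"))
# → ['Spark NLP', 'NLP provides', 'provides powerful', 'powerful text', 'text analysis', 'analysis tools']
```

Spark NLP applies the same idea, but distributes the work across a cluster via Apache Spark.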
""", unsafe_allow_html=True) st.image('images/ngram-visual.png', use_column_width='auto') # Using NGramGenerator in Spark NLP st.markdown('
Using NGramGenerator in Spark NLP
', unsafe_allow_html=True) st.markdown("""

The NGramGenerator annotator in Spark NLP generates n-grams from tokenized text. It supports several configurations and integrates easily into NLP pipelines for comprehensive text analysis.

The NGramGenerator annotator offers:

- `setN(n)` — the number of tokens per n-gram (default: 2)
- `setEnableCumulative(value)` — when `True`, emits every 1-gram up through n-gram rather than only n-grams (default: `False`)
- `setDelimiter(value)` — the string used to join the tokens of each n-gram (default: a single space)

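NGramGenerator's cumulative mode (`enableCumulative`) can be sketched in plain Python for intuition (illustrative only, not the annotator's actual code):

```python
def cumulative_ngrams(tokens, max_n=2, delimiter=" "):
    # Every 1-gram through max_n-gram, mirroring enableCumulative=True
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(delimiter.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

print(cumulative_ngrams(["This", "is", "an", "example"]))
# → ['This', 'is', 'an', 'example', 'This is', 'is an', 'an example']
```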
""", unsafe_allow_html=True) st.markdown('

Example Usage in Python

', unsafe_allow_html=True) st.markdown('

Here’s how you can implement the NGramGenerator annotator in Spark NLP:

', unsafe_allow_html=True) # Setup Instructions st.markdown('
Setup
', unsafe_allow_html=True) st.markdown('

To install Spark NLP in Python, use your favorite package manager (conda, pip, etc.). For example:

', unsafe_allow_html=True)

st.code("""
pip install spark-nlp
pip install pyspark
""", language="bash")

st.markdown("Then, import Spark NLP and start a Spark session:", unsafe_allow_html=True)

st.code("""
import sparknlp

# Start Spark Session
spark = sparknlp.start()
""", language='python')

# Single N-gram Generation Example
st.markdown('
Example Usage: Single N-gram Generation with NGramGenerator
', unsafe_allow_html=True)

st.code('''
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import NGramGenerator, Tokenizer
from pyspark.ml import Pipeline

# Start Spark NLP Session
spark = sparknlp.start()

# Sample Data
data = [("1", "This is an example sentence."),
        ("2", "Spark NLP provides powerful text analysis tools.")]
df = spark.createDataFrame(data, ["id", "text"])

# Document Assembler
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Tokenizer
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# NGramGenerator
ngram = NGramGenerator().setN(2).setInputCols(["token"]).setOutputCol("ngrams")

# Building Pipeline
pipeline = Pipeline(stages=[document_assembler, tokenizer, ngram])

# Fit and Transform
model = pipeline.fit(df)
result = model.transform(df)

# Display Results
result.select("ngrams.result").show(truncate=False)
''', language='python')

st.text("""
+---------------------------------------------------------------------------------------------------+
|result                                                                                             |
+---------------------------------------------------------------------------------------------------+
|[This is, is an, an example, example sentence, sentence .]                                         |
|[Spark NLP, NLP provides, provides powerful, powerful text, text analysis, analysis tools, tools .]|
+---------------------------------------------------------------------------------------------------+
""")

st.markdown("""

The code snippet demonstrates how to set up a pipeline in Spark NLP to generate n-grams using the NGramGenerator annotator. The resulting output shows the generated bigrams from the input text.

""", unsafe_allow_html=True) # Multi-language N-gram Generation st.markdown('
Scaling Up Text Analysis
', unsafe_allow_html=True) st.markdown("""

In the era of big data, scaling up text analysis tasks is paramount for deriving meaningful insights from vast amounts of textual data. Spark NLP, with its integration with Apache Spark, offers a powerful solution for efficiently processing large-scale text data.

The NGramGenerator annotator in Spark NLP provides an essential tool for generating n-grams from text, enabling the extraction of contextual information and the identification of meaningful phrases.

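A common downstream step at scale is counting n-gram frequencies to surface recurring phrases. A single-machine sketch of the idea (a hypothetical helper, not part of Spark NLP; Spark would distribute the equivalent group-and-count across a cluster):

```python
from collections import Counter

def top_ngrams(texts, n=2, k=3):
    # Count word n-grams across documents; return the k most frequent
    counts = Counter()
    for text in texts:
        tokens = text.split()
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

print(top_ngrams(["big data big data tools", "big data tools"], n=2, k=2))
# → [('big data', 3), ('data tools', 2)]
```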
""", unsafe_allow_html=True) # Summary st.markdown('
Summary
', unsafe_allow_html=True) st.markdown("""

In this demo app, we explored how to generate n-grams using the NGramGenerator annotator in Spark NLP. This is a crucial step in text analysis, allowing us to capture the context and identify meaningful phrases from text data.

Spark NLP, with its integration with Apache Spark, provides a powerful and scalable solution for processing large-scale text data efficiently and accurately.

""", unsafe_allow_html=True) st.markdown("""

Thank you for using the Spark NLP n-gram Generation Demo App. We hope you found it useful and informative!

""", unsafe_allow_html=True) # References and Additional Information st.markdown('
For additional information, please check the following references.
', unsafe_allow_html=True) st.markdown("""
""", unsafe_allow_html=True) st.markdown('
Community & Support
', unsafe_allow_html=True) st.markdown("""
""", unsafe_allow_html=True)