import streamlit as st
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import DateMatcher, MultiDateMatcher
from pyspark.sql.types import StringType
import pyspark.sql.functions as F
import sparknlp

# Custom CSS for better styling
st.markdown("""
""", unsafe_allow_html=True)

# Introduction
st.markdown('# Date Extraction with Spark NLP', unsafe_allow_html=True)
st.markdown("""

Welcome to the Spark NLP Date Extraction Demo App! Date extraction is a crucial task in Natural Language Processing (NLP) that involves identifying and extracting references to dates in text data. This can be useful for a wide range of applications, such as event scheduling, social media monitoring, and financial forecasting.

Using Spark NLP, you can identify and extract both explicit dates and relative expressions such as "tomorrow" or "next year" from text. This app demonstrates how to use the DateMatcher and MultiDateMatcher annotators to do exactly that.

""", unsafe_allow_html=True) st.image('images/Extracting-Exact-Dates.jpg', use_column_width='auto') # About Date Extraction st.markdown('
About Date Extraction
', unsafe_allow_html=True) st.markdown("""

Date extraction involves identifying and extracting references to dates in text data. This can be achieved using various techniques such as regular expressions, Named Entity Recognition (NER), and rule-based systems.

Spark NLP provides powerful tools for date extraction, including the DateMatcher and MultiDateMatcher annotators, which use pattern matching to extract date expressions from text.

""", unsafe_allow_html=True) # Using DateMatcher in Spark NLP st.markdown('
# Using DateMatcher in Spark NLP
st.markdown('## Using DateMatcher in Spark NLP', unsafe_allow_html=True)
st.markdown("""

The DateMatcher annotator in Spark NLP extracts a single date expression from each input text. It recognizes dates written in many common formats, as well as relative expressions such as "tomorrow" or "next monday", and returns the match in a format of your choosing.

Key features include a configurable output format (setOutputFormat) and anchor-date settings (setAnchorDateYear, setAnchorDateMonth, setAnchorDateDay) that control how relative dates are resolved.

""", unsafe_allow_html=True) st.markdown('

Example Usage in Python

', unsafe_allow_html=True) st.markdown('

Here’s how you can implement DateMatcher and MultiDateMatcher annotators in Spark NLP:

', unsafe_allow_html=True) # Setup Instructions st.markdown('
Setup
', unsafe_allow_html=True) st.markdown('

To install Spark NLP in Python, use your favorite package manager (conda, pip, etc.). For example:

', unsafe_allow_html=True) st.code(""" pip install spark-nlp pip install pyspark """, language="bash") st.markdown("

st.markdown("Then, import Spark NLP and start a Spark session:", unsafe_allow_html=True)
st.code("""
import sparknlp

# Start Spark Session
spark = sparknlp.start()
""", language='python')
# Single Date Extraction Example
st.markdown('### Example Usage: Single Date Extraction with DateMatcher', unsafe_allow_html=True)
st.code('''
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import DateMatcher
from pyspark.sql.types import StringType
import pyspark.sql.functions as F

# Step 1: Transform raw texts into `document` annotations
document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

# Step 2: Extract a single date from each text
date_matcher = (
    DateMatcher()
    .setInputCols("document")
    .setOutputCol("date")
    .setOutputFormat("yyyy/MM/dd")
)

nlp_pipeline = Pipeline(stages=[document_assembler, date_matcher])

text_list = ["See you on next monday.",
             "She was born on 02/03/1966.",
             "The project started yesterday and will finish next year.",
             "She will graduate by July 2023.",
             "She will visit doctor tomorrow and next month again."]

# Create a dataframe
spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

# Fit the pipeline and get predictions
result = nlp_pipeline.fit(spark_df).transform(spark_df)

# Display the extracted date information
result.selectExpr("text", "date.result as date").show(truncate=False)
''', language='python')

st.text("""
+--------------------------------------------------------+------------+
|text                                                    |date        |
+--------------------------------------------------------+------------+
|See you on next monday.                                 |[2024/07/08]|
|She was born on 02/03/1966.                             |[1966/02/03]|
|The project started yesterday and will finish next year.|[2025/07/06]|
|She will graduate by July 2023.                         |[2023/07/01]|
|She will visit doctor tomorrow and next month again.    |[2024/08/06]|
+--------------------------------------------------------+------------+
""")

st.markdown("""

The code snippet sets up a Spark NLP pipeline that uses the DateMatcher annotator to extract a single date from each text. The resulting DataFrame contains the matched dates in the requested yyyy/MM/dd format.

""", unsafe_allow_html=True) # Using MultiDateMatcher in Spark NLP st.markdown('
# Using MultiDateMatcher in Spark NLP
st.markdown('## Using MultiDateMatcher in Spark NLP', unsafe_allow_html=True)
st.markdown("""

The MultiDateMatcher annotator in Spark NLP extends the capabilities of the DateMatcher by extracting multiple date expressions from the same text. This is useful when a single text mentions several dates.

It supports the same configuration options as the DateMatcher, such as the output format and anchor-date settings, but returns every date expression it finds in a document rather than just one.

""", unsafe_allow_html=True) # Multi Date Extraction Example st.markdown('
Example Usage: Multiple Date Extraction with MultiDateMatcher
', unsafe_allow_html=True) st.code(''' from sparknlp.annotator import MultiDateMatcher # Step 1: Transforms raw texts to `document` annotation document_assembler = ( DocumentAssembler() .setInputCol("text") .setOutputCol("document") ) # Step 2: Extracts multiple date information from text multi_date_matcher = ( MultiDateMatcher() .setInputCols("document") .setOutputCol("multi_date") .setOutputFormat("MM/dd/yy") ) nlp_pipeline = Pipeline(stages=[document_assembler, multi_date_matcher]) text_list = ["See you on next monday.", "She was born on 02/03/1966.", "The project started yesterday and will finish next year.", "She will graduate by July 2023.", "She will visit doctor tomorrow and next month again."] # Create a dataframe spark_df = spark.createDataFrame(text_list, StringType()).toDF("text") # Fit the pipeline and get predictions result = nlp_pipeline.fit(spark_df).transform(spark_df) # Display the extracted date information result.selectExpr("text", "multi_date.result as multi_date").show(truncate=False) ''', language='python') st.text(""" +--------------------------------------------------------+--------------------+ |text |multi_date | +--------------------------------------------------------+--------------------+ |See you on next monday. |[07/08/24] | |She was born on 02/03/1966. |[02/03/66] | |The project started yesterday and will finish next year.|[07/06/25, 07/05/24]| |She will graduate by July 2023. |[07/01/23] | |She will visit doctor tomorrow and next month again. |[08/06/24, 07/07/24]| +--------------------------------------------------------+--------------------+ """) st.markdown("""

The code snippet demonstrates how to set up a pipeline in Spark NLP to extract multiple date patterns from text data using the MultiDateMatcher annotator. The resulting DataFrame contains the matched date patterns.

""", unsafe_allow_html=True) # Handling Relative Dates st.markdown('
# Handling Relative Dates
st.markdown('## Handling Relative Dates', unsafe_allow_html=True)
st.write("")
st.markdown("""

The DateMatcher and MultiDateMatcher annotators can also handle relative dates such as "tomorrow," "next week," or "last year." By default these are resolved against the current date, but you can set an explicit reference (or anchor) date that the annotators will use as the base for interpreting relative expressions, which keeps the results reproducible.

""", unsafe_allow_html=True) st.code(''' # Step 1: Transforms raw texts to `document` annotation document_assembler = ( DocumentAssembler() .setInputCol("text") .setOutputCol("document") ) # Step 2: Set anchor day, month and year multi_date_matcher = ( MultiDateMatcher() .setInputCols("document") .setOutputCol("multi_date") .setOutputFormat("MM/dd/yyyy") .setAnchorDateYear(2024) .setAnchorDateMonth(7) .setAnchorDateDay(6) ) nlp_pipeline = Pipeline(stages=[document_assembler, multi_date_matcher]) text_list = ["See you on next monday.", "She was born on 02/03/1966.", "The project started yesterday and will finish next year.", "She will graduate by July 2023.", "She will visit doctor tomorrow and next month again."] # Create a dataframe spark_df = spark.createDataFrame(text_list, StringType()).toDF("text") # Fit the pipeline and get predictions result = nlp_pipeline.fit(spark_df).transform(spark_df) # Display the extracted date information result.selectExpr("text", "multi_date.result as multi_date").show(truncate=False) ''', language='python') st.text(""" +--------------------------------------------------------+------------------------+ |text |multi_date | +--------------------------------------------------------+------------------------+ |See you on next monday. |[07/08/2024] | |She was born on 02/03/1966. |[02/03/1966] | |The project started yesterday and will finish next year.|[07/06/2025, 07/05/2024]| |She will graduate by July 2023. |[07/01/2023] | |She will visit doctor tomorrow and next month again. |[08/06/2024, 07/07/2024]| +--------------------------------------------------------+------------------------+ """) st.markdown("""

This code snippet shows how to handle relative dates by setting an anchor date for the MultiDateMatcher annotator. The anchor date helps in converting relative date references to absolute dates.

""", unsafe_allow_html=True) st.markdown("""

st.markdown("""
## Conclusion

In this app, we demonstrated how to use Spark NLP's DateMatcher and MultiDateMatcher annotators to extract dates from text data. These annotators let you process large datasets efficiently and capture single or multiple date expressions, including relative dates resolved against an anchor date. Integrating them into your NLP pipelines makes it easy to pull temporal information out of unstructured text for a wide range of applications.

""", unsafe_allow_html=True) # References and Additional Information st.markdown('
References
', unsafe_allow_html=True) st.markdown("""
""", unsafe_allow_html=True) st.markdown('
Community & Support
', unsafe_allow_html=True) st.markdown("""
""", unsafe_allow_html=True)