import streamlit as st
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import DateMatcher, MultiDateMatcher
from pyspark.sql.types import StringType
import pyspark.sql.functions as F
import sparknlp

# Custom CSS for better styling
st.markdown("""
    <style>
        .main-title {
            font-size: 36px;
            color: #4A90E2;
            font-weight: bold;
            text-align: center;
        }
        .sub-title {
            font-size: 24px;
            color: #4A90E2;
            margin-top: 20px;
        }
        .section {
            background-color: #f9f9f9;
            padding: 15px;
            border-radius: 10px;
            margin-top: 20px;
        }
        .section h2 {
            font-size: 22px;
            color: #4A90E2;
        }
        .section p, .section ul {
            color: #666666;
        }
        .link {
            color: #4A90E2;
            text-decoration: none;
        }
    </style>
""", unsafe_allow_html=True)

# Introduction
st.markdown('<div class="main-title">Date Extraction with Spark NLP</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <p>Welcome to the Spark NLP Date Extraction Demo App! Date extraction is a Natural Language Processing (NLP) task that identifies and extracts references to dates in text. It is useful in applications such as event scheduling, social media monitoring, and financial forecasting.</p>
    <p>This app demonstrates how to use Spark NLP's DateMatcher and MultiDateMatcher annotators to extract dates from text data.</p>
</div>
""", unsafe_allow_html=True)

st.image('images/Extracting-Exact-Dates.jpg', use_column_width='auto')

# About Date Extraction
st.markdown('<div class="sub-title">About Date Extraction</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>Dates can be extracted from text with techniques such as regular expressions, Named Entity Recognition (NER), and rule-based systems.</p>
    <p>Spark NLP provides the DateMatcher and MultiDateMatcher annotators, which use pattern matching to extract date expressions from text.</p>
</div>
""", unsafe_allow_html=True)

# Using DateMatcher in Spark NLP
st.markdown('<div class="sub-title">Using DateMatcher in Spark NLP</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>The DateMatcher annotator extracts a single date from each input text and writes it in a configurable output format. It recognizes dates written in various formats.</p>
    <p>The DateMatcher annotator in Spark NLP offers:</p>
    <ul>
        <li>Flexible date pattern matching</li>
        <li>Extraction of single date occurrences</li>
        <li>Efficient processing of large text datasets</li>
        <li>Integration with other Spark NLP components for comprehensive NLP pipelines</li>
    </ul>
</div>
""", unsafe_allow_html=True)

st.markdown('<h2 class="sub-title">Example Usage in Python</h2>', unsafe_allow_html=True)
st.markdown('<p>Here’s how you can implement DateMatcher and MultiDateMatcher annotators in Spark NLP:</p>', unsafe_allow_html=True)

# Setup Instructions
st.markdown('<div class="sub-title">Setup</div>', unsafe_allow_html=True)
st.markdown('<p>To install Spark NLP in Python, use your favorite package manager (conda, pip, etc.). For example:</p>', unsafe_allow_html=True)
st.code("""
pip install spark-nlp
pip install pyspark
""", language="bash")

st.markdown("<p>Then, import Spark NLP and start a Spark session:</p>", unsafe_allow_html=True)
st.code("""
import sparknlp

# Start Spark Session
spark = sparknlp.start()
""", language='python')

# Single Date Extraction Example
st.markdown('<div class="sub-title">Example Usage: Single Date Extraction with DateMatcher</div>', unsafe_allow_html=True)
st.code('''
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import DateMatcher
from pyspark.sql.types import StringType

# Step 1: Transform raw text into a `document` annotation
document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

# Step 2: Extract a single date from each text
date_matcher = (
    DateMatcher()
    .setInputCols("document")
    .setOutputCol("date")
    .setOutputFormat("yyyy/MM/dd")
)

nlp_pipeline = Pipeline(stages=[document_assembler, date_matcher])

text_list = ["See you on next monday.",
             "She was born on 02/03/1966.",
             "The project started yesterday and will finish next year.",
             "She will graduate by July 2023.",
             "She will visit doctor tomorrow and next month again."]

# Create a dataframe
spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

# Fit the pipeline and get predictions
result = nlp_pipeline.fit(spark_df).transform(spark_df)

# Display the extracted date information
result.selectExpr("text", "date.result as date").show(truncate=False)
''', language='python')

st.text("""
+--------------------------------------------------------+------------+
|text                                                    |date        |
+--------------------------------------------------------+------------+
|See you on next monday.                                 |[2024/07/08]|
|She was born on 02/03/1966.                             |[1966/02/03]|
|The project started yesterday and will finish next year.|[2025/07/06]|
|She will graduate by July 2023.                         |[2023/07/01]|
|She will visit doctor tomorrow and next month again.    |[2024/08/06]|
+--------------------------------------------------------+------------+
""")

st.markdown("""
<p>The snippet above sets up a Spark NLP pipeline that extracts a single date from each text with the DateMatcher annotator. The <code>date</code> column of the resulting DataFrame holds the extracted date, formatted as <code>yyyy/MM/dd</code>. When a text mentions several dates, only one of them is returned.</p>
""", unsafe_allow_html=True)

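# Post-processing the extracted strings in plain Python
st.markdown("""
Once the results are collected to the driver, the extracted strings can be turned into Python `date` objects. The following is a minimal sketch that uses only the standard library. The `extracted` list is copied by hand from the table above, and `to_date` is a hypothetical helper, not part of Spark NLP.

```python
from datetime import date, datetime

# Assumption: these strings mirror the `date.result` values shown above,
# produced with setOutputFormat("yyyy/MM/dd").
extracted = ["2024/07/08", "1966/02/03", "2023/07/01"]


def to_date(value: str) -> date:
    # "%Y/%m/%d" is the strptime counterpart of the "yyyy/MM/dd" pattern.
    return datetime.strptime(value, "%Y/%m/%d").date()


parsed = [to_date(value) for value in extracted]
print(parsed[0].isoformat())  # 2024-07-08
print(min(parsed).year)       # 1966
```

Parsing into `date` objects makes comparisons and date arithmetic straightforward, which plain strings do not.
""")
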
# Using MultiDateMatcher in Spark NLP
st.markdown('<div class="sub-title">Using MultiDateMatcher in Spark NLP</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>The MultiDateMatcher annotator extends DateMatcher by extracting every date it finds in a text instead of only one. This is useful when a text contains several dates.</p>
    <p>The MultiDateMatcher annotator in Spark NLP offers:</p>
    <ul>
        <li>Flexible date pattern matching</li>
        <li>Extraction of multiple date occurrences</li>
        <li>Efficient processing of large text datasets</li>
        <li>Integration with other Spark NLP components for comprehensive NLP pipelines</li>
    </ul>
</div>
""", unsafe_allow_html=True)

# Multi Date Extraction Example
st.markdown('<div class="sub-title">Example Usage: Multiple Date Extraction with MultiDateMatcher</div>', unsafe_allow_html=True)
st.code('''
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import MultiDateMatcher
from pyspark.sql.types import StringType

# Step 1: Transform raw text into a `document` annotation
document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

# Step 2: Extract every date found in each text
multi_date_matcher = (
    MultiDateMatcher()
    .setInputCols("document")
    .setOutputCol("multi_date")
    .setOutputFormat("MM/dd/yy")
)

nlp_pipeline = Pipeline(stages=[document_assembler, multi_date_matcher])

text_list = ["See you on next monday.",
             "She was born on 02/03/1966.",
             "The project started yesterday and will finish next year.",
             "She will graduate by July 2023.",
             "She will visit doctor tomorrow and next month again."]

# Create a dataframe
spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

# Fit the pipeline and get predictions
result = nlp_pipeline.fit(spark_df).transform(spark_df)

# Display the extracted date information
result.selectExpr("text", "multi_date.result as multi_date").show(truncate=False)
''', language='python')

st.text("""
+--------------------------------------------------------+--------------------+
|text                                                    |multi_date          |
+--------------------------------------------------------+--------------------+
|See you on next monday.                                 |[07/08/24]          |
|She was born on 02/03/1966.                             |[02/03/66]          |
|The project started yesterday and will finish next year.|[07/06/25, 07/05/24]|
|She will graduate by July 2023.                         |[07/01/23]          |
|She will visit doctor tomorrow and next month again.    |[08/06/24, 07/07/24]|
+--------------------------------------------------------+--------------------+
""")

st.markdown("""
<p>The snippet above sets up a Spark NLP pipeline that extracts multiple dates from each text with the MultiDateMatcher annotator. The <code>multi_date</code> column of the resulting DataFrame holds every extracted date, formatted as <code>MM/dd/yy</code>. As the output shows, the dates are not necessarily listed in chronological order.</p>
""", unsafe_allow_html=True)

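# Sorting extracted dates and the two-digit year pitfall
st.markdown("""
Because the extracted dates are plain strings, ordering them requires parsing first. The sketch below uses only the Python standard library. The values are copied by hand from the table above, and the variable names are illustrative. It also shows why a two-digit year pattern such as `MM/dd/yy` can be lossy when the strings are parsed back.

```python
from datetime import datetime

# Assumption: values copied from the `multi_date.result` column above ("MM/dd/yy").
row = ["07/06/25", "07/05/24"]

# Sort chronologically; "%m/%d/%y" mirrors the "MM/dd/yy" pattern.
ordered = sorted(datetime.strptime(value, "%m/%d/%y").date() for value in row)
print([d.isoformat() for d in ordered])  # ['2024-07-05', '2025-07-06']

# Two-digit years are ambiguous: Python maps 00-68 to 2000-2068,
# so the 1966 birth date comes back as 2066.
birth = datetime.strptime("02/03/66", "%m/%d/%y").date()
print(birth.year)  # 2066
```

If the century matters downstream, a four-digit pattern such as `MM/dd/yyyy` avoids this ambiguity.
""")
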
# Handling Relative Dates
st.markdown('<div class="sub-title">Handling Relative Dates</div>', unsafe_allow_html=True)
st.write("")
st.markdown("""<p>DateMatcher and MultiDateMatcher can also resolve relative dates such as "tomorrow," "next week," or "last year." In the examples above, no anchor was set, and the relative expressions appear to have been resolved against the date on which the pipeline ran. To make the results reproducible, set a reference (or anchor) date, which the annotators use as the base for interpreting relative dates in the text.</p>""", unsafe_allow_html=True)
st.code('''
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import MultiDateMatcher
from pyspark.sql.types import StringType

# Step 1: Transform raw text into a `document` annotation
document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

# Step 2: Set the anchor year, month and day (here: 6 July 2024)
multi_date_matcher = (
    MultiDateMatcher()
    .setInputCols("document")
    .setOutputCol("multi_date")
    .setOutputFormat("MM/dd/yyyy")
    .setAnchorDateYear(2024)
    .setAnchorDateMonth(7)
    .setAnchorDateDay(6)
)

nlp_pipeline = Pipeline(stages=[document_assembler, multi_date_matcher])

text_list = ["See you on next monday.",
             "She was born on 02/03/1966.",
             "The project started yesterday and will finish next year.",
             "She will graduate by July 2023.",
             "She will visit doctor tomorrow and next month again."]

# Create a dataframe
spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

# Fit the pipeline and get predictions
result = nlp_pipeline.fit(spark_df).transform(spark_df)

# Display the extracted date information
result.selectExpr("text", "multi_date.result as multi_date").show(truncate=False)
''', language='python')

st.text("""
+--------------------------------------------------------+------------------------+
|text                                                    |multi_date              |
+--------------------------------------------------------+------------------------+
|See you on next monday.                                 |[07/08/2024]            |
|She was born on 02/03/1966.                             |[02/03/1966]            |
|The project started yesterday and will finish next year.|[07/06/2025, 07/05/2024]|
|She will graduate by July 2023.                         |[07/01/2023]            |
|She will visit doctor tomorrow and next month again.    |[08/06/2024, 07/07/2024]|
+--------------------------------------------------------+------------------------+
""")

st.markdown("""
<p>This snippet handles relative dates by setting an anchor date of 6 July 2024 on the MultiDateMatcher annotator. The annotator converts each relative reference to an absolute date measured from that anchor, so "yesterday" becomes 07/05/2024 and "next year" becomes 07/06/2025.</p>
""", unsafe_allow_html=True)

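# Illustrating anchor-date arithmetic in plain Python
st.markdown("""
To make the role of the anchor concrete, here is a minimal sketch of the arithmetic in plain Python. It is not Spark NLP's implementation: `resolve` is a hypothetical helper that covers only the expressions used in this demo, and it reproduces the dates in the table above under that assumption.

```python
from datetime import date, timedelta

# Same anchor as the pipeline above: 6 July 2024 (a Saturday).
ANCHOR = date(2024, 7, 6)


def resolve(expression: str, anchor: date) -> date:
    # Hypothetical helper covering only the expressions used in this demo.
    if expression == "yesterday":
        return anchor - timedelta(days=1)
    if expression == "tomorrow":
        return anchor + timedelta(days=1)
    if expression == "next year":
        # Simplification: raises ValueError for 29 February anchors.
        return anchor.replace(year=anchor.year + 1)
    if expression == "next month":
        # Simplification: keeps the day of month, so 31 January would fail.
        carry, month_index = divmod(anchor.month, 12)
        return anchor.replace(year=anchor.year + carry, month=month_index + 1)
    if expression == "next monday":
        # weekday() is 0 for Monday; always move forward by 1 to 7 days.
        days_ahead = (6 - anchor.weekday()) % 7 + 1
        return anchor + timedelta(days=days_ahead)
    raise ValueError("unsupported expression: " + expression)


for phrase in ["yesterday", "tomorrow", "next monday", "next month", "next year"]:
    print(phrase, "->", resolve(phrase, ANCHOR).strftime("%m/%d/%Y"))
```

Changing `ANCHOR` shifts every result, which is why a fixed anchor keeps the output stable across runs.
""")
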
st.markdown("""
<div class="section">
    <h2>Conclusion</h2>
    <p>In this app, we demonstrated how to use Spark NLP's DateMatcher and MultiDateMatcher annotators to extract dates from text data. DateMatcher returns a single date per text, MultiDateMatcher returns every date it finds, and both can resolve relative expressions against an anchor date. Adding these annotators to your NLP pipelines lets you extract temporal information from large volumes of unstructured text.</p>
</div>
""", unsafe_allow_html=True)

# References and Additional Information
st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <ul>
        <li>Documentation: <a href="https://nlp.johnsnowlabs.com/docs/en/annotators#datematcher" target="_blank" rel="noopener">DateMatcher</a>, <a href="https://nlp.johnsnowlabs.com/docs/en/annotators#multidatematcher" target="_blank" rel="noopener">MultiDateMatcher</a></li>
        <li>Python Doc: <a href="https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/matcher/date_matcher/index.html#module-sparknlp.annotator.matcher.date_matcher" target="_blank" rel="noopener">DateMatcher</a>, <a href="https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/matcher/multi_date_matcher/index.html" target="_blank" rel="noopener">MultiDateMatcher</a></li>
        <li>Scala Doc: <a href="https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/DateMatcher.html" target="_blank" rel="noopener">DateMatcher</a>, <a href="https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/MultiDateMatcher.html" target="_blank" rel="noopener">MultiDateMatcher</a></li>
        <li>For extended examples of usage, see the <a href="https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb" target="_blank" rel="noopener nofollow">Spark NLP Workshop</a>.</li>
    </ul>
</div>
""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
    </ul>
</div>
""", unsafe_allow_html=True)