import streamlit as st

# Custom CSS for better styling
st.markdown("""
    <style>
        .main-title {
            font-size: 36px;
            color: #4A90E2;
            font-weight: bold;
            text-align: center;
        }
        .sub-title {
            font-size: 24px;
            color: #4A90E2;
            margin-top: 20px;
        }
        .section {
            background-color: #f9f9f9;
            padding: 15px;
            border-radius: 10px;
            margin-top: 20px;
        }
        .section h2 {
            font-size: 22px;
            color: #4A90E2;
        }
        .section p, .section ul {
            color: #666666;
        }
        .link {
            color: #4A90E2;
            text-decoration: none;
        }
    </style>
""", unsafe_allow_html=True)

# Introduction
st.markdown('<div class="main-title">Coreference Resolution with BERT-based Models in Spark NLP</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <p>Welcome to the Spark NLP Coreference Resolution Demo App! Coreference resolution is the Natural Language Processing (NLP) task of identifying and linking all expressions in a text that refer to the same real-world entity. It supports a wide range of applications, such as text understanding, information extraction, and question answering.</p>
    <p>Spark NLP performs coreference resolution with BERT-based models. This app demonstrates how to use the SpanBertCoref annotator to resolve coreferences in text data.</p>
</div>
""", unsafe_allow_html=True)

st.image('images/Coreference-Resolution.png', use_column_width='auto')

# About Coreference Resolution
st.markdown('<div class="sub-title">About Coreference Resolution</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>Coreference resolution is the task of identifying and linking all expressions within a text that refer to the same real-world entity, such as a person, object, or concept. The text is analyzed to find every mention of an entity, including names, noun phrases, and pronouns such as “he,” “she,” “it,” or “they.” These mentions are then linked together to form a “coreference chain,” representing all the different ways that entity is referred to in the text.</p>
    <p>For example, given the text “John went to the store. He bought some groceries,” a coreference resolution model would identify that “John” and “He” both refer to the same entity and produce a cluster of coreferent mentions.</p>
</div>
""", unsafe_allow_html=True)

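st.markdown("""
<p>As an illustration only, a coreference chain can be written down as a list of character spans. The sketch below is plain Python, not Spark NLP output, and the variable names are made up for this example:</p>

```python
text = "John went to the store. He bought some groceries."

# Mentions that refer to the same entity, in reading order.
mentions = ["John", "He"]

# Represent the chain as (begin, end) character spans, end exclusive.
chain = [(text.index(m), text.index(m) + len(m)) for m in mentions]

for begin, end in chain:
    print(begin, end, text[begin:end])
```
""", unsafe_allow_html=True)
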
# Using SpanBertCoref in Spark NLP
st.markdown('<div class="sub-title">Using SpanBertCoref in Spark NLP</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>The SpanBertCoref annotator in Spark NLP performs coreference resolution with BERT-based models. It identifies and links the expressions in a text that refer to the same entity, so that the mentions of each entity in unstructured text can be followed explicitly.</p>
    <p>The SpanBertCoref annotator in Spark NLP offers:</p>
    <ul>
        <li>Accurate coreference resolution using BERT-based models</li>
        <li>Identification and linking of multiple coreferent expressions</li>
        <li>Efficient processing of large text datasets</li>
        <li>Integration with other Spark NLP components for comprehensive NLP pipelines</li>
    </ul>
</div>
""", unsafe_allow_html=True)

st.markdown('<h2 class="sub-title">Example Usage in Python</h2>', unsafe_allow_html=True)
st.markdown('<p>Here’s how you can implement coreference resolution using the SpanBertCoref annotator in Spark NLP:</p>', unsafe_allow_html=True)

# Setup Instructions
st.markdown('<div class="sub-title">Setup</div>', unsafe_allow_html=True)
st.markdown('<p>To install Spark NLP in Python, use your favorite package manager (conda, pip, etc.). For example:</p>', unsafe_allow_html=True)
st.code("""
pip install spark-nlp
pip install pyspark
""", language="bash")

st.markdown("<p>Then, import Spark NLP and start a Spark session:</p>", unsafe_allow_html=True)
st.code("""
import sparknlp

# Start Spark Session
spark = sparknlp.start()
""", language='python')

# Coreference Resolution Example
st.markdown('<div class="sub-title">Example Usage: Coreference Resolution with SpanBertCoref</div>', unsafe_allow_html=True)
st.code('''
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetector,
    Tokenizer,
    SpanBertCorefModel
)

# Step 1: Convert raw text into document annotations
document = DocumentAssembler() \\
    .setInputCol("text") \\
    .setOutputCol("document")

# Step 2: Sentence detection
sentenceDetector = SentenceDetector() \\
    .setInputCols(["document"]) \\
    .setOutputCol("sentences")

# Step 3: Tokenization
token = Tokenizer() \\
    .setInputCols(["sentences"]) \\
    .setOutputCol("tokens") \\
    .setContextChars(["(", ")", "?", "!", ".", ","])

# Step 4: Coreference resolution with the pretrained SpanBERT model
corefResolution = SpanBertCorefModel.pretrained("spanbert_base_coref") \\
    .setInputCols(["sentences", "tokens"]) \\
    .setOutputCol("corefs") \\
    .setCaseSensitive(False)

# Define the pipeline
pipeline = Pipeline(stages=[document, sentenceDetector, token, corefResolution])

# Create the DataFrame
data = spark.createDataFrame([["Ana is a Graduate Student at UT Dallas. She loves working in Natural Language Processing at the Institute. Her hobbies include blogging, dancing, and singing."]]).toDF("text")

# Fit the pipeline on the DataFrame to get the model
model = pipeline.fit(data)

# Transform the data to get predictions
result = model.transform(data)

# Display the extracted coreferences, one row per mention
result.selectExpr("explode(corefs) AS coref").selectExpr("coref.result as token", "coref.metadata").show(truncate=False)
''', language='python')

st.text("""
+-------------+----------------------------------------------------------------------------------------+
|token        |metadata                                                                                |
+-------------+----------------------------------------------------------------------------------------+
|ana          |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}    |
|she          |{head.sentence -> 0, head -> ana, head.begin -> 0, head.end -> 2, sentence -> 1}        |
|her          |{head.sentence -> 0, head -> ana, head.begin -> 0, head.end -> 2, sentence -> 2}        |
|ut dallas    |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}    |
|the institute|{head.sentence -> 0, head -> ut dallas, head.begin -> 29, head.end -> 37, sentence -> 1}|
+-------------+----------------------------------------------------------------------------------------+
""")

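st.markdown("""
<p>The code snippet sets up a Spark NLP pipeline that resolves coreferences in text data with the SpanBertCoref annotator. Each row of the resulting DataFrame is one mention. In the metadata shown above, <code>head</code> is the mention that the row refers back to (<code>ROOT</code> for the first mention of an entity), <code>head.begin</code> and <code>head.end</code> are the character offsets of that head mention, and <code>sentence</code> and <code>head.sentence</code> are the sentence indices of the mention and its head.</p>
""", unsafe_allow_html=True)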

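st.markdown("""
<p>As a minimal sketch (not part of Spark NLP), the mentions in this output can be grouped into clusters in plain Python. The <code>rows</code> list is copied by hand from the output table, <code>group_mentions</code> is a hypothetical helper, and the grouping assumes every non-root mention points directly at the first mention of its cluster, as it does in this output:</p>

```python
from collections import defaultdict

# (mention, head) pairs copied by hand from the output table above.
rows = [
    ("ana", "ROOT"),
    ("she", "ana"),
    ("her", "ana"),
    ("ut dallas", "ROOT"),
    ("the institute", "ut dallas"),
]


def group_mentions(rows):
    # Hypothetical helper: key each cluster by its first (ROOT) mention.
    # Assumes every other mention points directly at that first mention
    # and that mention texts are distinct enough to serve as keys.
    clusters = defaultdict(list)
    for mention, head in rows:
        key = mention if head == "ROOT" else head
        clusters[key].append(mention)
    return dict(clusters)


clusters = group_mentions(rows)
for head, mentions in clusters.items():
    print(head, "->", mentions)
```
""", unsafe_allow_html=True)
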
# One-liner Alternative
st.markdown('<div class="sub-title">One-liner Alternative</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>In October 2022, John Snow Labs released the open-source <code>johnsnowlabs</code> library, which bundles all of the company's products, open-source and licensed, in one common library. This simplifies the workflow, especially for users working with more than one of the libraries (e.g., Spark NLP + Healthcare NLP). The library is a wrapper around all of John Snow Labs' libraries and can be installed with pip:</p>
    <p><code>pip install johnsnowlabs</code></p>
    <p>To run coreference resolution with one line of code:</p>
</div>
""", unsafe_allow_html=True)
st.code("""
# Import the NLP module, which contains the Spark NLP and NLU libraries
from johnsnowlabs import nlp

sample_text = "Ana is a Graduate Student at UT Dallas. She loves working in Natural Language Processing at the Institute. Her hobbies include blogging, dancing, and singing."

# predict() returns a pandas DataFrame
nlp.load('en.coreference.spanbert').predict(sample_text, output_level='sentence')
""", language='python')

st.image('images/johnsnowlabs-output.png', use_column_width='auto')

st.markdown("""
<p>This approach demonstrates how to use the <code>johnsnowlabs</code> library to perform coreference resolution with a single line of code. The resulting DataFrame contains the coreferent mentions and their metadata.</p>
""", unsafe_allow_html=True)

# Conclusion
st.markdown("""
<div class="section">
    <h2>Conclusion</h2>
    <p>In this app, we demonstrated how to use Spark NLP's SpanBertCoref annotator to resolve coreferences in text data. The annotator processes large datasets efficiently and identifies coreferent mentions. Integrating it into your NLP pipelines lets you track entities across unstructured text, which supports text understanding, information extraction, and question answering.</p>
</div>
""", unsafe_allow_html=True)

# References and Additional Information
st.markdown('<div class="sub-title">For additional information, please check the following references.</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <ul>
        <li>Documentation:&nbsp;<a href="https://nlp.johnsnowlabs.com/docs/en/transformers#spanbertcoref" target="_blank" rel="noopener">SpanBertCoref</a></li>
        <li>Python Docs:&nbsp;<a href="https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/coref/spanbert_coref/index.html#sparknlp.annotator.coref.spanbert_coref.SpanBertCorefModel" target="_blank" rel="noopener">SpanBertCoref</a></li>
        <li>Scala Docs:&nbsp;<a href="https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/coref/SpanBertCorefModel.html" target="_blank" rel="noopener">SpanBertCoref</a></li>
        <li>Academic Reference Paper:&nbsp;SpanBERT: <a href="https://arxiv.org/abs/1907.10529" target="_blank" rel="noopener nofollow">Improving Pre-training by Representing and Predicting Spans</a></li>
        <li>John Snow Labs&nbsp;<a href="https://nlp.johnsnowlabs.com/2022/06/14/spanbert_base_coref_en_3_0.html" target="_blank" rel="noopener">SpanBertCoref Model</a></li>
    </ul>
</div>
""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
    </ul>
</div>
""", unsafe_allow_html=True)