abdullahmubeen10
committed on
Commit e8a1598 • 1 Parent(s): de6149d
Update pages/Workflow & Model Overview.py
- pages/Workflow & Model Overview.py +172 -172
pages/Workflow & Model Overview.py
CHANGED
@@ -1,173 +1,173 @@
|
|
1 |
-
import streamlit as st
|
2 |
-
|
3 |
-
# Custom CSS for better styling
|
4 |
-
st.markdown("""
|
5 |
-
<style>
|
6 |
-
.main-title {
|
7 |
-
font-size: 36px;
|
8 |
-
color: #4A90E2;
|
9 |
-
font-weight: bold;
|
10 |
-
text-align: center;
|
11 |
-
}
|
12 |
-
.sub-title {
|
13 |
-
font-size: 24px;
|
14 |
-
color: #333333;
|
15 |
-
margin-top: 20px;
|
16 |
-
}
|
17 |
-
.section {
|
18 |
-
background-color: #f9f9f9;
|
19 |
-
padding: 15px;
|
20 |
-
border-radius: 10px;
|
21 |
-
margin-top: 20px;
|
22 |
-
}
|
23 |
-
.section h2 {
|
24 |
-
font-size: 22px;
|
25 |
-
color: #4A90E2;
|
26 |
-
}
|
27 |
-
.section p, .section ul {
|
28 |
-
color: #666666;
|
29 |
-
}
|
30 |
-
.link {
|
31 |
-
color: #4A90E2;
|
32 |
-
text-decoration: none;
|
33 |
-
}
|
34 |
-
</style>
|
35 |
-
""", unsafe_allow_html=True)
|
36 |
-
|
37 |
-
# Introduction
|
38 |
-
st.markdown('<div class="main-title">State-of-the-Art Text Summarization with Spark NLP</div>', unsafe_allow_html=True)
|
39 |
-
|
40 |
-
st.markdown("""
|
41 |
-
<div class="section">
|
42 |
-
<p>Welcome to the Spark NLP Demos App! In the rapidly evolving field of Natural Language Processing (NLP), the combination of powerful models and scalable frameworks is crucial. Text summarization is one resource-intensive task that benefits from running machine learning models efficiently on distributed systems like Apache Spark.</p>
|
43 |
-
<p>Spark NLP stands out as the leading choice for enterprises building NLP solutions. This open-source library, built in Scala with a Python wrapper, offers state-of-the-art machine learning models within an easy-to-use pipeline design compatible with Spark ML.</p>
|
44 |
-
</div>
|
45 |
-
""", unsafe_allow_html=True)
|
46 |
-
|
47 |
-
# About the T5 Model
|
48 |
-
st.markdown('<div class="sub-title">About the T5 Model</div>', unsafe_allow_html=True)
|
49 |
-
st.markdown("""
|
50 |
-
<div class="section">
|
51 |
-
<p>A standout model for text summarization is the Text-to-Text Transfer Transformer (T5), introduced by Google researchers in 2019. T5 casts every NLP task as a text-to-text problem, so a single model can perform multiple tasks selected by a simple prefix. For text summarization, the input text is prefixed with "summarize:".</p>
|
52 |
-
<p>In Spark NLP, the T5 model is available through the T5Transformer annotator. We'll show you how to use Spark NLP in Python to perform text summarization using the T5 model.</p>
|
53 |
-
</div>
|
54 |
-
""", unsafe_allow_html=True)
|
55 |
-
|
56 |
-
st.image('https://www.johnsnowlabs.com/wp-content/uploads/2023/09/img_blog_2.jpg', caption='Diagram of the T5 model, from the original paper', use_column_width='auto')
|
57 |
-
|
58 |
-
# How to Use the Model
|
59 |
-
st.markdown('<div class="sub-title">How to Use the T5 Model with Spark NLP</div>', unsafe_allow_html=True)
|
60 |
-
st.markdown("""
|
61 |
-
<div class="section">
|
62 |
-
<p>To use the T5Transformer annotator in Spark NLP to perform text summarization, we need to create a pipeline with two stages: the first transforms the input text into an annotation object, and the second stage contains the T5 model.</p>
|
63 |
-
</div>
|
64 |
-
""", unsafe_allow_html=True)
|
65 |
-
|
66 |
-
st.markdown('
|
67 |
-
st.code('!pip install spark-nlp', language='python')
|
68 |
-
|
69 |
-
st.markdown('
|
70 |
-
st.code("""
|
71 |
-
import sparknlp
|
72 |
-
from sparknlp.base import DocumentAssembler, PipelineModel
|
73 |
-
from sparknlp.annotator import T5Transformer
|
74 |
-
|
75 |
-
# Start the Spark Session
|
76 |
-
spark = sparknlp.start()
|
77 |
-
""", language='python')
|
78 |
-
|
79 |
-
st.markdown("""
|
80 |
-
<div class="section">
|
81 |
-
<p>Now we can define the pipeline to use the T5 model. We'll use the PipelineModel object since we are using the pretrained model and don’t need to train any stage of the pipeline.</p>
|
82 |
-
</div>
|
83 |
-
""", unsafe_allow_html=True)
|
84 |
-
|
85 |
-
st.markdown('
|
86 |
-
st.code("""
|
87 |
-
# Transforms raw texts into `document` annotation
|
88 |
-
document_assembler = (
|
89 |
-
DocumentAssembler().setInputCol("text").setOutputCol("documents")
|
90 |
-
)
|
91 |
-
# The T5 model
|
92 |
-
t5 = (
|
93 |
-
T5Transformer.pretrained("t5_small")
|
94 |
-
.setTask("summarize:")
|
95 |
-
.setInputCols(["documents"])
|
96 |
-
.setMaxOutputLength(200)
|
97 |
-
.setOutputCol("t5")
|
98 |
-
)
|
99 |
-
# Define the Spark pipeline
|
100 |
-
pipeline = PipelineModel(stages = [document_assembler, t5])
|
101 |
-
""", language='python')
|
102 |
-
|
103 |
-
st.markdown("""
|
104 |
-
<div class="section">
|
105 |
-
<p>To use the model, create a Spark DataFrame containing the input data. In this example, we'll work with a single text, but the framework can process many texts in parallel. The DocumentAssembler reads its input from the column named “text”, so the DataFrame must contain a column with that name.</p>
|
106 |
-
</div>
|
107 |
-
""", unsafe_allow_html=True)
|
108 |
-
|
109 |
-
st.markdown('
|
110 |
-
st.code("""
|
111 |
-
example = \"""
|
112 |
-
Transfer learning, where a model is first pre-trained on a data-rich task
|
113 |
-
before being fine-tuned on a downstream task, has emerged as a powerful
|
114 |
-
technique in natural language processing (NLP). The effectiveness of transfer
|
115 |
-
learning has given rise to a diversity of approaches, methodology, and
|
116 |
-
practice. In this paper, we explore the landscape of transfer learning
|
117 |
-
techniques for NLP by introducing a unified framework that converts all
|
118 |
-
text-based language problems into a text-to-text format.
|
119 |
-
Our systematic study compares pre-training objectives, architectures,
|
120 |
-
unlabeled data sets, transfer approaches, and other factors on dozens of
|
121 |
-
language understanding tasks. By combining the insights from our exploration
|
122 |
-
with scale and our new Colossal Clean Crawled Corpus, we achieve
|
123 |
-
state-of-the-art results on many benchmarks covering summarization,
|
124 |
-
question answering, text classification, and more. To facilitate future
|
125 |
-
work on transfer learning for NLP, we release our data set, pre-trained
|
126 |
-
models, and code.
|
127 |
-
\"""
|
128 |
-
|
129 |
-
spark_df = spark.createDataFrame([[example]]).toDF("text")
|
130 |
-
""", language='python')
|
131 |
-
|
132 |
-
st.markdown('
|
133 |
-
st.code("""
|
134 |
-
result = pipeline.transform(spark_df)
|
135 |
-
result.select("t5.result").show(truncate=False)
|
136 |
-
""", language='python')
|
137 |
-
|
138 |
-
st.markdown('<div class="sub-title">Output</div>', unsafe_allow_html=True)
|
139 |
-
st.markdown("""
|
140 |
-
<div class="section">
|
141 |
-
<p>The summarization output will look something like this:</p>
|
142 |
-
<pre>transfer learning has emerged as a powerful technique in natural language processing (NLP) the effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practice.</pre>
|
143 |
-
<p>Note: We set the maximum output length to 200. Adjust this parameter to suit the length of the original text.</p>
|
144 |
-
</div>
|
145 |
-
""", unsafe_allow_html=True)
|
146 |
-
|
147 |
-
# Additional Resources and References
|
148 |
-
st.markdown('<div class="sub-title">Additional Resources and References</div>', unsafe_allow_html=True)
|
149 |
-
st.markdown("""
|
150 |
-
<div class="section">
|
151 |
-
<ul>
|
152 |
-
<li><a class="link" href="https://sparknlp.org/docs/en/transformers#t5transformer" target="_blank">T5Transformer documentation page</a></li>
|
153 |
-
<li><a class="link" href="https://arxiv.org/abs/1910.10683" target="_blank">T5 paper</a></li>
|
154 |
-
<li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started with Spark NLP</a></li>
|
155 |
-
<li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>
|
156 |
-
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>
|
157 |
-
<li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>
|
158 |
-
</ul>
|
159 |
-
</div>
|
160 |
-
""", unsafe_allow_html=True)
|
161 |
-
|
162 |
-
st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
|
163 |
-
st.markdown("""
|
164 |
-
<div class="section">
|
165 |
-
<ul>
|
166 |
-
<li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
|
167 |
-
<li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
|
168 |
-
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
|
169 |
-
<li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
|
170 |
-
<li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
|
171 |
-
</ul>
|
172 |
-
</div>
|
173 |
""", unsafe_allow_html=True)
|
|
|
1 |
+
import streamlit as st
|
2 |
+
|
3 |
+
# Custom CSS for better styling
|
4 |
+
st.markdown("""
|
5 |
+
<style>
|
6 |
+
.main-title {
|
7 |
+
font-size: 36px;
|
8 |
+
color: #4A90E2;
|
9 |
+
font-weight: bold;
|
10 |
+
text-align: center;
|
11 |
+
}
|
12 |
+
.sub-title {
|
13 |
+
font-size: 24px;
|
14 |
+
color: #333333;
|
15 |
+
margin-top: 20px;
|
16 |
+
}
|
17 |
+
.section {
|
18 |
+
background-color: #f9f9f9;
|
19 |
+
padding: 15px;
|
20 |
+
border-radius: 10px;
|
21 |
+
margin-top: 20px;
|
22 |
+
}
|
23 |
+
.section h2 {
|
24 |
+
font-size: 22px;
|
25 |
+
color: #4A90E2;
|
26 |
+
}
|
27 |
+
.section p, .section ul {
|
28 |
+
color: #666666;
|
29 |
+
}
|
30 |
+
.link {
|
31 |
+
color: #4A90E2;
|
32 |
+
text-decoration: none;
|
33 |
+
}
|
34 |
+
</style>
|
35 |
+
""", unsafe_allow_html=True)
|
36 |
+
|
37 |
+
# Introduction
|
38 |
+
st.markdown('<div class="main-title">State-of-the-Art Text Summarization with Spark NLP</div>', unsafe_allow_html=True)
|
39 |
+
|
40 |
+
st.markdown("""
|
41 |
+
<div class="section">
|
42 |
+
<p>Welcome to the Spark NLP Demos App! In the rapidly evolving field of Natural Language Processing (NLP), the combination of powerful models and scalable frameworks is crucial. Text summarization is one resource-intensive task that benefits from running machine learning models efficiently on distributed systems like Apache Spark.</p>
|
43 |
+
<p>Spark NLP stands out as the leading choice for enterprises building NLP solutions. This open-source library, built in Scala with a Python wrapper, offers state-of-the-art machine learning models within an easy-to-use pipeline design compatible with Spark ML.</p>
|
44 |
+
</div>
|
45 |
+
""", unsafe_allow_html=True)
|
46 |
+
|
47 |
+
# About the T5 Model
|
48 |
+
st.markdown('<div class="sub-title">About the T5 Model</div>', unsafe_allow_html=True)
|
49 |
+
st.markdown("""
|
50 |
+
<div class="section">
|
51 |
+
<p>A standout model for text summarization is the Text-to-Text Transfer Transformer (T5), introduced by Google researchers in 2019. T5 casts every NLP task as a text-to-text problem, so a single model can perform multiple tasks selected by a simple prefix. For text summarization, the input text is prefixed with "summarize:".</p>
|
52 |
+
<p>In Spark NLP, the T5 model is available through the T5Transformer annotator. We'll show you how to use Spark NLP in Python to perform text summarization using the T5 model.</p>
|
53 |
+
</div>
|
54 |
+
""", unsafe_allow_html=True)
|
55 |
+
|
56 |
+
st.image('https://www.johnsnowlabs.com/wp-content/uploads/2023/09/img_blog_2.jpg', caption='Diagram of the T5 model, from the original paper', use_column_width='auto')
|
57 |
+
|
58 |
+
# How to Use the Model
|
59 |
+
st.markdown('<div class="sub-title">How to Use the T5 Model with Spark NLP</div>', unsafe_allow_html=True)
|
60 |
+
st.markdown("""
|
61 |
+
<div class="section">
|
62 |
+
<p>To use the T5Transformer annotator in Spark NLP to perform text summarization, we need to create a pipeline with two stages: the first transforms the input text into an annotation object, and the second stage contains the T5 model.</p>
|
63 |
+
</div>
|
64 |
+
""", unsafe_allow_html=True)
|
65 |
+
|
66 |
+
st.markdown('<div class="sub-title">Installation</div>', unsafe_allow_html=True)
|
67 |
+
st.code('!pip install spark-nlp', language='python')
|
68 |
+
|
69 |
+
st.markdown('<div class="sub-title">Import Libraries and Start Spark Session</div>', unsafe_allow_html=True)
|
70 |
+
st.code("""
|
71 |
+
import sparknlp
|
72 |
+
from sparknlp.base import DocumentAssembler, PipelineModel
|
73 |
+
from sparknlp.annotator import T5Transformer
|
74 |
+
|
75 |
+
# Start the Spark Session
|
76 |
+
spark = sparknlp.start()
|
77 |
+
""", language='python')
|
78 |
+
|
79 |
+
st.markdown("""
|
80 |
+
<div class="section">
|
81 |
+
<p>Now we can define the pipeline to use the T5 model. We'll use the PipelineModel object since we are using the pretrained model and don’t need to train any stage of the pipeline.</p>
|
82 |
+
</div>
|
83 |
+
""", unsafe_allow_html=True)
|
84 |
+
|
85 |
+
st.markdown('<div class="sub-title">Define the Pipeline</div>', unsafe_allow_html=True)
|
86 |
+
st.code("""
|
87 |
+
# Transforms raw texts into `document` annotation
|
88 |
+
document_assembler = (
|
89 |
+
DocumentAssembler().setInputCol("text").setOutputCol("documents")
|
90 |
+
)
|
91 |
+
# The T5 model
|
92 |
+
t5 = (
|
93 |
+
T5Transformer.pretrained("t5_small")
|
94 |
+
.setTask("summarize:")
|
95 |
+
.setInputCols(["documents"])
|
96 |
+
.setMaxOutputLength(200)
|
97 |
+
.setOutputCol("t5")
|
98 |
+
)
|
99 |
+
# Define the Spark pipeline
|
100 |
+
pipeline = PipelineModel(stages = [document_assembler, t5])
|
101 |
+
""", language='python')
|
102 |
+
|
103 |
+
st.markdown("""
|
104 |
+
<div class="section">
|
105 |
+
<p>To use the model, create a Spark DataFrame containing the input data. In this example, we'll work with a single text, but the framework can process many texts in parallel. The DocumentAssembler reads its input from the column named “text”, so the DataFrame must contain a column with that name.</p>
|
106 |
+
</div>
|
107 |
+
""", unsafe_allow_html=True)
|
108 |
+
|
109 |
+
st.markdown('<div class="sub-title">Create Example DataFrame</div>', unsafe_allow_html=True)
|
110 |
+
st.code("""
|
111 |
+
example = \"""
|
112 |
+
Transfer learning, where a model is first pre-trained on a data-rich task
|
113 |
+
before being fine-tuned on a downstream task, has emerged as a powerful
|
114 |
+
technique in natural language processing (NLP). The effectiveness of transfer
|
115 |
+
learning has given rise to a diversity of approaches, methodology, and
|
116 |
+
practice. In this paper, we explore the landscape of transfer learning
|
117 |
+
techniques for NLP by introducing a unified framework that converts all
|
118 |
+
text-based language problems into a text-to-text format.
|
119 |
+
Our systematic study compares pre-training objectives, architectures,
|
120 |
+
unlabeled data sets, transfer approaches, and other factors on dozens of
|
121 |
+
language understanding tasks. By combining the insights from our exploration
|
122 |
+
with scale and our new Colossal Clean Crawled Corpus, we achieve
|
123 |
+
state-of-the-art results on many benchmarks covering summarization,
|
124 |
+
question answering, text classification, and more. To facilitate future
|
125 |
+
work on transfer learning for NLP, we release our data set, pre-trained
|
126 |
+
models, and code.
|
127 |
+
\"""
|
128 |
+
|
129 |
+
spark_df = spark.createDataFrame([[example]]).toDF("text")
|
130 |
+
""", language='python')
|
131 |
+
|
132 |
+
st.markdown('<div class="sub-title">Apply the Pipeline</div>', unsafe_allow_html=True)
|
133 |
+
st.code("""
|
134 |
+
result = pipeline.transform(spark_df)
|
135 |
+
result.select("t5.result").show(truncate=False)
|
136 |
+
""", language='python')
|
137 |
+
|
138 |
+
st.markdown('<div class="sub-title">Output</div>', unsafe_allow_html=True)
|
139 |
+
st.markdown("""
|
140 |
+
<div class="section">
|
141 |
+
<p>The summarization output will look something like this:</p>
|
142 |
+
<pre>transfer learning has emerged as a powerful technique in natural language processing (NLP) the effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practice.</pre>
|
143 |
+
<p>Note: We set the maximum output length to 200. Adjust this parameter to suit the length of the original text.</p>
|
144 |
+
</div>
|
145 |
+
""", unsafe_allow_html=True)
|
146 |
+
|
147 |
+
# Additional Resources and References
|
148 |
+
st.markdown('<div class="sub-title">Additional Resources and References</div>', unsafe_allow_html=True)
|
149 |
+
st.markdown("""
|
150 |
+
<div class="section">
|
151 |
+
<ul>
|
152 |
+
<li><a class="link" href="https://sparknlp.org/docs/en/transformers#t5transformer" target="_blank">T5Transformer documentation page</a></li>
|
153 |
+
<li><a class="link" href="https://arxiv.org/abs/1910.10683" target="_blank">T5 paper</a></li>
|
154 |
+
<li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started with Spark NLP</a></li>
|
155 |
+
<li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>
|
156 |
+
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>
|
157 |
+
<li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>
|
158 |
+
</ul>
|
159 |
+
</div>
|
160 |
+
""", unsafe_allow_html=True)
|
161 |
+
|
162 |
+
st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
|
163 |
+
st.markdown("""
|
164 |
+
<div class="section">
|
165 |
+
<ul>
|
166 |
+
<li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
|
167 |
+
<li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
|
168 |
+
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
|
169 |
+
<li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
|
170 |
+
<li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
|
171 |
+
</ul>
|
172 |
+
</div>
|
173 |
""", unsafe_allow_html=True)
|
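A note on the sub-title headings this commit touches (Installation, Import Libraries and Start Spark Session, Define the Pipeline, Create Example DataFrame, Apply the Pipeline): `st.markdown` escapes raw HTML unless `unsafe_allow_html=True` is passed, so each heading call needs that flag for the `sub-title` styling to apply. A small helper could keep the flag from being forgotten. The sketch below is one possible shape, not code from the repository; the names `sub_title_html` and `render_sub_title` are hypothetical.

```python
import html


def sub_title_html(title: str) -> str:
    """Build the sub-title <div> markup used on this page.

    The title is HTML-escaped so characters such as "&" render literally.
    (Hypothetical helper, not part of the committed file.)
    """
    return f'<div class="sub-title">{html.escape(title)}</div>'


def render_sub_title(title: str) -> None:
    """Render a styled sub-title in the Streamlit app.

    Streamlit is imported lazily so the pure string helper above can be
    used and tested without Streamlit installed.
    """
    import streamlit as st

    # unsafe_allow_html=True is required; otherwise Streamlit shows the
    # <div> tag as literal text instead of applying the CSS class.
    st.markdown(sub_title_html(title), unsafe_allow_html=True)
```

With such a helper, a heading call would read `render_sub_title("Define the Pipeline")`, and the flag lives in exactly one place.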