File size: 8,720 Bytes
6b63571
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
import streamlit as st

# Page configuration
st.set_page_config(
    layout="wide", 
    initial_sidebar_state="auto"
)

# Custom CSS for better styling
st.markdown("""

    <style>

        .main-title {

            font-size: 36px;

            color: #4A90E2;

            font-weight: bold;

            text-align: center;

        }

        .sub-title {

            font-size: 24px;

            color: #4A90E2;

            margin-top: 20px;

        }

        .section {

            background-color: #f9f9f9;

            padding: 15px;

            border-radius: 10px;

            margin-top: 20px;

        }

        .section h2 {

            font-size: 22px;

            color: #4A90E2;

        }

        .section p, .section ul {

            color: #666666;

        }

        .link {

            color: #4A90E2;

            text-decoration: none;

        }

        .benchmark-table {

            width: 100%;

            border-collapse: collapse;

            margin-top: 20px;

        }

        .benchmark-table th, .benchmark-table td {

            border: 1px solid #ddd;

            padding: 8px;

            text-align: left;

        }

        .benchmark-table th {

            background-color: #4A90E2;

            color: white;

        }

        .benchmark-table td {

            background-color: #f2f2f2;

        }

    </style>

""", unsafe_allow_html=True)

# Title
st.markdown('<div class="main-title">Introduction to CamemBERT Annotators in Spark NLP</div>', unsafe_allow_html=True)

# Subtitle
st.markdown("""

<div class="section">

    <p>Spark NLP offers a variety of CamemBERT-based annotators tailored for multiple natural language processing tasks. CamemBERT is a robust and versatile model designed specifically for the French language, offering state-of-the-art performance in a range of NLP applications. Below, we provide an overview of the four key CamemBERT annotators:</p>

</div>

""", unsafe_allow_html=True)

st.markdown("""

<div class="section">

    <h2>CamemBERT for Token Classification</h2>

    <p>The <strong>CamemBertForTokenClassification</strong> annotator is designed for Named Entity Recognition (NER) tasks using CamemBERT, a French language model derived from RoBERTa. This model efficiently handles token classification, which involves labeling tokens in a text with tags that correspond to specific entities. CamemBERT offers robust performance in French NLP tasks, making it a valuable tool for real-time applications in this language.</p>

    <p>Token classification with CamemBERT enables:</p>

    <ul>

        <li><strong>Named Entity Recognition (NER):</strong> Identifying and classifying entities such as names, organizations, locations, and other predefined categories.</li>

        <li><strong>Information Extraction:</strong> Extracting key information from unstructured text for further analysis.</li>

        <li><strong>Text Categorization:</strong> Enhancing document retrieval and categorization based on entity recognition.</li>

    </ul>

    <p>Here is an example of how CamemBERT token classification works:</p>

    <table class="benchmark-table">

        <tr>

            <th>Entity</th>

            <th>Label</th>

        </tr>

        <tr>

            <td>Paris</td>

            <td>LOC</td>

        </tr>

        <tr>

            <td>Emmanuel Macron</td>

            <td>PER</td>

        </tr>

        <tr>

            <td>Élysée Palace</td>

            <td>ORG</td>

        </tr>

    </table>

</div>

""", unsafe_allow_html=True)

# CamemBERT Token Classification - French WikiNER
st.markdown('<div class="sub-title">CamemBERT Token Classification - French WikiNER</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <p>The <strong>camembert_base_token_classifier_wikiner</strong> is a fine-tuned CamemBERT model for token classification tasks, specifically adapted for Named Entity Recognition (NER) on the French WikiNER dataset. It is designed to recognize five types of entities: O, LOC, PER, MISC, and ORG.</p>

</div>

""", unsafe_allow_html=True)

# How to Use the Model - Token Classification
st.markdown('<div class="sub-title">How to Use the Model</div>', unsafe_allow_html=True)
st.code('''

from sparknlp.base import *

from sparknlp.annotator import *

from pyspark.ml import Pipeline

from pyspark.sql.functions import col, expr



document_assembler = DocumentAssembler() \\

    .setInputCol('text') \\

    .setOutputCol('document')



tokenizer = Tokenizer() \\

    .setInputCols(['document']) \\

    .setOutputCol('token')



tokenClassifier = CamemBertForTokenClassification \\

    .pretrained('camembert_base_token_classifier_wikiner', 'en') \\

    .setInputCols(['document', 'token']) \\

    .setOutputCol('ner') \\

    .setCaseSensitive(True) \\

    .setMaxSentenceLength(512)



# Convert NER labels to entities

ner_converter = NerConverter() \\

    .setInputCols(['document', 'token', 'ner']) \\

    .setOutputCol('entities')



pipeline = Pipeline(stages=[

    document_assembler,

    tokenizer,

    tokenClassifier,

    ner_converter

])



data = spark.createDataFrame([["""Paris est la capitale de la France et abrite le Président Emmanuel Macron, qui réside au palais de l'Élysée. Apple Inc. a une présence significative dans la ville."""]]).toDF("text")

result = pipeline.fit(data).transform(data)



result.select(

    expr("explode(entities) as ner_chunk")

).select(

    col("ner_chunk.result").alias("chunk"),

    col("ner_chunk.metadata.entity").alias("ner_label")

).show(truncate=False)

''', language='python')

# Results
st.text("""

+------------------+---------+

|chunk             |ner_label|

+------------------+---------+

|Paris             |LOC      |

|France            |LOC      |

|Emmanuel Macron   |PER      |

|Élysée Palace     |ORG      |

|Apple Inc.        |ORG      |

+------------------+---------+

""")

# Performance Metrics
st.markdown('<div class="sub-title">Performance Metrics</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <p>Here are the detailed performance metrics for the CamemBERT token classification model:</p>

    <table class="benchmark-table">

        <tr>

            <th>Entity</th>

            <th>Precision</th>

            <th>Recall</th>

            <th>F1-Score</th>

        </tr>

        <tr>

            <td>LOC</td>

            <td>0.93</td>

            <td>0.94</td>

            <td>0.94</td>

        </tr>

        <tr>

            <td>PER</td>

            <td>0.95</td>

            <td>0.95</td>

            <td>0.95</td>

        </tr>

        <tr>

            <td>ORG</td>

            <td>0.92</td>

            <td>0.91</td>

            <td>0.91</td>

        </tr>

        <tr>

            <td>MISC</td>

            <td>0.86</td>

            <td>0.85</td>

            <td>0.85</td>

        </tr>

        <tr>

            <td>O</td>

            <td>0.99</td>

            <td>0.99</td>

            <td>0.99</td>

        </tr>

        <tr>

            <td>Overall</td>

            <td>0.97</td>

            <td>0.98</td>

            <td>0.98</td>

        </tr>

    </table>

</div>

""", unsafe_allow_html=True)

# Model Information - Token Classification
st.markdown('<div class="sub-title">Model Information</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <ul>

        <li><strong>Model Name:</strong> camembert_base_token_classifier_wikiner</li>

        <li><strong>Compatibility:</strong> Spark NLP 4.2.0+</li>

        <li><strong>License:</strong> Open Source</li>

        <li><strong>Edition:</strong> Official</li>

        <li><strong>Input Labels:</strong> [token, document]</li>

        <li><strong>Output Labels:</strong> [ner]</li>

        <li><strong>Language:</strong> French</li>

        <li><strong>Size:</strong> 412.2 MB</li>

        <li><strong>Case Sensitive:</strong> Yes</li>

        <li><strong>Max Sentence Length:</strong> 512</li>

    </ul>

</div>

""", unsafe_allow_html=True)

# References - Token Classification
st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <ul>

        <li><a class="link" href="https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr" target="_blank" rel="noopener">CamemBERT WikiNER Dataset</a></li>

        <li><a class="link" href="https://sparknlp.org/2022/09/23/camembert_base_token_classifier_wikiner_en.html" target="_blank" rel="noopener">CamemBERT Token Classification on Spark NLP Hub</a></li>

    </ul>

</div>

""", unsafe_allow_html=True)