Spaces:

spark-nlp
/

CamemBERT

Sleeping

File size: 8,720 Bytes

6b63571

import streamlit as st

# Page configuration
st.set_page_config(
    layout="wide", 
    initial_sidebar_state="auto"
)

# Custom CSS for better styling
st.markdown("""

    <style>

        .main-title {

            font-size: 36px;

            color: #4A90E2;

            font-weight: bold;

            text-align: center;

        }

        .sub-title {

            font-size: 24px;

            color: #4A90E2;

            margin-top: 20px;

        }

        .section {

            background-color: #f9f9f9;

            padding: 15px;

            border-radius: 10px;

            margin-top: 20px;

        }

        .section h2 {

            font-size: 22px;

            color: #4A90E2;

        }

        .section p, .section ul {

            color: #666666;

        }

        .link {

            color: #4A90E2;

            text-decoration: none;

        }

        .benchmark-table {

            width: 100%;

            border-collapse: collapse;

            margin-top: 20px;

        }

        .benchmark-table th, .benchmark-table td {

            border: 1px solid #ddd;

            padding: 8px;

            text-align: left;

        }

        .benchmark-table th {

            background-color: #4A90E2;

            color: white;

        }

        .benchmark-table td {

            background-color: #f2f2f2;

        }

    </style>

""", unsafe_allow_html=True)

# Title
st.markdown('<div class="main-title">Introduction to CamemBERT Annotators in Spark NLP</div>', unsafe_allow_html=True)

# Subtitle
st.markdown("""

<div class="section">

    <p>Spark NLP offers a variety of CamemBERT-based annotators tailored for multiple natural language processing tasks. CamemBERT is a robust and versatile model designed specifically for the French language, offering state-of-the-art performance in a range of NLP applications. Below, we provide an overview of the four key CamemBERT annotators:</p>

</div>

""", unsafe_allow_html=True)

st.markdown("""

<div class="section">

    <h2>CamemBERT for Token Classification</h2>

    <p>The <strong>CamemBertForTokenClassification</strong> annotator is designed for Named Entity Recognition (NER) tasks using CamemBERT, a French language model derived from RoBERTa. This model efficiently handles token classification, which involves labeling tokens in a text with tags that correspond to specific entities. CamemBERT offers robust performance in French NLP tasks, making it a valuable tool for real-time applications in this language.</p>

    <p>Token classification with CamemBERT enables:</p>

    <ul>

        <li><strong>Named Entity Recognition (NER):</strong> Identifying and classifying entities such as names, organizations, locations, and other predefined categories.</li>

        <li><strong>Information Extraction:</strong> Extracting key information from unstructured text for further analysis.</li>

        <li><strong>Text Categorization:</strong> Enhancing document retrieval and categorization based on entity recognition.</li>

    </ul>

    <p>Here is an example of how CamemBERT token classification works:</p>

    <table class="benchmark-table">

        <tr>

            <th>Entity</th>

            <th>Label</th>

        </tr>

        <tr>

            <td>Paris</td>

            <td>LOC</td>

        </tr>

        <tr>

            <td>Emmanuel Macron</td>

            <td>PER</td>

        </tr>

        <tr>

            <td>Élysée Palace</td>

            <td>ORG</td>

        </tr>

    </table>

</div>

""", unsafe_allow_html=True)

# CamemBERT Token Classification - French WikiNER
st.markdown('<div class="sub-title">CamemBERT Token Classification - French WikiNER</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <p>The <strong>camembert_base_token_classifier_wikiner</strong> is a fine-tuned CamemBERT model for token classification tasks, specifically adapted for Named Entity Recognition (NER) on the French WikiNER dataset. It is designed to recognize five types of entities: O, LOC, PER, MISC, and ORG.</p>

</div>

""", unsafe_allow_html=True)

# How to Use the Model - Token Classification
st.markdown('<div class="sub-title">How to Use the Model</div>', unsafe_allow_html=True)
st.code('''

from sparknlp.base import *

from sparknlp.annotator import *

from pyspark.ml import Pipeline

from pyspark.sql.functions import col, expr



document_assembler = DocumentAssembler() \\

    .setInputCol('text') \\

    .setOutputCol('document')



tokenizer = Tokenizer() \\

    .setInputCols(['document']) \\

    .setOutputCol('token')



tokenClassifier = CamemBertForTokenClassification \\

    .pretrained('camembert_base_token_classifier_wikiner', 'en') \\

    .setInputCols(['document', 'token']) \\

    .setOutputCol('ner') \\

    .setCaseSensitive(True) \\

    .setMaxSentenceLength(512)



# Convert NER labels to entities

ner_converter = NerConverter() \\

    .setInputCols(['document', 'token', 'ner']) \\

    .setOutputCol('entities')



pipeline = Pipeline(stages=[

    document_assembler,

    tokenizer,

    tokenClassifier,

    ner_converter

])



data = spark.createDataFrame([["""Paris est la capitale de la France et abrite le Président Emmanuel Macron, qui réside au palais de l'Élysée. Apple Inc. a une présence significative dans la ville."""]]).toDF("text")

result = pipeline.fit(data).transform(data)



result.select(

    expr("explode(entities) as ner_chunk")

).select(

    col("ner_chunk.result").alias("chunk"),

    col("ner_chunk.metadata.entity").alias("ner_label")

).show(truncate=False)

''', language='python')

# Results
st.text("""

+------------------+---------+

|chunk             |ner_label|

+------------------+---------+

|Paris             |LOC      |

|France            |LOC      |

|Emmanuel Macron   |PER      |

|Élysée Palace     |ORG      |

|Apple Inc.        |ORG      |

+------------------+---------+

""")

# Performance Metrics
st.markdown('<div class="sub-title">Performance Metrics</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <p>Here are the detailed performance metrics for the CamemBERT token classification model:</p>

    <table class="benchmark-table">

        <tr>

            <th>Entity</th>

            <th>Precision</th>

            <th>Recall</th>

            <th>F1-Score</th>

        </tr>

        <tr>

            <td>LOC</td>

            <td>0.93</td>

            <td>0.94</td>

            <td>0.94</td>

        </tr>

        <tr>

            <td>PER</td>

            <td>0.95</td>

            <td>0.95</td>

            <td>0.95</td>

        </tr>

        <tr>

            <td>ORG</td>

            <td>0.92</td>

            <td>0.91</td>

            <td>0.91</td>

        </tr>

        <tr>

            <td>MISC</td>

            <td>0.86</td>

            <td>0.85</td>

            <td>0.85</td>

        </tr>

        <tr>

            <td>O</td>

            <td>0.99</td>

            <td>0.99</td>

            <td>0.99</td>

        </tr>

        <tr>

            <td>Overall</td>

            <td>0.97</td>

            <td>0.98</td>

            <td>0.98</td>

        </tr>

    </table>

</div>

""", unsafe_allow_html=True)

# Model Information - Token Classification
st.markdown('<div class="sub-title">Model Information</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <ul>

        <li><strong>Model Name:</strong> camembert_base_token_classifier_wikiner</li>

        <li><strong>Compatibility:</strong> Spark NLP 4.2.0+</li>

        <li><strong>License:</strong> Open Source</li>

        <li><strong>Edition:</strong> Official</li>

        <li><strong>Input Labels:</strong> [token, document]</li>

        <li><strong>Output Labels:</strong> [ner]</li>

        <li><strong>Language:</strong> French</li>

        <li><strong>Size:</strong> 412.2 MB</li>

        <li><strong>Case Sensitive:</strong> Yes</li>

        <li><strong>Max Sentence Length:</strong> 512</li>

    </ul>

</div>

""", unsafe_allow_html=True)

# References - Token Classification
st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <ul>

        <li><a class="link" href="https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr" target="_blank" rel="noopener">CamemBERT WikiNER Dataset</a></li>

        <li><a class="link" href="https://sparknlp.org/2022/09/23/camembert_base_token_classifier_wikiner_en.html" target="_blank" rel="noopener">CamemBERT Token Classification on Spark NLP Hub</a></li>

    </ul>

</div>

""", unsafe_allow_html=True)