import streamlit as st # Page configuration st.set_page_config( layout="wide", initial_sidebar_state="auto" ) # Custom CSS for better styling st.markdown(""" """, unsafe_allow_html=True) # Title st.markdown('

Introduction to RoBERTa Annotators in Spark NLP

', unsafe_allow_html=True) # Subtitle st.markdown("""

RoBERTa (A Robustly Optimized BERT Pretraining Approach) builds on BERT's language model by modifying key hyperparameters and pretraining techniques to enhance its performance. RoBERTa achieves state-of-the-art results in various NLP tasks. Below, we provide an overview of the RoBERTa annotator for token classification, zero-shot classification, and sequence classification:

""", unsafe_allow_html=True) tab1, tab2, tab3, tab4 = st.tabs(["RoBERTa for Token Classification", "RoBERTa for Zero Shot Classification", "RoBERTa for Sequence Classification", "RoBERTa for Question Answering"]) with tab1: st.markdown("""

RoBERTa for Token Classification

The RoBertaForTokenClassification annotator is designed for Named Entity Recognition (NER) tasks using the RoBERTa model. This pretrained model is adapted from a Hugging Face model and imported into Spark NLP, offering robust performance in identifying and classifying entities in text. The RoBERTa model, with its large-scale pretraining, delivers state-of-the-art results on NER tasks.

Token classification with RoBERTa enables:

Named Entity Recognition (NER): Identifying and classifying entities such as miscellaneous (MISC), organizations (ORG), locations (LOC), and persons (PER).
Information Extraction: Extracting key information from unstructured text for further analysis.
Text Categorization: Enhancing document retrieval and categorization based on entity recognition.

Here is an example of how RoBERTa token classification works:

Entity	Label
Apple	ORG
Elon Musk	PER
California	LOC

""", unsafe_allow_html=True) # RoBERTa Token Classification - NER Large st.markdown('

RoBERTa Token Classification - NER Large

', unsafe_allow_html=True) st.markdown("""

The roberta_ner_roberta_large_ner_english is a fine-tuned RoBERTa model for token classification tasks, specifically adapted for Named Entity Recognition (NER) on English text. It recognizes four types of entities: location (LOC), organizations (ORG), person (PER), and Miscellaneous (MISC).

""", unsafe_allow_html=True) # How to Use the Model - Token Classification st.markdown('

How to Use the Model

', unsafe_allow_html=True) st.code(''' from sparknlp.base import * from sparknlp.annotator import * from pyspark.ml import Pipeline from pyspark.sql.functions import col, expr document_assembler = DocumentAssembler() \\ .setInputCol("text") \\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\\ .setInputCols(["document"])\\ .setOutputCol("sentence") tokenizer = Tokenizer() \\ .setInputCols(["sentence"]) \\ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification \\ .pretrained("roberta_ner_roberta_large_ner_english", "en") \\ .setInputCols(["sentence", "token"]) \\ .setOutputCol("ner") ner_converter = NerConverter() \\ .setInputCols(['sentence', 'token', 'ner']) \\ .setOutputCol('entities') pipeline = Pipeline(stages=[ document_assembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter ]) data = spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000.[9] He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]]).toDF("text") result = pipeline.fit(data).transform(data) result.select( expr("explode(entities) as ner_chunk") ).select( col("ner_chunk.result").alias("chunk"), col("ner_chunk.metadata.entity").alias("ner_label") ).show(truncate=False) ''', language='python') # Results st.text(""" +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates III |R | |American |SC | |Microsoft Corporation |G | |Microsoft |G | |Gates |R | |Seattle |C | |Washington |C | |Gates co-founded Microsoft |R | |Paul Allen |R | |Albuquerque |C | |New Mexico |C | |Gates |R | |Gates |R | |Gates |R | |Microsoft |G | |Bill & Melinda Gates Foundation|G | |Melinda Gates |R | |Ray Ozzie |R | |Craig Mundie |R | |Microsoft |G | +-------------------------------+---------+ """) # Model Info Section st.markdown('

Model Info

', unsafe_allow_html=True) st.markdown("""

Model Name: roberta_ner_roberta_large_ner_english
Compatibility: Spark NLP 3.4.2+
License: Open Source
Edition: Official
Input Labels: [document, token]
Output Labels: [ner]
Language: English (en)
Size: 1.3 GB
Case Sensitive: True
Max Sentence Length: 128

""", unsafe_allow_html=True) # References Section st.markdown('

References

', unsafe_allow_html=True) st.markdown("""

""", unsafe_allow_html=True) with tab2: # RoBERTa Zero-Shot Classification st.markdown("""

RoBERTa for Zero-Shot Classification

The RoBertaForZeroShotClassification annotator is designed for zero-shot text classification, particularly in English. This model utilizes the RoBERTa Base architecture fine-tuned on Natural Language Inference (NLI) tasks, allowing it to classify text into labels it has not seen during training.

Key features of this model include:

Zero-Shot Classification: Classify text into dynamic categories defined at runtime without requiring predefined classes.
Flexibility: Adjusts to different classification scenarios by specifying candidate labels as needed.
Model Foundation: Based on RoBERTa and fine-tuned with NLI data for robust performance across various tasks.

This model is ideal for applications where predefined categories are not available or frequently change, offering flexibility and adaptability in text classification tasks.

Text	Predicted Category
"I have a problem with my iPhone that needs to be resolved ASAP!!"	Urgent
"The latest advancements in technology are fascinating."	Technology

""", unsafe_allow_html=True) # RoBERTA Zero-Shot Classification Base - NLI st.markdown('

RoBERTA Zero-Shot Classification Base - NLI

', unsafe_allow_html=True) st.markdown("""

The roberta_base_zero_shot_classifier_nli model is tailored for zero-shot text classification tasks, enabling dynamic classification based on labels specified at runtime. Fine-tuned on Natural Language Inference (NLI) tasks, this model leverages the RoBERTa architecture to provide flexible and robust classification capabilities.

""", unsafe_allow_html=True) # How to Use the Model - Zero-Shot Classification st.markdown('

How to Use the Model

', unsafe_allow_html=True) st.code(''' from sparknlp.base import * from sparknlp.annotator import * from pyspark.ml import Pipeline document_assembler = DocumentAssembler() \\ .setInputCol('text') \\ .setOutputCol('document') tokenizer = Tokenizer() \\ .setInputCols(['document']) \\ .setOutputCol('token') zeroShotClassifier = RoBertaForZeroShotClassification \\ .pretrained('roberta_base_zero_shot_classifier_nli', 'en') \\ .setInputCols(['token', 'document']) \\ .setOutputCol('class') \\ .setCaseSensitive(False) \\ .setMaxSentenceLength(512) \\ .setCandidateLabels(["urgent", "mobile", "travel", "movie", "music", "sport", "weather", "technology"]) pipeline = Pipeline(stages=[ document_assembler, tokenizer, zeroShotClassifier ]) example = spark.createDataFrame([['I have a problem with my iPhone that needs to be resolved ASAP!!']]).toDF("text") result = pipeline.fit(example).transform(example) result.select('document.result', 'class.result').show(truncate=False) ''', language='python') st.text(""" +------------------------------------------------------------------+------------+ |result |result | +------------------------------------------------------------------+------------+ |[I have a problem with my iPhone that needs to be resolved ASAP!!]|[technology]| +------------------------------------------------------------------+------------+ """) # Model Information - Zero-Shot Classification st.markdown('

Model Information

', unsafe_allow_html=True) st.markdown("""

Attribute	Description
Model Name	roberta_base_zero_shot_classifier_nli
Compatibility	Spark NLP 4.4.2+
License	Open Source
Edition	Official
Input Labels	[token, document]
Output Labels	[multi_class]
Language	en
Size	466.4 MB
Case Sensitive	true

""", unsafe_allow_html=True) # References - Zero-Shot Classification st.markdown('

References

', unsafe_allow_html=True) st.markdown("""

""", unsafe_allow_html=True) with tab3: # RoBERTa Sequence Classification st.markdown("""

RoBERTa for Sequence Classification

The RoBertaForSequenceClassification annotator is designed for tasks such as sentiment analysis and sequence classification using the RoBERTa model. This model handles classification tasks efficiently and is adapted for production-readiness with Spark NLP.

Sequence classification with RoBERTa enables:

Sentiment Analysis: Determining sentiment expressed in text as negative, neutral, or positive.
Text Classification: Categorizing text into predefined classes such as sentiment or topic categories.
Document Analysis: Enhancing the analysis and categorization of documents based on content.

Here is an example of how RoBERTa sequence classification works:

Text	Label
The new RoBERTa model shows significant improvements in performance.	Positive
The training was not very effective and did not yield desired results.	Negative
The overall feedback on the new features has been mixed.	Neutral

""", unsafe_allow_html=True) # RoBERTa Sequence Classification - ACTS Feedback1 st.markdown('

RoBERTa Sequence Classification - ACTS Feedback1

', unsafe_allow_html=True) st.markdown("""

The roberta_classifier_acts_feedback1 model is a fine-tuned RoBERTa model for sequence classification tasks, specifically adapted for English text. This model was originally trained by mp6kv and is curated to provide scalability and production-readiness using Spark NLP. It can classify text into three categories: negative, neutral, and positive.

""", unsafe_allow_html=True) # How to Use the Model - Sequence Classification st.markdown('

How to Use the Model

Model Info

', unsafe_allow_html=True) st.markdown("""

Model Name: roberta_classifier_acts_feedback1
Compatibility: Spark NLP 5.2.0+
License: Open Source
Edition: Official
Input Labels: [document, token]
Output Labels: [class]
Language: en
Size: 424.8 MB
Case Sensitive: True
Max Sentence Length: 256

""", unsafe_allow_html=True) # References Section st.markdown('

References

', unsafe_allow_html=True) st.markdown("""

""", unsafe_allow_html=True) with tab4: st.markdown("""

RoBERTa for Question Answering

The RoBertaForQuestionAnswering annotator is designed for extracting answers from a given context based on a specific question. This model leverages RoBERTa's capabilities to accurately find and provide answers, making it suitable for applications that require detailed information retrieval. Question answering with RoBERTa is especially useful for:

Building Advanced QA Systems: Developing systems capable of answering user queries with high accuracy.
Enhancing Customer Service: Providing precise answers to customer questions in support environments.
Improving Information Retrieval: Extracting specific answers from large text corpora.

Utilizing this annotator can significantly enhance your ability to retrieve and deliver accurate answers from text data.

Context	Question	Predicted Answer
"The Eiffel Tower is one of the most recognizable structures in the world. It was constructed in 1889 as the entrance arch to the 1889 World's Fair held in Paris, France."	"When was the Eiffel Tower constructed?"	1889
"The Amazon rainforest, also known as Amazonia, is a vast tropical rainforest in South America. It is home to an incredible diversity of flora and fauna."	"What is the Amazon rainforest also known as?"	Amazonia
"The Great Wall of China is a series of fortifications made of various materials, stretching over 13,000 miles across northern China."	"How long is the Great Wall of China?"	13,000 miles

""", unsafe_allow_html=True) # RoBERTa for Question Answering - icebert_finetuned_squad_10 st.markdown('

icebert_finetuned_squad_10

', unsafe_allow_html=True) st.markdown("""

This model is a pretrained RoBERTa model, adapted from Hugging Face, specifically fine-tuned for question-answering tasks. It has been curated to provide scalability and production-readiness using Spark NLP. The icebert_finetuned_squad_10 model is originally trained by gudjonk93 for English language tasks.

""", unsafe_allow_html=True) # How to Use the Model - Question Answering st.markdown('

How to Use the Model

', unsafe_allow_html=True) st.code(''' from sparknlp.base import * from sparknlp.annotator import * from pyspark.ml import Pipeline # Document Assembler document_assembler = MultiDocumentAssembler() \\ .setInputCols(["question", "context"]) \\ .setOutputCols(["document_question", "document_context"]) # RoBertaForQuestionAnswering spanClassifier = RoBertaForQuestionAnswering.pretrained("icebert_finetuned_squad_10", "en") \\ .setInputCols(["document_question", "document_context"]) \\ .setOutputCol("answer") # Pipeline pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) # Create example DataFrame example = spark.createDataFrame([ ["What's my name?", "My name is Clara and I live in Berkeley."] ]).toDF("question", "context") # Fit and transform the data pipelineModel = pipeline.fit(example) result = pipelineModel.transform(example) # Show results result.select('document_question.result', 'answer.result').show(truncate=False) ''', language='python') st.text(""" +-----------------+-------+ |result |result | +-----------------+-------+ |[What's my name?]|[Clara]| +-----------------+-------+ """) # Model Information - Question Answering st.markdown('

Model Information

', unsafe_allow_html=True) st.markdown("""

Attribute	Description
Model Name	icebert_finetuned_squad_10
Compatibility	Spark NLP 5.2.1+
License	Open Source
Edition	Official
Input Labels	[document_question, document_context]
Output Labels	[answer]
Language	en
Size	450.4 MB

""", unsafe_allow_html=True) # References - Question Answering st.markdown('

References

', unsafe_allow_html=True) st.markdown("""

""", unsafe_allow_html=True) # Community & Support st.markdown('

Community & Support

', unsafe_allow_html=True) st.markdown("""

Official Website: Documentation and examples
Slack: Live discussion with the community and team
GitHub: Bug reports, feature requests, and contributions
Medium: Spark NLP articles
YouTube: Video tutorials

""", unsafe_allow_html=True)