import streamlit as st
# Main Title
st.markdown("<h1>Named Entity Recognition (NER) in Turkish with Spark NLP</h1>", unsafe_allow_html=True)
# Introduction
st.markdown("""
Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that involves identifying and classifying key information in a text into predefined categories. On this page, we present two different pipelines for performing NER on Turkish texts using Spark NLP:

- A pipeline using GloVe embeddings with the `turkish_ner_840B_300` model.
- A pipeline using BERT embeddings with the `turkish_ner_bert` model.
""", unsafe_allow_html=True)
# Pipeline 1: Turkish NER with GloVe Embeddings
st.markdown('<h2>Pipeline 1: Turkish NER with GloVe Embeddings</h2>', unsafe_allow_html=True)
st.write("")
with st.expander("Turkish NER 840B_300"):
    st.components.v1.html(
        """
        """,
        height=600
    )
st.markdown("""
This pipeline uses GloVe embeddings to perform Named Entity Recognition. The `turkish_ner_840B_300` model is a pre-trained NER model for Turkish built on multilingual GloVe embeddings trained on 840 billion tokens with 300-dimensional vectors. The pipeline includes the following stages:

- **Document Assembler**: Converts raw text into a format suitable for NLP processing.
- **Sentence Detector**: Splits the text into sentences.
- **Tokenizer**: Breaks sentences into tokens.
- **Word Embeddings**: Uses GloVe embeddings to represent tokens.
- **NER Model**: Applies the NER model to identify named entities.
- **NER Converter**: Converts the NER output into chunks representing named entities.

Here is how you can set up and use this pipeline:
""", unsafe_allow_html=True)
st.code("""
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Document Assembler: converts raw text into annotated documents
documentAssembler = DocumentAssembler() \\
    .setInputCol("text") \\
    .setOutputCol("document")

# Sentence Detector: splits documents into sentences
sentenceDetector = SentenceDetector() \\
    .setInputCols(["document"]) \\
    .setOutputCol("sentence")

# Tokenizer: breaks sentences into tokens
tokenizer = Tokenizer() \\
    .setInputCols(["sentence"]) \\
    .setOutputCol("token")

# Word Embeddings: multilingual GloVe embeddings (840B tokens, 300 dimensions)
embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx") \\
    .setInputCols(["sentence", "token"]) \\
    .setOutputCol("embeddings") \\
    .setCaseSensitive(True)

# NER Model: pre-trained Turkish NER model
public_ner = NerDLModel.pretrained("turkish_ner_840B_300", "tr") \\
    .setInputCols(["sentence", "token", "embeddings"]) \\
    .setOutputCol("ner")

# NER Converter: groups token-level NER tags into entity chunks
ner_converter = NerConverter() \\
    .setInputCols(["sentence", "token", "ner"]) \\
    .setOutputCol("ner_chunk")

# Assemble the pipeline
nlp_pipeline = Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        public_ner,
        ner_converter
    ]
)
""", language="python")
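The NER Converter stage merges token-level IOB tags (e.g. `B-PER`, `I-PER`, `O`) into entity chunks. As a rough illustration of that grouping idea, here is a plain-Python sketch (this is not Spark NLP's implementation, just the concept on simple lists):

```python
# Illustrative sketch of IOB tag grouping, as performed conceptually by
# NerConverter. Not Spark NLP code; it only shows the idea on plain lists.
def group_iob_chunks(tokens, tags):
    """Group (token, IOB-tag) pairs into (chunk_text, entity_label) chunks."""
    chunks, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new chunk, closing any open one
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with a matching label continues the open chunk
            current_tokens.append(token)
        else:
            # An O tag (or inconsistent I- tag) closes the open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["Mustafa", "Kemal", "Ankara", "'ya", "gitti"]
tags = ["B-PER", "I-PER", "B-LOC", "O", "O"]
print(group_iob_chunks(tokens, tags))
# [('Mustafa Kemal', 'PER'), ('Ankara', 'LOC')]
```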
# Pipeline 2: Turkish NER with BERT Embeddings
st.markdown('<h2>Pipeline 2: Turkish NER with BERT Embeddings</h2>', unsafe_allow_html=True)
st.write("")
with st.expander("Turkish NER Bert"):
    st.components.v1.html(
        """
        """,
        height=600
    )
st.markdown("""
This pipeline uses BERT embeddings for Named Entity Recognition. The `turkish_ner_bert` model leverages BERT embeddings to achieve state-of-the-art results for NER tasks in Turkish. The pipeline consists of the following stages:

- **Document Assembler**: Converts raw text into a format suitable for NLP processing.
- **Sentence Detector**: Splits the text into sentences.
- **Tokenizer**: Breaks sentences into tokens.
- **BERT Embeddings**: Uses BERT embeddings to represent tokens.
- **NER Model**: Applies the NER model to identify named entities.
- **NER Converter**: Converts the NER output into chunks representing named entities.

Here is how you can set up and use this pipeline:
""", unsafe_allow_html=True)
st.code("""
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Document Assembler: converts raw text into annotated documents
documentAssembler = DocumentAssembler() \\
    .setInputCol("text") \\
    .setOutputCol("document")

# Sentence Detector: splits documents into sentences
sentenceDetector = SentenceDetector() \\
    .setInputCols(["document"]) \\
    .setOutputCol("sentence")

# Tokenizer: breaks sentences into tokens
tokenizer = Tokenizer() \\
    .setInputCols(["sentence"]) \\
    .setOutputCol("token")

# BERT Embeddings: multilingual cased BERT embeddings
embeddings = BertEmbeddings.pretrained("bert_multi_cased", "xx") \\
    .setInputCols(["sentence", "token"]) \\
    .setOutputCol("embeddings")

# NER Model: pre-trained Turkish NER model built on BERT embeddings
public_ner = NerDLModel.pretrained("turkish_ner_bert", "tr") \\
    .setInputCols(["sentence", "token", "embeddings"]) \\
    .setOutputCol("ner")

# NER Converter: groups token-level NER tags into entity chunks
ner_converter = NerConverter() \\
    .setInputCols(["sentence", "token", "ner"]) \\
    .setOutputCol("ner_chunk")

# Assemble the pipeline
nlp_pipeline = Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        public_ner,
        ner_converter
    ]
)
""", language="python")
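After fitting either pipeline and transforming a DataFrame, each output row's `ner_chunk` column holds annotations carrying the chunk text in a `result` field and the entity label in the annotation's metadata. As a sketch of pulling `(text, label)` pairs out of such a row, using plain dicts to stand in for Spark NLP's annotation objects (the field names mirror Spark NLP's output, but the row below is hypothetical, not real model output):

```python
# Sketch: extracting (entity_text, entity_label) pairs from a ner_chunk row.
# Plain dicts stand in for Spark NLP annotation objects here.
def extract_entities(ner_chunks):
    """Return (chunk text, entity label) pairs from ner_chunk annotations."""
    return [(ann["result"], ann["metadata"]["entity"]) for ann in ner_chunks]

# Hypothetical ner_chunk contents for one input sentence:
row = [
    {"result": "Mustafa Kemal", "metadata": {"entity": "PER"}},
    {"result": "Ankara", "metadata": {"entity": "LOC"}},
]
print(extract_entities(row))
# [('Mustafa Kemal', 'PER'), ('Ankara', 'LOC')]
```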
# Summary
st.markdown('<h2>Summary</h2>', unsafe_allow_html=True)
st.markdown("""
We have outlined two pipelines for performing Named Entity Recognition (NER) on Turkish texts using Spark NLP. The first pipeline uses GloVe embeddings, and the second one uses BERT embeddings. Both pipelines include stages for document assembly, sentence detection, tokenization, embedding generation, NER model application, and conversion of NER results into entity chunks.
These pipelines provide flexible options for leveraging pre-trained models in different contexts, allowing for scalable and accurate NER in Turkish.
""", unsafe_allow_html=True)