import streamlit as st
# Main Title
st.markdown("<h1>Named Entity Recognition (NER) in Turkish with Spark NLP</h1>", unsafe_allow_html=True)
# Introduction
st.markdown("""
Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that involves identifying and classifying key information in a text into predefined categories. On this page, we present two different pipelines for performing NER on Turkish texts using Spark NLP:

- A pipeline using GloVe embeddings with the `turkish_ner_840B_300` model.
- A pipeline using BERT embeddings with the `turkish_ner_bert` model.
""", unsafe_allow_html=True)
# Pipeline 1: Turkish NER with GloVe Embeddings
st.markdown('<h2>Pipeline 1: Turkish NER with GloVe Embeddings</h2>', unsafe_allow_html=True)
st.write("")
with st.expander("Turkish NER 840B_300"):
    st.components.v1.html(
        """
        """,
        height=600
    )
st.markdown("""
This pipeline uses GloVe embeddings to perform Named Entity Recognition. The `turkish_ner_840B_300` model is a pre-trained NER model for Turkish built on multilingual GloVe embeddings trained on 840 billion tokens with 300-dimensional vectors. The pipeline includes the following stages:

- **Document Assembler**: Converts raw text into a format suitable for NLP processing.
- **Sentence Detector**: Splits the text into sentences.
- **Tokenizer**: Breaks sentences into tokens.
- **Word Embeddings**: Uses GloVe embeddings to represent tokens.
- **NER Model**: Applies the NER model to identify named entities.
- **NER Converter**: Converts the NER output into chunks representing named entities.

Here is how you can set up and use this pipeline:
""", unsafe_allow_html=True)
st.code("""
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Document Assembler: converts raw text into annotated documents
documentAssembler = DocumentAssembler() \\
    .setInputCol("text") \\
    .setOutputCol("document")

# Sentence Detector: splits documents into sentences
sentenceDetector = SentenceDetector() \\
    .setInputCols(["document"]) \\
    .setOutputCol("sentence")

# Tokenizer: breaks sentences into tokens
tokenizer = Tokenizer() \\
    .setInputCols(["sentence"]) \\
    .setOutputCol("token")

# Word Embeddings: multilingual GloVe embeddings (840B tokens, 300 dimensions)
embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx") \\
    .setInputCols(["sentence", "token"]) \\
    .setOutputCol("embeddings") \\
    .setCaseSensitive(True)

# NER Model: pre-trained Turkish NER model
public_ner = NerDLModel.pretrained("turkish_ner_840B_300", "tr") \\
    .setInputCols(["sentence", "token", "embeddings"]) \\
    .setOutputCol("ner")

# NER Converter: groups token-level NER tags into entity chunks
ner_converter = NerConverter() \\
    .setInputCols(["sentence", "token", "ner"]) \\
    .setOutputCol("ner_chunk")

# Assemble the pipeline
nlp_pipeline = Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        public_ner,
        ner_converter
    ]
)
""", language="python")
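The NER Converter stage merges token-level IOB tags (e.g. `B-PER`, `I-PER`, `O`) into entity chunks. As a rough illustration of that grouping idea, here is a plain-Python sketch (this is not Spark NLP's implementation, just the concept on simple lists):

```python
# Illustrative sketch of IOB tag grouping, as performed conceptually by
# NerConverter. Not Spark NLP code; it only shows the idea on plain lists.
def group_iob_chunks(tokens, tags):
    """Group (token, IOB-tag) pairs into (chunk_text, entity_label) chunks."""
    chunks, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new chunk, closing any open one
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with a matching label continues the open chunk
            current_tokens.append(token)
        else:
            # An O tag (or inconsistent I- tag) closes the open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["Mustafa", "Kemal", "Ankara", "'ya", "gitti"]
tags = ["B-PER", "I-PER", "B-LOC", "O", "O"]
print(group_iob_chunks(tokens, tags))
# [('Mustafa Kemal', 'PER'), ('Ankara', 'LOC')]
```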
# Pipeline 2: Turkish NER with BERT Embeddings
st.markdown('<h2>Pipeline 2: Turkish NER with BERT Embeddings</h2>', unsafe_allow_html=True)
st.write("")
with st.expander("Turkish NER Bert"):
    st.components.v1.html(
        """
        """,
        height=600
    )
st.markdown("""
This pipeline uses BERT embeddings for Named Entity Recognition. The `turkish_ner_bert` model leverages BERT embeddings to achieve state-of-the-art results for NER tasks in Turkish. The pipeline consists of the following stages:

- **Document Assembler**: Converts raw text into a format suitable for NLP processing.
- **Sentence Detector**: Splits the text into sentences.
- **Tokenizer**: Breaks sentences into tokens.
- **BERT Embeddings**: Uses BERT embeddings to represent tokens.
- **NER Model**: Applies the NER model to identify named entities.
- **NER Converter**: Converts the NER output into chunks representing named entities.

Here is how you can set up and use this pipeline:
""", unsafe_allow_html=True)
st.code("""
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Document Assembler: converts raw text into annotated documents
documentAssembler = DocumentAssembler() \\
    .setInputCol("text") \\
    .setOutputCol("document")

# Sentence Detector: splits documents into sentences
sentenceDetector = SentenceDetector() \\
    .setInputCols(["document"]) \\
    .setOutputCol("sentence")

# Tokenizer: breaks sentences into tokens
tokenizer = Tokenizer() \\
    .setInputCols(["sentence"]) \\
    .setOutputCol("token")

# BERT Embeddings: multilingual cased BERT embeddings
embeddings = BertEmbeddings.pretrained("bert_multi_cased", "xx") \\
    .setInputCols(["sentence", "token"]) \\
    .setOutputCol("embeddings")

# NER Model: pre-trained Turkish NER model built on BERT embeddings
public_ner = NerDLModel.pretrained("turkish_ner_bert", "tr") \\
    .setInputCols(["sentence", "token", "embeddings"]) \\
    .setOutputCol("ner")

# NER Converter: groups token-level NER tags into entity chunks
ner_converter = NerConverter() \\
    .setInputCols(["sentence", "token", "ner"]) \\
    .setOutputCol("ner_chunk")

# Assemble the pipeline
nlp_pipeline = Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        public_ner,
        ner_converter
    ]
)
""", language="python")
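After fitting either pipeline and transforming a DataFrame, each output row's `ner_chunk` column holds annotations carrying the chunk text in a `result` field and the entity label in the annotation's metadata. As a sketch of pulling `(text, label)` pairs out of such a row, using plain dicts to stand in for Spark NLP's annotation objects (the field names mirror Spark NLP's output, but the row below is hypothetical, not real model output):

```python
# Sketch: extracting (entity_text, entity_label) pairs from a ner_chunk row.
# Plain dicts stand in for Spark NLP annotation objects here.
def extract_entities(ner_chunks):
    """Return (chunk text, entity label) pairs from ner_chunk annotations."""
    return [(ann["result"], ann["metadata"]["entity"]) for ann in ner_chunks]

# Hypothetical ner_chunk contents for one input sentence:
row = [
    {"result": "Mustafa Kemal", "metadata": {"entity": "PER"}},
    {"result": "Ankara", "metadata": {"entity": "LOC"}},
]
print(extract_entities(row))
# [('Mustafa Kemal', 'PER'), ('Ankara', 'LOC')]
```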
# Summary
st.markdown('<h2>Summary</h2>', unsafe_allow_html=True)
st.markdown("""
We have outlined two pipelines for performing Named Entity Recognition (NER) on Turkish texts using Spark NLP. The first pipeline uses GloVe embeddings, and the second one uses BERT embeddings. Both pipelines include stages for document assembly, sentence detection, tokenization, embedding generation, NER model application, and conversion of NER results into entity chunks.
These pipelines provide flexible options for leveraging pre-trained models in different contexts, allowing for scalable and accurate NER in Turkish.
""", unsafe_allow_html=True)