Space: spark-nlp/Turkish-NER

abdullahmubeen10 committed on Jul 27, 2024

Commit 34ec57f (verified)

1 Parent(s): 5ab3596

Upload 5 files

Files changed (5)

.streamlit/config.toml +3 -0
Demo.py +175 -0
Dockerfile +70 -0
pages/Workflow & Model Overview.py +244 -0
requirements.txt +6 -0

.streamlit/config.toml ADDED

	@@ -0,0 +1,3 @@

+[theme]
+base="light"
+primaryColor="#29B4E8"

Demo.py ADDED

	@@ -0,0 +1,175 @@

+import streamlit as st
+import sparknlp
+import os
+import pandas as pd
+from sparknlp.base import *
+from sparknlp.annotator import *
+from pyspark.ml import Pipeline
+from sparknlp.pretrained import PretrainedPipeline
+from annotated_text import annotated_text
+# Page configuration
+st.set_page_config(
+    layout="wide",
+    initial_sidebar_state="auto"
+)
+# CSS for styling
+st.markdown("""
+    <style>
+        .main-title {
+            font-size: 36px;
+            color: #4A90E2;
+            font-weight: bold;
+            text-align: center;
+        }
+        .section {
+            background-color: #f9f9f9;
+            padding: 10px;
+            border-radius: 10px;
+            margin-top: 10px;
+        }
+        .section p, .section ul {
+            color: #666666;
+        }
+    </style>
+""", unsafe_allow_html=True)
+@st.cache_resource
+def init_spark():
+    return sparknlp.start()
+@st.cache_resource
+def create_pipeline(model):
+    documentAssembler = DocumentAssembler()\
+        .setInputCol("text")\
+        .setOutputCol("document")
+    sentenceDetector = SentenceDetector()\
+        .setInputCols(["document"])\
+        .setOutputCol("sentence")
+    tokenizer = Tokenizer()\
+        .setInputCols(["sentence"])\
+        .setOutputCol("token")
+    embeddings = None
+    public_ner = None
+    if model == 'turkish_ner_840B_300' :
+        embeddings = WordEmbeddingsModel.pretrained('glove_840B_300', "xx").\
+                        setInputCols(["sentence", 'token']).\
+                        setOutputCol("embeddings").\
+                        setCaseSensitive(True)
+    elif model == 'turkish_ner_bert' :
+        embeddings = BertEmbeddings.pretrained('bert_multi_cased', 'xx') \
+            .setInputCols(["sentence", "token"])\
+            .setOutputCol("embeddings")
+    public_ner = NerDLModel.pretrained(model, 'tr') \
+            .setInputCols(["sentence", "token", "embeddings"]) \
+            .setOutputCol("ner")
+    ner_converter = NerConverter() \
+            .setInputCols(["sentence", "token", "ner"]) \
+            .setOutputCol("ner_chunk")
+    nlp_pipeline = Pipeline(
+        stages=[
+            documentAssembler,
+            sentenceDetector,
+            tokenizer,
+            embeddings,
+            public_ner,
+            ner_converter])
+    return nlp_pipeline
+def fit_data(pipeline, data):
+    empty_df = spark.createDataFrame([['']]).toDF('text')
+    pipeline_model = pipeline.fit(empty_df)
+    model = LightPipeline(pipeline_model)
+    result = model.fullAnnotate(data)
+    return result
+def annotate(data):
+    document, chunks, labels = data["Document"], data["NER Chunk"], data["NER Label"]
+    annotated_words = []
+    for chunk, label in zip(chunks, labels):
+        before, found, after = document.partition(chunk)
+        if not found:
+            continue  # chunk text not found verbatim (e.g. whitespace differences); skip it
+        annotated_words.extend(part for part in (before, (chunk, label)) if part)
+        document = after
+    if document:
+        annotated_words.append(document)
+    annotated_text(*annotated_words)
+# Set up the page layout
+st.markdown('<div class="main-title">Recognize entities in Turkish text</div>', unsafe_allow_html=True)
+st.markdown('<div class="section"><p>Recognize person, location, organization, and miscellaneous entities in Turkish text using pretrained deep learning NER models with either multilingual BERT embeddings (bert_multi_cased) or GloVe embeddings (glove_840B_300).</p></div>', unsafe_allow_html=True)
+# Sidebar content
+model = st.sidebar.selectbox(
+    "Choose the pretrained model",
+    ["turkish_ner_bert", "turkish_ner_840B_300"],
+    help="For more info about the models visit: https://sparknlp.org/models"
+)
+# Reference notebook link in sidebar
+link = """
+<a href="https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_TR.ipynb">
+    <img src="https://colab.research.google.com/assets/colab-badge.svg" style="zoom: 1.3" alt="Open In Colab"/>
+</a>
+"""
+st.sidebar.markdown('Reference notebook:')
+st.sidebar.markdown(link, unsafe_allow_html=True)
+# Load examples
+examples = [
+    "William Henry Gates III (28 Ekim 1955 doğumlu) Amerikalı bir iş insanı, yazılım geliştiricisi, yatırımcı ve hayırseverdir. En çok Microsoft Corporation'ın kurucu ortağı olarak tanınır. Microsoft'taki kariyerinde Gates, başkan, genel müdür (CEO), başkan ve baş yazılım mimarı gibi görevlerde bulunmuş ve Mayıs 2014'e kadar en büyük bireysel hissedar olarak kalmıştır. 1970'ler ve 1980'ler mikro bilgisayar devriminin en iyi bilinen girişimcilerinden ve öncülerindendir. Seattle, Washington'da doğup büyüyen Gates, 1975 yılında çocukluk arkadaşı Paul Allen ile Microsoft'u kurdu ve şirket, dünyanın en büyük kişisel bilgisayar yazılımı şirketi haline geldi. Gates, şirketi başkan ve CEO olarak yönetti ve Ocak 2000'de CEO olarak görevden ayrıldı, ancak başkan olarak kalmaya devam etti ve baş yazılım mimarı oldu. 1990'ların sonlarında Gates, iş taktikleri nedeniyle eleştirildi; bu görüş, birçok mahkeme kararı tarafından desteklenmiştir. Haziran 2006'da Gates, Microsoft'ta yarı zamanlı bir role geçeceğini ve 2000 yılında kendisi ve eşi Melinda Gates tarafından kurulan Bill & Melinda Gates Vakfı'nda tam zamanlı çalışacağını duyurdu. Görevlerini Ray Ozzie ve Craig Mundie'ye devretti. Şubat 2014'te Microsoft'taki başkanlık görevinden ayrıldı ve yeni atanan CEO Satya Nadella'ya destek olmak için teknoloji danışmanı olarak yeni bir göreve başladı.",
+    "Mona Lisa, Leonardo tarafından yaratılmış 16. yüzyıldan kalma bir yağlı boya tablodur. Louvre'da Paris'te sergilenmektedir.",
+    "Sebastian Thrun, 2007 yılında Google'da kendi kendine giden arabalar üzerinde çalışmaya başladığında, şirket dışındaki pek çok insan onu ciddiye almadı. ‘Size çok kıdemli Amerikan otomobil şirketlerinin CEO'larının elimi sıktığını ve konuşmaya değer biri olmadığım için uzaklaştığını söyleyebilirim’ dedi Thrun, şimdi online yüksek öğrenim girişimi Udacity'nin kurucu ortağı ve CEO'su, bu hafta Recode ile yaptığı bir röportajda.",
+    "Facebook, 4 Şubat 2004'te TheFacebook olarak başlatılan bir sosyal ağ hizmetidir. Mark Zuckerberg tarafından, üniversite arkadaşları ve Harvard Üniversitesi öğrencileri Eduardo Saverin, Andrew McCollum, Dustin Moskovitz ve Chris Hughes ile birlikte kurulmuştur. Web sitesinin üyeliği başlangıçta Harvard öğrencileriyle sınırlıydı, ancak Boston bölgesindeki diğer kolejler, Ivy League ve giderek çoğu üniversiteye genişletilmiştir.",
+    "Doğal dil işleme tarihinin genellikle 1950'lerde başladığı kabul edilir, ancak daha önceki dönemlerde yapılan çalışmalar da vardır. 1950'de, Alan Turing 'Computing Machinery and Intelligence' başlıklı bir makale yayımlamış ve günümüzde Turing testi olarak bilinen zekâ kriterini önermiştir.",
+    "Geoffrey Everest Hinton, yapay sinir ağları üzerindeki çalışmaları ile en çok tanınan İngiliz Kanadalı bilişsel psikolog ve bilgisayar bilimcisidir. 2013'ten beri zamanını Google ve Toronto Üniversitesi'nde geçirmektedir. 2017'de Toronto'daki Vector Institute'in kurucu ortağı olmuş ve Baş Bilimsel Danışman olarak atanmıştır.",
+    "John'a Alaska'ya taşınmak istediğimi söylediğimde, orada bir Starbucks bulmanın zor olacağını bana söyledi.",
+    "Steven Paul Jobs, Amerikalı bir iş insanı, endüstriyel tasarımcı, yatırımcı ve medya sahibi olarak bilinir. Apple Inc.'in başkanı, genel müdürü (CEO) ve kurucu ortağı, Pixar'ın başkanı ve çoğunluk hissedarı, The Walt Disney Company'nin Pixar'ı satın almasının ardından yönetim kurulu üyesi ve NeXT'in kurucusu, başkanı ve CEO'suydu. Jobs, Apple kurucu ortağı Steve Wozniak ile birlikte 1970'ler ve 1980'ler kişisel bilgisayar devriminin öncülerinden biri olarak tanınır. San Francisco, California'da doğmuş ve evlatlık verilmiştir. San Francisco Körfez Bölgesi'nde büyütülmüştür. 1972'de Reed College'a gitmiş, aynı yıl üniversiteden ayrılmış ve 1974'te Hindistan'a giderek aydınlanma arayışında bulunmuş ve Zen Budizmi üzerine çalışmıştır.",
+    "Titanic, James Cameron tarafından yönetilmiş, yazılmış, ortak yapımcılığı ve ortak kurgusu yapılmış 1997 Amerikan epik romantik ve felaket filmidir. Hem tarihi hem de kurgusal yönler içeren film, RMS Titanic'in batışı hakkında anlatımlara dayanır ve Leonardo DiCaprio ile Kate Winslet'i, geminin talihsiz ilk seferinde farklı sosyal sınıflardan gelen aşıklar olarak canlandırır.",
+    "Kuzey'in kralı olmanın dışında, John Snow, İngiliz bir doktor ve anestezi ve tıbbi hijyen gelişiminde lider olarak kabul edilir. 1834'te kolera salgınını veriler kullanarak tedavi eden ilk kişi olarak kabul edilir."
+]
+selected_text = st.selectbox("Select an example", examples)
+custom_input = st.text_input("Try it with your own sentence!")
+text_to_analyze = custom_input if custom_input else selected_text
+st.subheader('Full example text')
+HTML_WRAPPER = """<div class="scroll entities" style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem; white-space:pre-wrap">{}</div>"""
+st.markdown(HTML_WRAPPER.format(text_to_analyze), unsafe_allow_html=True)
+# Initialize Spark and create pipeline
+spark = init_spark()
+pipeline = create_pipeline(model)
+output = fit_data(pipeline, text_to_analyze)
+# Display the text with recognized entities highlighted
+st.subheader("Processed output:")
+results = {
+    'Document': output[0]['document'][0].result,
+    'NER Chunk': [n.result for n in output[0]['ner_chunk']],
+    "NER Label": [n.metadata['entity'] for n in output[0]['ner_chunk']]
+}
+annotate(results)
+with st.expander("View DataFrame"):
+    df = pd.DataFrame({'NER Chunk': results['NER Chunk'], 'NER Label': results['NER Label']})
+    df.index += 1
+    st.dataframe(df)
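
Note on `annotate` in Demo.py: the function walks the document left to right and cuts it into plain-text pieces and `(chunk, label)` pairs before handing them to `annotated_text`. The sketch below isolates that segmentation step as a pure function so it can be checked without Spark or Streamlit. The name `segment_document` and the sample labels are hypothetical, chosen only for illustration; they are not output from either model.

```python
from typing import List, Tuple, Union

# A segment is either plain text or a (chunk, label) pair
Segment = Union[str, Tuple[str, str]]


def segment_document(document: str, chunks: List[str], labels: List[str]) -> List[Segment]:
    """Split `document` into plain-text pieces and (chunk, label) pairs, left to right."""
    segments: List[Segment] = []
    remaining = document
    for chunk, label in zip(chunks, labels):
        if not chunk:
            # An empty chunk cannot be located in the text; ignore it
            continue
        before, found, after = remaining.partition(chunk)
        if not found:
            # The chunk does not occur verbatim in the remaining text; skip it
            continue
        if before:
            segments.append(before)
        segments.append((chunk, label))
        remaining = after
    if remaining:
        segments.append(remaining)
    return segments


# Illustrative input; the labels here are assumed, not produced by a model
example = segment_document(
    "Mona Lisa, Leonardo tarafından yaratılmış bir tablodur.",
    ["Mona Lisa", "Leonardo"],
    ["MISC", "PER"],
)
print(example)
```

Because each search continues from the end of the previous match, repeated mentions of the same entity are matched in order of appearance rather than always at the first occurrence.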

Dockerfile ADDED

	@@ -0,0 +1,70 @@

+# Download base image ubuntu 18.04
+FROM ubuntu:18.04
+# Set environment variables
+ENV NB_USER=jovyan
+ENV NB_UID=1000
+ENV HOME=/home/${NB_USER}
+# Install required packages
+RUN apt-get update && apt-get install -y \
+    tar \
+    wget \
+    bash \
+    rsync \
+    gcc \
+    libfreetype6-dev \
+    libhdf5-serial-dev \
+    libpng-dev \
+    libzmq3-dev \
+    python3 \
+    python3-dev \
+    python3-pip \
+    unzip \
+    pkg-config \
+    software-properties-common \
+    graphviz \
+    openjdk-8-jdk \
+    ant \
+    ca-certificates-java \
+    && apt-get clean \
+    && update-ca-certificates -f;
+# Install Python 3.8 and pip
+RUN add-apt-repository ppa:deadsnakes/ppa \
+    && apt-get update \
+    && apt-get install -y python3.8 python3-pip \
+    && apt-get clean;
+# Set up JAVA_HOME
+ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
+RUN mkdir -p ${HOME} \
+    && echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/" >> ${HOME}/.bashrc \
+    && chown -R ${NB_UID}:${NB_UID} ${HOME}
+# Create a new user named "jovyan" with user ID 1000
+RUN useradd -m -u ${NB_UID} ${NB_USER}
+# Switch to the "jovyan" user
+USER ${NB_USER}
+# Set home and path variables for the user
+ENV HOME=/home/${NB_USER} \
+    PATH=/home/${NB_USER}/.local/bin:$PATH
+# Set the working directory to the user's home directory
+WORKDIR ${HOME}
+# Upgrade pip and install Python dependencies
+RUN python3.8 -m pip install --upgrade pip
+COPY requirements.txt /tmp/requirements.txt
+RUN python3.8 -m pip install -r /tmp/requirements.txt
+# Copy the application code into the container at /home/jovyan
+COPY --chown=${NB_USER}:${NB_USER} . ${HOME}
+# Expose port for Streamlit
+EXPOSE 7860
+# Define the entry point for the container
+ENTRYPOINT ["streamlit", "run", "Demo.py", "--server.port=7860", "--server.address=0.0.0.0"]

pages/Workflow & Model Overview.py ADDED

	@@ -0,0 +1,244 @@

+import streamlit as st
+# Custom CSS for better styling
+st.markdown("""
+    <style>
+        .main-title {
+            font-size: 36px;
+            color: #4A90E2;
+            font-weight: bold;
+            text-align: center;
+        }
+        .sub-title {
+            font-size: 24px;
+            color: #4A90E2;
+            margin-top: 20px;
+        }
+        .section {
+            background-color: #f9f9f9;
+            padding: 15px;
+            border-radius: 10px;
+            margin-top: 20px;
+        }
+        .section h2 {
+            font-size: 22px;
+            color: #4A90E2;
+        }
+        .section p, .section ul {
+            color: #666666;
+        }
+        .link {
+            color: #4A90E2;
+            text-decoration: none;
+        }
+    </style>
+""", unsafe_allow_html=True)
+# Main Title
+st.markdown('<div class="main-title">Named Entity Recognition (NER) in Turkish with Spark NLP</div>', unsafe_allow_html=True)
+# Introduction
+st.markdown("""
+<div class="section">
+    <p>Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that involves identifying and classifying key information in a text into predefined categories. On this page, we present two different pipelines for performing NER on Turkish texts using Spark NLP:</p>
+    <ul>
+        <li>A pipeline using GloVe embeddings with the <code>turkish_ner_840B_300</code> model.</li>
+        <li>A pipeline using BERT embeddings with the <code>turkish_ner_bert</code> model.</li>
+    </ul>
+</div>
+""", unsafe_allow_html=True)
+# Pipeline 1: Turkish NER with GloVe Embeddings
+st.markdown('<div class="sub-title">Pipeline 1: Turkish NER with GloVe Embeddings</div>', unsafe_allow_html=True)
+st.write("")
+with st.expander("Turkish NER 840B_300"):
+    st.components.v1.html(
+        """
+        <iframe
+            src="https://sparknlp.org/2020/11/10/turkish_ner_840B_300_tr.html"
+            width="100%"
+            height="600px"
+            style="border:none;"
+            title="Embedded Website">
+        </iframe>
+        """,
+        height=600
+    )
+st.markdown("""
+<div class="section">
+    <p>This pipeline uses GloVe embeddings to perform Named Entity Recognition. The <code>turkish_ner_840B_300</code> model is a pre-trained NER model for Turkish built on 300-dimensional GloVe embeddings trained on 840 billion tokens (<code>glove_840B_300</code>). The pipeline includes the following stages:</p>
+    <ul>
+        <li><strong>Document Assembler:</strong> Converts raw text into a format suitable for NLP processing.</li>
+        <li><strong>Sentence Detector:</strong> Splits the text into sentences.</li>
+        <li><strong>Tokenizer:</strong> Breaks sentences into tokens.</li>
+        <li><strong>Word Embeddings:</strong> Uses GloVe embeddings to represent tokens.</li>
+        <li><strong>NER Model:</strong> Applies the NER model to identify named entities.</li>
+        <li><strong>NER Converter:</strong> Converts the NER output into chunks representing named entities.</li>
+    </ul>
+    <p>Here is how you can set up this pipeline:</p>
+</div>
+""", unsafe_allow_html=True)
+st.code("""
+from sparknlp.base import *
+from sparknlp.annotator import *
+from pyspark.ml import Pipeline
+# Document Assembler
+documentAssembler = DocumentAssembler()\\
+    .setInputCol("text")\\
+    .setOutputCol("document")
+# Sentence Detector
+sentenceDetector = SentenceDetector()\\
+    .setInputCols(["document"])\\
+    .setOutputCol("sentence")
+# Tokenizer
+tokenizer = Tokenizer()\\
+    .setInputCols(["sentence"])\\
+    .setOutputCol("token")
+# Word Embeddings
+embeddings = WordEmbeddingsModel.pretrained('glove_840B_300', "xx")\\
+    .setInputCols(["sentence", 'token'])\\
+    .setOutputCol("embeddings")\\
+    .setCaseSensitive(True)
+# NER Model
+public_ner = NerDLModel.pretrained('turkish_ner_840B_300', 'tr')\\
+    .setInputCols(["sentence", "token", "embeddings"])\\
+    .setOutputCol("ner")
+# NER Converter
+ner_converter = NerConverter()\\
+    .setInputCols(["sentence", "token", "ner"])\\
+    .setOutputCol("ner_chunk")
+# Pipeline
+nlp_pipeline = Pipeline(
+    stages=[
+        documentAssembler,
+        sentenceDetector,
+        tokenizer,
+        embeddings,
+        public_ner,
+        ner_converter
+    ]
+)
+""", language="python")
+# Pipeline 2: Turkish NER with BERT Embeddings
+st.markdown('<div class="sub-title">Pipeline 2: Turkish NER with BERT Embeddings</div>', unsafe_allow_html=True)
+st.write("")
+with st.expander("Turkish NER Bert"):
+    st.components.v1.html(
+        """
+        <iframe
+            src="https://sparknlp.org/2020/11/10/turkish_ner_bert_tr.html"
+            width="100%"
+            height="600px"
+            style="border:none;"
+            title="Embedded Website">
+        </iframe>
+        """,
+        height=600
+    )
+st.markdown("""
+<div class="section">
+    <p>This pipeline uses BERT embeddings for Named Entity Recognition. The <code>turkish_ner_bert</code> model is a pre-trained NER model for Turkish built on multilingual cased BERT embeddings (<code>bert_multi_cased</code>). The pipeline consists of the following stages:</p>
+    <ul>
+        <li><strong>Document Assembler:</strong> Converts raw text into a format suitable for NLP processing.</li>
+        <li><strong>Sentence Detector:</strong> Splits the text into sentences.</li>
+        <li><strong>Tokenizer:</strong> Breaks sentences into tokens.</li>
+        <li><strong>BERT Embeddings:</strong> Uses BERT embeddings to represent tokens.</li>
+        <li><strong>NER Model:</strong> Applies the NER model to identify named entities.</li>
+        <li><strong>NER Converter:</strong> Converts the NER output into chunks representing named entities.</li>
+    </ul>
+    <p>Here is how you can set up this pipeline:</p>
+</div>
+""", unsafe_allow_html=True)
+st.code("""
+from sparknlp.base import *
+from sparknlp.annotator import *
+from pyspark.ml import Pipeline
+# Document Assembler
+documentAssembler = DocumentAssembler()\\
+    .setInputCol("text")\\
+    .setOutputCol("document")
+# Sentence Detector
+sentenceDetector = SentenceDetector()\\
+    .setInputCols(["document"])\\
+    .setOutputCol("sentence")
+# Tokenizer
+tokenizer = Tokenizer()\\
+    .setInputCols(["sentence"])\\
+    .setOutputCol("token")
+# BERT Embeddings
+embeddings = BertEmbeddings.pretrained('bert_multi_cased', 'xx')\\
+    .setInputCols(["sentence", "token"])\\
+    .setOutputCol("embeddings")
+# NER Model
+public_ner = NerDLModel.pretrained('turkish_ner_bert', 'tr')\\
+    .setInputCols(["sentence", "token", "embeddings"])\\
+    .setOutputCol("ner")
+# NER Converter
+ner_converter = NerConverter()\\
+    .setInputCols(["sentence", "token", "ner"])\\
+    .setOutputCol("ner_chunk")
+# Pipeline
+nlp_pipeline = Pipeline(
+    stages=[
+        documentAssembler,
+        sentenceDetector,
+        tokenizer,
+        embeddings,
+        public_ner,
+        ner_converter
+    ]
+)
+""", language="python")
+# Summary
+st.markdown('<div class="sub-title">Summary</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>We have outlined two pipelines for performing Named Entity Recognition (NER) on Turkish texts using Spark NLP. The first pipeline uses GloVe embeddings, and the second one uses BERT embeddings. Both pipelines include stages for document assembly, sentence detection, tokenization, embedding generation, NER model application, and conversion of NER results into entity chunks.</p>
+    <p>These pipelines provide flexible options for leveraging pre-trained models in different contexts, allowing for scalable and accurate NER in Turkish.</p>
+</div>
+""", unsafe_allow_html=True)
+# References
+st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <ul>
+        <li><a class="link" href="https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/word_embeddings_model/index.html" target="_blank" rel="noopener">WordEmbeddingsModel Documentation</a></li>
+        <li><a class="link" href="https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/bert_embeddings/index.html" target="_blank" rel="noopener">BertEmbeddings Documentation</a></li>
+        <li><a class="link" href="https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/ner_dl_model/index.html" target="_blank" rel="noopener">NerDLModel Documentation</a></li>
+        <li><a class="link" href="https://www.johnsnowlabs.com/spark-nlp/" target="_blank" rel="noopener">Spark NLP Official Site</a></li>
+    </ul>
+</div>
+""", unsafe_allow_html=True)
+# Community & Support
+st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <ul>
+        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
+        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub Repository</a>: Report issues or contribute</li>
+        <li><a class="link" href="https://forum.johnsnowlabs.com/" target="_blank">Community Forum</a>: Ask questions, share ideas, and get support</li>
+    </ul>
+</div>
+""", unsafe_allow_html=True)
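
Note on the "NER Converter" stage described in this page: both pipelines end by turning per-token NER tags into entity chunks. The sketch below illustrates that idea in plain Python for IOB-style tags (`B-PER`, `I-PER`, `O`). It is a simplified illustration under stated assumptions, not Spark NLP's implementation: the function name `iob_to_chunks` is hypothetical, tokens are joined with a single space, and the sample tags are made up rather than taken from either model.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far, if there is one
        if current_tokens and current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag, or an I- tag with a different label, starts a new chunk
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the current chunk
            current_tokens.append(token)
        else:
            # "O" (or anything unrecognized) closes the current chunk
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


# Illustrative tokens and tags; not actual model output
tokens = ["Mark", "Zuckerberg", "Harvard", "Üniversitesi", "öğrencisiydi"]
tags = ["B-PER", "I-PER", "B-ORG", "I-ORG", "O"]
print(iob_to_chunks(tokens, tags))
```

In the demo, the corresponding values are read from the `ner_chunk` column, where each annotation carries the chunk text in `result` and the label in `metadata['entity']`.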

requirements.txt ADDED

	@@ -0,0 +1,6 @@

+streamlit
+st-annotated-text
+pandas
+numpy
+spark-nlp
+pyspark