import streamlit as st
# Function to display the Home Page
def show_home_page():
    st.title("Natural Language Processing (NLP)")
    st.markdown(
        """
### Welcome to the NLP Guide
Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between
computers and humans through natural language. It enables machines to read, understand, and respond to human
language in a way that is both meaningful and useful. NLP powers a wide range of applications like chatbots,
translation tools, sentiment analysis, and search engines.
Use the menu in the sidebar to explore each topic in detail.
"""
    )
# Function to display specific topic pages
def show_page(page):
    if page == "NLP Terminologies":
        st.title("NLP Terminologies")
        st.markdown(
            """
### NLP Terminologies (Detailed Explanation)
- **Tokenization**: Breaking text into smaller units like words or sentences.
- **Stop Words**: Commonly used words (e.g., "the", "is") often removed during preprocessing.
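- **Stemming**: Heuristically stripping suffixes to reduce words to a stem, which may not be a real word (e.g., "running" -> "run", "studies" -> "studi").
- **Lemmatization**: Converting words to their dictionary base forms using vocabulary and part of speech (e.g., "running" -> "run", "better" -> "good").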
- **Corpus**: A large collection of text used for NLP training and analysis.
- **Vocabulary**: The set of all unique words in a corpus.
- **n-grams**: Contiguous sequences of n words or characters from a text.
- **POS Tagging**: Assigning parts of speech to words.
- **NER (Named Entity Recognition)**: Identifying names, places, organizations, etc.
- **Parsing**: Analyzing grammatical structure of text.
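#### Code Sketch:
As a small illustration of the n-gram entry above, the following sketch slides a window of size n across a token list. The helper name `ngrams` and the whitespace tokenization are assumptions made for this example, not part of any library.

```python
# Hypothetical helper: collect every contiguous run of n tokens.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace splitting is a simplification; real tokenizers do more.
tokens = "natural language processing is fun".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
```

The same function yields character n-grams if it is given `list(word)` instead of a list of words.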
"""
        )
    elif page == "One-Hot Vectorization":
        st.title("One-Hot Vectorization")
        st.markdown(
            """
### One-Hot Vectorization
One-hot vectorization represents each word in the vocabulary as a binary vector whose length equals the vocabulary size.
#### How It Works:
- Each unique word in the corpus is assigned an index.
- The vector for a word is all zeros except for a 1 at the index corresponding to that word.
#### Example:
Vocabulary: ["cat", "dog", "bird"]
- "cat" -> [1, 0, 0]
- "dog" -> [0, 1, 0]
- "bird" -> [0, 0, 1]
#### Advantages:
- Simple to implement.
#### Limitations:
- High dimensionality for large vocabularies.
- Does not capture semantic relationships: every pair of distinct words is equally far apart (e.g., "cat" is no closer to "kitten" than to "car").
#### Applications:
- Useful for small datasets and when computational simplicity is prioritized.
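#### Code Sketch:
The example above can be reproduced with a few lines of plain Python. This is a minimal sketch; `word_to_index` and `one_hot` are illustrative names chosen here, and a real pipeline would usually rely on a library encoder instead.

```python
vocabulary = ["cat", "dog", "bird"]

# Assign each unique word an index.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # All zeros except for a 1 at the word's index.
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

for word in vocabulary:
    print(word, one_hot(word))
```

A word outside the vocabulary raises a `KeyError` in this sketch, which mirrors the fact that one-hot encoding has no representation for unseen words.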
"""
        )
    elif page == "Bag of Words":
        st.title("Bag of Words (BoW)")
        st.markdown(
            """
### Bag of Words (BoW)
Bag of Words is a method of representing text data as word frequency counts without considering word order.
#### How It Works:
1. Create a vocabulary of all unique words in the text.
2. Count the frequency of each word in a document.
#### Example:
Given two sentences:
- "I love NLP."
- "I love programming."
Vocabulary: ["I", "love", "NLP", "programming"]
- Sentence 1: [1, 1, 1, 0]
- Sentence 2: [1, 1, 0, 1]
#### Advantages:
- Simple to implement.
#### Limitations:
- High dimensionality for large vocabularies.
- Does not consider word order or semantic meaning.
- Frequent but uninformative words (e.g., stop words) can dominate the counts.
#### Applications:
- Text classification and clustering.
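#### Code Sketch:
The two steps above (build a vocabulary, then count words per document) can be sketched as follows. The `tokenize` helper is a deliberately naive assumption that only strips periods and splits on spaces; it is not how a production tokenizer behaves.

```python
from collections import Counter

documents = ["I love NLP.", "I love programming."]

def tokenize(text):
    # Naive tokenizer (assumption): drop periods, split on whitespace.
    return text.replace(".", "").split()

# Step 1: vocabulary of unique words, in first-seen order.
vocabulary = []
for doc in documents:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)

# Step 2: count how often each vocabulary word occurs in a document.
def bag_of_words(text):
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc) for doc in documents]
print(vocabulary)
print(vectors)
```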
"""
        )
    elif page == "TF-IDF Vectorizer":
        st.title("TF-IDF Vectorizer")
        st.markdown(
            """
### TF-IDF Vectorizer
Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure that evaluates the importance of a word in a document relative to a collection of documents (corpus).
#### Formula:
$$
\\text{TF-IDF} = \\text{TF} \\times \\text{IDF}
$$
- **Term Frequency (TF)**: Number of times a term appears in a document divided by total terms in the document.
- **Inverse Document Frequency (IDF)**: Logarithm of total documents divided by the number of documents containing the term.
#### Advantages:
- Reduces the weight of common words.
- Highlights unique and important words.
#### Example:
For the corpus:
- Doc1: "NLP is amazing."
- Doc2: "NLP is fun and amazing."
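Because "NLP", "is", and "amazing" appear in both documents, their IDF is log(2/2) = 0, so their TF-IDF scores are 0. Only "fun" and "and", which occur in Doc2 alone, receive non-zero scores. In a larger corpus, a stop word like "and" would also appear in most documents and be down-weighted.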
#### Applications:
- Search engines, information retrieval, and document classification.
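#### Code Sketch:
The definitions above can be applied directly to the two example documents. This is a minimal sketch using the unsmoothed formulas from this page; the helper names and the lowercase, period-stripping tokenizer are assumptions. Libraries such as scikit-learn apply smoothed variants by default, so their numbers will differ.

```python
import math

documents = {
    "Doc1": "NLP is amazing.",
    "Doc2": "NLP is fun and amazing.",
}

def tokenize(text):
    # Naive tokenizer (assumption): lowercase, drop periods, split on spaces.
    return text.lower().replace(".", "").split()

tokenized = {name: tokenize(text) for name, text in documents.items()}

def tf(term, tokens):
    # Occurrences of the term divided by total terms in the document.
    return tokens.count(term) / len(tokens)

def idf(term):
    # log(total documents / documents containing the term).
    # Only valid for terms that occur somewhere in the corpus.
    containing = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log(len(tokenized) / containing)

def tf_idf(term, name):
    return tf(term, tokenized[name]) * idf(term)

scores = {term: round(tf_idf(term, "Doc2"), 3) for term in tokenized["Doc2"]}
print(scores)
```

Under these definitions, any word that occurs in every document scores 0, regardless of how often it appears.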
"""
        )
    elif page == "Word2Vec":
        st.title("Word2Vec")
        st.markdown(
            """
### Word2Vec
Word2Vec is a neural network-based technique for creating dense vector representations of words, capturing their semantic relationships.
#### Key Concepts:
- **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
- **Skip-gram**: Predicts the context from the target word.
#### Advantages:
- Captures semantic meaning (e.g., "king" - "man" + "woman" ≈ "queen").
- Efficient for large datasets.
#### Applications:
- Text classification, sentiment analysis, and recommendation systems.
#### Limitations:
- Training on large corpora requires significant computational resources.
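#### Code Sketch:
To make the Skip-gram idea concrete, the sketch below builds the (target, context) training pairs that such a model would learn from. It shows only how the examples are generated, not the neural network training itself; `skipgram_pairs` and the `window` parameter are hypothetical names for this illustration.

```python
def skipgram_pairs(tokens, window=2):
    # Pair each target word with every word within `window` positions of it.
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["the", "king", "rules", "the", "land"]
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "->", context)
```

CBOW uses the same windows in the opposite direction: the context words are the input and the target word is the prediction.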
"""
        )
    elif page == "FastText":
        st.title("FastText")
        st.markdown(
            """
### FastText
FastText is an extension of Word2Vec that represents words as a combination of character n-grams.
#### Advantages:
- Handles rare and out-of-vocabulary words.
- Captures subword information (e.g., prefixes and suffixes).
#### Example:
The word "playing" might be represented by character 3-grams such as "pla", "lay", "ayi", "yin", and "ing".
#### Applications:
- Multilingual text processing.
- Handling noisy and incomplete data.
#### Limitations:
- Higher computational cost compared to Word2Vec.
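#### Code Sketch:
The character n-grams in the example can be extracted as shown below. This is a simplified sketch for a single n-gram length; `char_ngrams` is a hypothetical helper. FastText itself wraps each word in the boundary markers `<` and `>` and combines several n-gram lengths, which this sketch only partly mirrors.

```python
def char_ngrams(word, n=3):
    # Boundary markers let prefixes and suffixes form their own n-grams.
    marked = "<" + word + ">"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

grams = char_ngrams("playing")
print(grams)
```

Because an unseen word still shares n-grams such as "ing" with known words, a vector can be composed for it from those pieces.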
"""
        )
    elif page == "Tokenization":
        st.title("Tokenization")
        st.markdown(
            """
### Tokenization
Tokenization is the process of breaking text into smaller units (tokens) such as words, phrases, or sentences.
#### Types of Tokenization:
- **Word Tokenization**: Splits text into words.
- **Sentence Tokenization**: Splits text into sentences.
#### Libraries for Tokenization:
- NLTK, spaCy, and Hugging Face Transformers.
#### Example:
Sentence: "NLP is exciting."
- Word Tokens: ["NLP", "is", "exciting", "."]
#### Applications:
- Preprocessing for machine learning models.
#### Challenges:
- Handling complex text like abbreviations and multilingual data.
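#### Code Sketch:
Both tokenization types can be approximated with regular expressions. This is a rough sketch under simple assumptions (ASCII text, a fixed punctuation set); the function names are made up for this example, and the libraries listed above handle far more cases.

```python
import re

def simple_word_tokenize(text):
    # A token is a run of letters, digits, or apostrophes,
    # or a single punctuation mark from a small fixed set.
    return re.findall(r"[A-Za-z0-9']+|[.,!?;:]", text)

def simple_sentence_tokenize(text):
    # Split on spaces that follow ., ! or ? (breaks on abbreviations like "Dr.").
    return re.split(r"(?<=[.!?]) +", text.strip())

words = simple_word_tokenize("NLP is exciting.")
sentences = simple_sentence_tokenize("NLP is exciting. It is also hard.")
print(words)
print(sentences)
```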
"""
        )
    elif page == "Stop Words":
        st.title("Stop Words")
        st.markdown(
            """
### Stop Words
Stop words are commonly used words in a language that are often removed during text preprocessing.
#### Examples of Stop Words:
- English: "is", "the", "and", "in".
- Spanish: "es", "el", "y", "en".
#### Why Remove Stop Words?
- To reduce noise in text data.
#### Applications:
- Sentiment analysis, text classification, and search engines.
#### Challenges:
- Some stop words carry meaning in context (e.g., removing "not" can flip the sentiment of a sentence).
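#### Code Sketch:
Stop word removal is usually a simple filter over tokens. The sketch below uses only the four English examples listed above as its stop list, which is an assumption for illustration; real stop lists from libraries are much longer.

```python
# Tiny illustrative stop list taken from the English examples on this page.
STOP_WORDS = {"is", "the", "and", "in"}

def remove_stop_words(tokens):
    # Compare in lowercase so "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The cat is in the garden and the dog is not".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

The word "not" survives here only because it is absent from this small list; many standard stop lists include it, which is why the list should be reviewed for the task at hand.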
"""
        )
# Sidebar navigation
st.sidebar.title("NLP Topics")
menu_options = [
    "Home",
    "NLP Terminologies",
    "One-Hot Vectorization",
    "Bag of Words",
    "TF-IDF Vectorizer",
    "Word2Vec",
    "FastText",
    "Tokenization",
    "Stop Words",
]
selected_page = st.sidebar.radio("Select a topic", menu_options)
# Display the selected page
if selected_page == "Home":
    show_home_page()
else:
    show_page(selected_page)