|
import streamlit as st |
|
|
|
|
|
def show_home_page(): |
|
st.title("Natural Language Processing (NLP)") |
|
st.markdown( |
|
""" |
|
### Welcome to the NLP Guide
|
Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between |
|
computers and humans through natural language. It enables machines to read, understand, and respond to human |
|
language in a way that is both meaningful and useful. NLP powers a wide range of applications like chatbots, |
|
translation tools, sentiment analysis, and search engines. |
|
|
|
Use the menu in the sidebar to explore each topic in detail. |
|
""" |
|
) |
|
|
|
|
|
def show_page(page): |
|
if page == "NLP Terminologies": |
|
st.title("NLP Terminologies") |
|
st.markdown( |
|
""" |
|
### NLP Terminologies (Detailed Explanation) |
|
|
|
- **Tokenization**: Breaking text into smaller units like words or sentences. |
|
- **Stop Words**: Commonly used words (e.g., "the", "is") often removed during preprocessing. |
|
- **Stemming**: Reducing words to their root forms by crude suffix stripping, which may produce non-words (e.g., "studies" -> "studi").
|
- **Lemmatization**: Converting words to their dictionary base forms (e.g., "studies" -> "study").
|
- **Corpus**: A large collection of text used for NLP training and analysis. |
|
- **Vocabulary**: The set of all unique words in a corpus. |
|
- **n-grams**: Contiguous sequences of n words or characters from text.
|
- **POS Tagging**: Assigning parts of speech to words. |
|
- **NER (Named Entity Recognition)**: Identifying names, places, organizations, etc. |
|
- **Parsing**: Analyzing grammatical structure of text. |
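
#### Code Sketch:
A quick look at the stemming/lemmatization distinction above, as a minimal sketch (assumes NLTK and its WordNet data are installed):

```python
import nltk
nltk.download("wordnet", quiet=True)  # one-time data download for the lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

print(PorterStemmer().stem("studies"))           # 'studi' -- crude suffix stripping
print(WordNetLemmatizer().lemmatize("studies"))  # 'study' -- dictionary base form
```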
|
""" |
|
) |
|
elif page == "One-Hot Vectorization": |
|
st.title("One-Hot Vectorization") |
|
st.markdown( |
|
""" |
|
### One-Hot Vectorization |
|
|
|
A simple encoding in which each word in the vocabulary is represented as a binary vector.
|
|
|
#### How It Works: |
|
- Each unique word in the corpus is assigned an index. |
|
- The vector for a word is all zeros except for a 1 at the index corresponding to that word. |
|
|
|
#### Example: |
|
Vocabulary: ["cat", "dog", "bird"] |
|
- "cat" -> [1, 0, 0] |
|
- "dog" -> [0, 1, 0] |
|
- "bird" -> [0, 0, 1] |
|
|
|
#### Advantages: |
|
- Simple to implement. |
|
|
|
#### Limitations: |
|
- High dimensionality for large vocabularies. |
|
- Does not capture semantic relationships (e.g., "cat" and "kitten" are represented as completely unrelated vectors).
|
|
|
#### Applications: |
|
- Useful for small datasets and when computational simplicity is prioritized. |
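
#### Code Sketch:
A minimal hand-rolled version of the encoding above, in plain Python with no external libraries:

```python
vocab = ["cat", "dog", "bird"]

def one_hot(word, vocab):
    # All zeros except a 1 at the word's index in the vocabulary.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("dog", vocab))  # [0, 1, 0]
```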
|
""" |
|
) |
|
elif page == "Bag of Words": |
|
st.title("Bag of Words (BoW)") |
|
st.markdown( |
|
""" |
|
### Bag of Words (BoW) |
|
|
|
Bag of Words is a method of representing text data as word frequency counts without considering word order. |
|
|
|
#### How It Works: |
|
1. Create a vocabulary of all unique words in the text. |
|
2. Count the frequency of each word in a document. |
|
|
|
#### Example: |
|
Given two sentences: |
|
- "I love NLP." |
|
- "I love programming." |
|
|
|
Vocabulary: ["I", "love", "NLP", "programming"] |
|
- Sentence 1: [1, 1, 1, 0] |
|
- Sentence 2: [1, 1, 0, 1] |
|
|
|
#### Advantages: |
|
- Simple to implement. |
|
|
|
#### Limitations: |
|
- High dimensionality for large vocabularies. |
|
- Does not consider word order or semantic meaning. |
|
- Frequent but uninformative words can dominate the raw counts.
|
|
|
#### Applications: |
|
- Text classification and clustering. |
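
#### Code Sketch:
A minimal sketch using scikit-learn's `CountVectorizer` (assumes scikit-learn is installed; note that it lowercases text and drops one-letter tokens such as "I" by default):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love NLP.", "I love programming."]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # ['love' 'nlp' 'programming']
print(counts.toarray())                    # [[1 1 0], [1 0 1]]
```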
|
""" |
|
) |
|
elif page == "TF-IDF Vectorizer": |
|
st.title("TF-IDF Vectorizer") |
|
st.markdown( |
|
""" |
|
### TF-IDF Vectorizer |
|
|
|
Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure that evaluates the importance of a word in a document relative to a collection of documents (corpus). |
|
|
|
#### Formula: |
|
$$ \\text{TF-IDF} = \\text{TF} \\times \\text{IDF} $$
|
|
|
- **Term Frequency (TF)**: The number of times a term appears in a document, divided by the total number of terms in that document.

- **Inverse Document Frequency (IDF)**: The logarithm of the total number of documents divided by the number of documents containing the term.
|
|
|
#### Advantages: |
|
- Reduces the weight of common words. |
|
- Highlights unique and important words. |
|
|
|
#### Example: |
|
For the corpus: |
|
- Doc1: "NLP is amazing." |
|
- Doc2: "NLP is fun and amazing." |
|
|
|
TF-IDF up-weights "fun", which appears in only one document (IDF = log(2/1) ≈ 0.69), while words like "is" and "amazing" that appear in every document get IDF = log(2/2) = 0 and are suppressed.
|
|
|
#### Applications: |
|
- Search engines, information retrieval, and document classification. |
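
#### Code Sketch:
A minimal sketch with scikit-learn's `TfidfVectorizer` (assumes scikit-learn is installed; it uses a smoothed IDF, so words present in every document are down-weighted rather than zeroed out):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["NLP is amazing.", "NLP is fun and amazing."]
vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # ['amazing' 'and' 'fun' 'is' 'nlp']
print(weights.toarray().round(2))          # "and"/"fun" get the highest weights in Doc2
```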
|
""" |
|
) |
|
elif page == "Word2Vec": |
|
st.title("Word2Vec") |
|
st.markdown( |
|
""" |
|
### Word2Vec |
|
|
|
Word2Vec is a neural network-based technique for creating dense vector representations of words, capturing their semantic relationships. |
|
|
|
#### Key Concepts: |
|
- **CBOW (Continuous Bag of Words)**: Predicts the target word from its context. |
|
- **Skip-gram**: Predicts the context from the target word. |
|
|
|
#### Advantages: |
|
- Captures semantic meaning (e.g., "king" - "man" + "woman" ≈ "queen").
|
- Efficient for large datasets. |
|
|
|
#### Applications: |
|
- Text classification, sentiment analysis, and recommendation systems. |
|
|
|
#### Limitations: |
|
- Requires significant training data and computational resources.

- Cannot produce vectors for words unseen during training (a gap FastText, below, addresses).
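
#### Code Sketch:
A minimal training sketch with gensim (assumes gensim is installed; a real model needs millions of tokens, so results on this toy corpus are noisy):

```python
from gensim.models import Word2Vec

sentences = [["nlp", "is", "fun"], ["nlp", "is", "amazing"], ["i", "love", "nlp"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram
print(model.wv["nlp"][:5])           # first 5 dimensions of the learned vector
print(model.wv.most_similar("nlp"))  # nearest neighbours by cosine similarity
```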
|
""" |
|
) |
|
elif page == "FastText": |
|
st.title("FastText") |
|
st.markdown( |
|
""" |
|
### FastText |
|
|
|
FastText is an extension of Word2Vec that represents each word as the sum of vectors for its character n-grams, so morphologically similar words share information.
|
|
|
#### Advantages: |
|
- Handles rare and out-of-vocabulary words. |
|
- Captures subword information (e.g., prefixes and suffixes). |
|
|
|
#### Example: |
|
The word "playing" might be represented by n-grams like "pla", "lay", "ayi", "ing". |
|
|
|
#### Applications: |
|
- Multilingual text processing. |
|
- Handling noisy and incomplete data. |
|
|
|
#### Limitations: |
|
- Higher computational cost compared to Word2Vec. |
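
#### Code Sketch:
A minimal sketch with gensim's FastText (assumes gensim is installed), showing that even a word never seen in training gets a vector built from its character n-grams:

```python
from gensim.models import FastText

sentences = [["playing", "player", "played"], ["game", "games"]]
model = FastText(sentences, vector_size=50, min_count=1)
print(model.wv["plays"][:5])  # "plays" is out-of-vocabulary, but its n-grams are not
```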
|
""" |
|
) |
|
elif page == "Tokenization": |
|
st.title("Tokenization") |
|
st.markdown( |
|
""" |
|
### Tokenization |
|
|
|
Tokenization is the process of breaking text into smaller units (tokens) such as words, phrases, or sentences. |
|
|
|
#### Types of Tokenization: |
|
- **Word Tokenization**: Splits text into words. |
|
- **Sentence Tokenization**: Splits text into sentences. |
|
|
|
#### Libraries for Tokenization: |
|
- NLTK, spaCy, and Hugging Face Transformers.
|
|
|
#### Example: |
|
Sentence: "NLP is exciting." |
|
- Word Tokens: ["NLP", "is", "exciting", "."] |
|
|
|
#### Applications: |
|
- Preprocessing for machine learning models. |
|
|
|
#### Challenges: |
|
- Handling complex text like abbreviations and multilingual data. |
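
#### Code Sketch:
A minimal sketch with NLTK (assumes NLTK is installed; depending on the NLTK version, the tokenizer data is named "punkt" or "punkt_tab"):

```python
import nltk
nltk.download("punkt", quiet=True)  # one-time tokenizer-model download

text = "NLP is exciting. It powers chatbots."
print(nltk.word_tokenize(text))  # ['NLP', 'is', 'exciting', '.', 'It', ...]
print(nltk.sent_tokenize(text))  # ['NLP is exciting.', 'It powers chatbots.']
```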
|
""" |
|
) |
|
elif page == "Stop Words": |
|
st.title("Stop Words") |
|
st.markdown( |
|
""" |
|
### Stop Words |
|
|
|
Stop words are commonly used words in a language that are often removed during text preprocessing. |
|
|
|
#### Examples of Stop Words: |
|
- English: "is", "the", "and", "in". |
|
- Spanish: "es", "el", "y", "en". |
|
|
|
#### Why Remove Stop Words? |
|
- They occur very frequently but usually carry little standalone meaning, so removing them reduces noise and shrinks the feature space.
|
|
|
#### Applications: |
|
- Sentiment analysis, text classification, and search engines. |
|
|
|
#### Challenges: |
|
- Some stop words carry context-specific importance (e.g., removing "not" can flip the meaning in sentiment analysis).
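
#### Code Sketch:
A minimal filtering sketch using NLTK's stop-word list (assumes NLTK is installed; the word list is downloaded on first use):

```python
import nltk
nltk.download("stopwords", quiet=True)  # one-time word-list download
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))
tokens = ["the", "movie", "was", "not", "great"]
print([t for t in tokens if t not in stop])  # ['movie', 'great'] -- "not" is dropped too
```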
|
""" |
|
) |
|
|
|
|
|
st.sidebar.title("NLP Topics") |
|
menu_options = [ |
|
"Home", |
|
"NLP Terminologies", |
|
"One-Hot Vectorization", |
|
"Bag of Words", |
|
"TF-IDF Vectorizer", |
|
"Word2Vec", |
|
"FastText", |
|
"Tokenization", |
|
"Stop Words", |
|
] |
|
selected_page = st.sidebar.radio("Select a topic", menu_options) |
|
|
|
|
|
if selected_page == "Home": |
|
show_home_page() |
|
else: |
|
show_page(selected_page) |
|
|