import streamlit as st
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
from gensim.models import Word2Vec

st.title(":red[Introduction to NLP]")

st.header(":blue[What is NLP?]")
st.write("""
Natural Language Processing (NLP) is a subfield of artificial intelligence that enables computers to process, understand, and generate human language.

### Applications of NLP:
- **Chatbots & Virtual Assistants** (e.g., Siri, Alexa)
- **Sentiment Analysis** (e.g., product reviews, social media monitoring)
- **Machine Translation** (e.g., Google Translate)
- **Text Summarization** (e.g., news article summaries)
- **Speech Recognition** (e.g., voice commands)
""")

st.header(":blue[NLP Terminologies]")
st.write("""
- **Corpus**: A collection of text documents used for NLP tasks.
- **Tokenization**: Splitting text into individual words or phrases.
- **Stop Words**: Common words (e.g., "the", "is") that are often removed.
- **Stemming**: Reducing words to their base form (e.g., "running" → "run").
- **Lemmatization**: More advanced than stemming; it converts words to their dictionary form.
- **Named Entity Recognition (NER)**: Identifies entities like names, dates, and locations.
- **Sentiment Analysis**: Determines the sentiment (positive, negative, neutral) of a text.
- **n-grams**: Sequences of 'n' consecutive words (e.g., "New York" is a bigram).
""")

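# Example (illustrative sketch): tokenization and n-grams on a made-up sentence.
# The naive whitespace tokenizer below is a simplification; real pipelines
# usually use a library tokenizer (e.g., from NLTK or spaCy).
import streamlit as st  # already imported at the top; repeated so the example is self-contained

sample_text = "New York is a big city"
sample_tokens = sample_text.lower().split()
sample_bigrams = [" ".join(pair) for pair in zip(sample_tokens, sample_tokens[1:])]
with st.expander("Example: tokenization and bigrams"):
    st.write("Text:", sample_text)
    st.write("Tokens:", sample_tokens)
    st.write("Bigrams:", sample_bigrams)
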
st.header(":blue[Text Representation Methods]")
methods = [
    "Bag of Words",
    "TF-IDF",
    "One-Hot Encoding",
    "Word Embeddings (Word2Vec)"
]
selected_method = st.radio("Select a text representation method:", methods)

if selected_method == "Bag of Words":
    st.subheader(":blue[Bag of Words (BoW)]")
    st.write("""
**Definition**: Bag of Words (BoW) is a simple text representation technique that converts text into numerical data by counting the occurrence of each word in a document. It ignores grammar, word order, and context.

**How it works**:
- Each unique word in the dataset becomes a feature.
- Each document is converted into a vector of word counts.
- The more often a word appears in a document, the higher its count.

**Uses**:
- Sentiment analysis
- Document classification
- Spam detection
- Information retrieval

**Advantages**:

✅ Simple and easy to implement

✅ Works well with traditional machine learning models

**Disadvantages**:

❌ Ignores word order and meaning

❌ High dimensionality for large vocabularies

❌ Cannot differentiate between synonyms (e.g., "happy" and "joyful")
""")

elif selected_method == "TF-IDF":
    st.subheader(":blue[Term Frequency-Inverse Document Frequency (TF-IDF)]")
    st.write("""
**Definition**: TF-IDF is a refinement of Bag of Words that weights words by how frequently they appear in a document while reducing the weight of words that are common across all documents.

**How it works**:
- **Term Frequency (TF)**: Measures how often a word appears in a document.
- **Inverse Document Frequency (IDF)**: Reduces the weight of words that appear in many documents.
- The final score is calculated as: **TF × IDF**.

**Uses**:
- Information retrieval (e.g., search engines)
- Text classification
- Keyword extraction
- Document similarity detection

**Advantages**:

✅ Reduces the impact of common words like "the", "is", etc.

✅ Highlights important words in a document

✅ Better than BoW at capturing relevance

**Disadvantages**:

❌ Still ignores word order

❌ Cannot capture deep semantic meaning

❌ Computationally expensive for very large datasets
""")

elif selected_method == "One-Hot Encoding":
    st.subheader(":blue[One-Hot Encoding]")
    st.write("""
**Definition**: One-hot encoding is a simple representation method where each unique word in a vocabulary is represented as a binary vector.

**How it works**:
- Each word is assigned a unique index in the vocabulary.
- A word is represented as a vector of zeros with a single 1 at that word's index.
- For example, if the vocabulary is ["NLP", "is", "great"], then "NLP" is represented as **[1, 0, 0]**.

**Uses**:
- Simple NLP tasks
- Word-level feature engineering
- Early-stage text processing in machine learning models

**Advantages**:

✅ Simple and easy to understand

✅ Works well for small vocabularies

**Disadvantages**:

❌ Inefficient for large vocabularies (produces very sparse vectors)

❌ Does not capture word meaning or relationships
""")

elif selected_method == "Word Embeddings (Word2Vec)":
    st.subheader(":blue[Word Embeddings (Word2Vec)]")
    st.write("""
**Definition**: Word embeddings convert words into dense numerical vectors that capture semantic meaning. Unlike BoW and TF-IDF, word embeddings preserve relationships between words.

**How it works**:
- Each word is represented as a dense vector (typically 100-300 dimensions).
- Words with similar meanings have vectors that are close together.
- Word2Vec is trained with one of two architectures: **CBOW (Continuous Bag of Words)** or **Skip-gram**.

**Uses**:
- Machine translation
- Speech recognition
- Sentiment analysis
- Document clustering

**Advantages**:

✅ Captures semantic relationships between words

✅ Works well with deep learning models

✅ Can capture analogies (e.g., "king" - "man" + "woman" ≈ "queen")

**Disadvantages**:

❌ Requires large datasets to train

❌ Computationally expensive

❌ Needs domain-specific tuning for best performance
""")

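# Example (illustrative sketch): a Bag-of-Words matrix for a toy corpus using
# scikit-learn's CountVectorizer. The two-document corpus is a made-up example.
# Placed after the if/elif chain, collapsed in an expander, so it stays a
# standalone top-level block.
import streamlit as st  # already imported at the top; repeated so the example is self-contained
from sklearn.feature_extraction.text import CountVectorizer

bow_corpus = ["NLP is fun", "NLP makes machines understand language"]
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(bow_corpus).toarray()
with st.expander("Example: Bag of Words with CountVectorizer"):
    st.write("Corpus:", bow_corpus)
    st.write("Vocabulary:", list(bow_vectorizer.get_feature_names_out()))
    st.write("Count matrix (one row per document):", bow_matrix)
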
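# Example (illustrative sketch): TF-IDF weights for the same kind of toy corpus
# using scikit-learn's TfidfVectorizer. Note how "nlp", which appears in both
# documents, gets a lower weight than words unique to one document.
import streamlit as st  # already imported at the top; repeated so the example is self-contained
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_corpus = ["NLP is fun", "NLP makes machines understand language"]
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(tfidf_corpus).toarray()
with st.expander("Example: TF-IDF with TfidfVectorizer"):
    st.write("Corpus:", tfidf_corpus)
    st.write("Vocabulary:", list(tfidf_vectorizer.get_feature_names_out()))
    st.write("TF-IDF matrix (one row per document):", tfidf_matrix)
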
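# Example (illustrative sketch): one-hot vectors for the three-word vocabulary
# used in the explanation above ("NLP", "is", "great"), built with NumPy.
import streamlit as st  # already imported at the top; repeated so the example is self-contained
import numpy as np

onehot_vocab = ["NLP", "is", "great"]
onehot_vectors = np.eye(len(onehot_vocab), dtype=int)  # row i is the one-hot vector for word i
with st.expander("Example: One-Hot Encoding with NumPy"):
    for word, vec in zip(onehot_vocab, onehot_vectors):
        st.write(f'"{word}" → {vec.tolist()}')
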
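# Example (illustrative sketch): training a tiny Word2Vec model with gensim on a
# made-up corpus. Real models need far more data; vector_size=50 and the other
# parameters are arbitrary choices for this demo.
import streamlit as st  # already imported at the top; repeated so the example is self-contained
from gensim.models import Word2Vec

w2v_sentences = [
    ["nlp", "is", "fun"],
    ["nlp", "makes", "machines", "understand", "language"],
    ["machines", "learn", "language", "from", "text"],
]
w2v_model = Word2Vec(sentences=w2v_sentences, vector_size=50, window=2, min_count=1, seed=42, workers=1)
with st.expander("Example: Word2Vec with gensim"):
    st.write("Vector for 'nlp' (first 10 of 50 dimensions):", w2v_model.wv["nlp"][:10])
    st.write("Words most similar to 'nlp':", w2v_model.wv.most_similar("nlp", topn=3))
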
st.write("---")
st.write("Developed with ❤️ using Streamlit for NLP enthusiasts.")