import streamlit as st
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Download required NLTK data (quiet=True keeps the app logs clean)
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)  # multilingual wordnet data; needed by the lemmatizer in newer NLTK releases
# Streamlit app configuration
st.set_page_config(page_title="NLP Basics for Beginners", page_icon="🤖", layout="wide")
st.title("🤖 NLP Basics for Beginners")
st.markdown(
"""
Welcome to the **NLP Basics App**!
Here, you'll learn about the foundational concepts of **Natural Language Processing (NLP)** through interactive examples.
Let's explore:
- **What is NLP?** Its applications and use cases.
- **Text Representation Basics**: Tokens, sentences, words, stopwords, lemmatization, stemming.
- **Vectorization Techniques**: Bag of Words (BoW) and TF-IDF.
"""
)
# Divider
st.markdown("---")
# Sidebar Navigation
st.sidebar.title("Navigation")
sections = ["Introduction to NLP", "Tokenization", "Stopwords", "Lemmatization & Stemming", "Bag of Words (BoW)", "TF-IDF"]
selected_section = st.sidebar.radio("Choose a section", sections)
# Input Text Box
st.sidebar.write("### Enter Text to Analyze:")
text_input = st.sidebar.text_area("Input your text here:", height=150, placeholder="Type or paste some text here...")
if not text_input.strip():
    st.sidebar.warning("Please enter some text to explore NLP concepts.")
# Section 1: Introduction to NLP
if selected_section == "Introduction to NLP":
    st.header("💡 What is NLP?")
    st.write(
        """
Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) focused on the interaction between computers and human language.
It enables machines to understand, interpret, and generate human language.

### **Applications of NLP**:
- **Chatbots**: AI-powered conversational agents (e.g., Siri, Alexa).
- **Text Summarization**: Extracting the important information from lengthy documents.
- **Machine Translation**: Translating text between languages (e.g., Google Translate).
- **Sentiment Analysis**: Classifying opinions in social media posts or reviews as positive, negative, or neutral.
"""
    )
    st.image("https://miro.medium.com/max/1400/1*H0qcbsUCWkE7O__q2XkKYA.png", caption="Applications of NLP", use_container_width=True)
# Section 2: Tokenization
if selected_section == "Tokenization":
    st.header("🔤 Tokenization")
    st.write(
        """
**Tokenization** is the process of breaking text down into smaller units, such as sentences or words.
It is a critical first step in many NLP tasks.

### Types of Tokenization:
1. **Sentence Tokenization**: Splitting text into sentences.
2. **Word Tokenization**: Splitting text into individual words (tokens).

**Example Input**: "I love NLP. It's amazing!"

**Sentence Tokens**: ["I love NLP.", "It's amazing!"]

**Word Tokens**: ["I", "love", "NLP", ".", "It", "'s", "amazing", "!"]
"""
    )
    if text_input.strip():
        st.subheader("Try Tokenization on Your Input Text")
        st.write("**Sentence Tokenization**:")
        sentences = sent_tokenize(text_input)
        st.write(sentences)
        st.write("**Word Tokenization**:")
        words = word_tokenize(text_input)
        st.write(words)
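# Illustration only: the core idea of word tokenization, sketched with one regex.
# `toy_word_tokenize` is a hypothetical helper, not part of the app above; the
# real nltk.word_tokenize is smarter (e.g. it keeps contractions like "'s" as one token).
import re

def toy_word_tokenize(text):
    # \w+ matches runs of word characters; [^\w\s] matches each punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)
# toy_word_tokenize("I love NLP. It's amazing!")
# -> ['I', 'love', 'NLP', '.', 'It', "'", 's', 'amazing', '!']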
# Section 3: Stopwords
if selected_section == "Stopwords":
    st.header("🛑 Stopwords")
    st.write(
        """
**Stopwords** are common words (e.g., "and", "is", "the") that add little meaning to text and are often removed in NLP tasks.
Removing stopwords helps focus on the essential words in a text.

For example:

**Input**: "This is an example of stopwords removal."

**Output**: ["example", "stopwords", "removal"]
"""
    )
    if text_input.strip():
        st.subheader("Remove Stopwords from Your Input Text")
        stop_words = set(stopwords.words("english"))
        words = word_tokenize(text_input)
        filtered_words = [word for word in words if word.lower() not in stop_words]
        st.write("**Original Words**:", words)
        st.write("**Words after Stopwords Removal**:", filtered_words)
# Section 4: Lemmatization & Stemming
if selected_section == "Lemmatization & Stemming":
    st.header("🌱 Lemmatization and Stemming")
    st.write(
        """
### **Stemming**:
Reduces words to their root form by chopping off prefixes/suffixes.

**Example**: "running" → "run", "studies" → "studi"

### **Lemmatization**:
Returns the base (dictionary) form of a word using vocabulary and context.

**Example**: "running" → "run", "better" → "good" (when treated as an adjective)
"""
    )
    if text_input.strip():
        st.subheader("Apply Stemming and Lemmatization")
        words = word_tokenize(text_input)
        ps = PorterStemmer()
        stemmed_words = [ps.stem(word) for word in words]
        st.write("**Stemmed Words**:", stemmed_words)
        lemmatizer = WordNetLemmatizer()
        # Note: lemmatize() treats words as nouns by default; pass pos="v", pos="a", etc. for other parts of speech
        lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
        st.write("**Lemmatized Words**:", lemmatized_words)
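# Illustration only: `toy_stem` is a hypothetical helper showing the
# "chop off suffixes" idea; the real PorterStemmer applies ordered phases of rules.
def toy_stem(word):
    for suffix in ("ing", "ies", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            if suffix == "ies":
                return word[:-3] + "i"  # "studies" -> "studi", matching Porter
            stem = word[: -len(suffix)]
            if len(stem) > 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]  # drop the doubled consonant: "runn" -> "run"
            return stem
    return word
# toy_stem("running") -> "run"; toy_stem("studies") -> "studi"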
# Section 5: Bag of Words (BoW)
if selected_section == "Bag of Words (BoW)":
    st.header("📦 Bag of Words (BoW)")
    st.write(
        """
**Bag of Words (BoW)** is a text representation technique that converts text into a vector of word frequencies.
It ignores word order but counts how often each word occurs.

### Example:
**Input Texts**:
1. "I love NLP."
2. "NLP is amazing!"

**BoW Matrix**:

| | I | love | NLP | is | amazing |
|------|---|------|-----|----|---------|
| Text1| 1 | 1 | 1 | 0 | 0 |
| Text2| 0 | 0 | 1 | 1 | 1 |
"""
    )
    if text_input.strip():
        st.subheader("Generate BoW for Your Input Text")
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform([text_input])
        st.write("**BoW Matrix**:")
        st.write(X.toarray())
        st.write("**Feature Names (Words):**")
        st.write(vectorizer.get_feature_names_out())
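# Illustration only: the BoW idea in plain Python (`toy_bow` is a hypothetical
# helper). CountVectorizer above does the same counting but also strips
# punctuation, ignores single-character tokens by default, and returns a
# sparse matrix instead of plain lists.
from collections import Counter

def toy_bow(docs):
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    counts = [Counter(doc) for doc in tokenized]
    return vocab, [[c[word] for word in vocab] for c in counts]
# toy_bow(["I love NLP", "NLP is amazing"])
# -> (['amazing', 'i', 'is', 'love', 'nlp'], [[0, 1, 0, 1, 1], [1, 0, 1, 0, 1]])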
# Section 6: TF-IDF
if selected_section == "TF-IDF":
    st.header("📊 TF-IDF (Term Frequency-Inverse Document Frequency)")
    st.write(
        """
**TF-IDF** is a statistical measure of how important a word is to a document within a collection of documents.
It balances the frequency of a word in a document against its rarity across the collection.

### Formula:
- **Term Frequency (TF)**: How often a word appears in a document.
- **Inverse Document Frequency (IDF)**: The log of the total number of documents divided by the number of documents containing the word.

**Example**:
- "NLP is amazing."
- "I love NLP."

TF-IDF assigns higher weights to rare but significant words, and lower weights to words that appear in every document (like "NLP" here).
"""
    )
    if text_input.strip():
        st.subheader("Generate TF-IDF for Your Input Text")
        tfidf_vectorizer = TfidfVectorizer()
        tfidf_matrix = tfidf_vectorizer.fit_transform([text_input])
        st.write("**TF-IDF Matrix**:")
        st.write(tfidf_matrix.toarray())
        st.write("**Feature Names (Words):**")
        st.write(tfidf_vectorizer.get_feature_names_out())
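# Illustration only: TF-IDF from the textbook formula tf * log(N / df)
# (`toy_tfidf` is a hypothetical helper). Note that scikit-learn's TfidfVectorizer
# uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, and L2-normalises each row,
# so its numbers differ from this plain version.
import math

def toy_tfidf(docs):
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({word for doc in tokenized for word in doc})
    df = {word: sum(word in doc for doc in tokenized) for word in vocab}
    return vocab, [
        [doc.count(word) / len(doc) * math.log(n_docs / df[word]) for word in vocab]
        for doc in tokenized
    ]
# A word appearing in every document (like "nlp" below) gets weight 0: log(2/2) = 0.
# toy_tfidf(["nlp is amazing", "i love nlp"])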
# Footer
st.markdown("---")
st.markdown(
"""
<center>
<p style='font-size:14px;'>© 2024 NLP Basics App. All Rights Reserved.</p>
</center>
""",
unsafe_allow_html=True,
)