import streamlit as st
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Download the required NLTK data (quiet=True keeps the download log out of
# the console on every Streamlit rerun)
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

# Streamlit app configuration
st.set_page_config(page_title="NLP Basics for Beginners", page_icon="πŸ€–", layout="wide")
st.title("πŸ€– NLP Basics for Beginners")
st.markdown(
    """
Welcome to the **NLP Basics App**!  
Here, you'll learn about the foundational concepts of **Natural Language Processing (NLP)** through interactive examples.  
Let's explore:
- **What is NLP?** Its applications and use cases.  
- **Text Representation Basics**: Tokens, sentences, words, stopwords, lemmatization, stemming.  
- **Vectorization Techniques**: Bag of Words (BoW) and TF-IDF.
"""
)

# Divider
st.markdown("---")

# Sidebar Navigation
st.sidebar.title("Navigation")
sections = ["Introduction to NLP", "Tokenization", "Stopwords", "Lemmatization & Stemming", "Bag of Words (BoW)", "TF-IDF"]
selected_section = st.sidebar.radio("Choose a section", sections)

# Input Text Box
st.sidebar.write("### Enter Text to Analyze:")
text_input = st.sidebar.text_area("Input your text here:", height=150, placeholder="Type or paste some text here...")

if not text_input.strip():
    st.sidebar.warning("Please enter some text to explore NLP concepts.")

# Section 1: Introduction to NLP
if selected_section == "Introduction to NLP":
    st.header("πŸ’‘ What is NLP?")
    st.write(
        """
Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) focused on the interaction between computers and human language.  
It enables machines to understand, interpret, and generate human language.

### **Applications of NLP**:
- **Chatbots**: AI-powered conversational agents (e.g., Siri, Alexa).  
- **Text Summarization**: Extracting important information from lengthy documents.  
- **Machine Translation**: Translating text between languages (e.g., Google Translate).  
- **Sentiment Analysis**: Understanding opinions in social media or reviews (positive/negative/neutral).  
"""
    )
    st.image("https://miro.medium.com/max/1400/1*H0qcbsUCWkE7O__q2XkKYA.png", caption="Applications of NLP", use_column_width=True)

# Section 2: Tokenization
if selected_section == "Tokenization":
    st.header("πŸ”€ Tokenization")
    st.write(
        """
**Tokenization** is the process of breaking down text into smaller units, like sentences or words.  
It is a critical first step in many NLP tasks.

### Types of Tokenization:
1. **Sentence Tokenization**: Splitting text into sentences.
2. **Word Tokenization**: Splitting text into individual words (tokens).

**Example Input**: "I love NLP. It's amazing!"  
**Sentence Tokens**: ["I love NLP.", "It's amazing!"]  
**Word Tokens**: ["I", "love", "NLP", ".", "It", "'s", "amazing", "!"]
"""
    )
    if text_input.strip():
        st.subheader("Try Tokenization on Your Input Text")
        st.write("**Sentence Tokenization**:")
        sentences = sent_tokenize(text_input)
        st.write(sentences)

        st.write("**Word Tokenization**:")
        words = word_tokenize(text_input)
        st.write(words)
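
        # Small contrast (an illustrative sketch, not part of the original app):
        # Python's plain str.split() only cuts on whitespace, so punctuation and
        # clitics like "'s" stay attached, unlike NLTK's word_tokenize above.
        st.write("**Naive Whitespace Split (for comparison)**:")
        st.write(text_input.split())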

# Section 3: Stopwords
if selected_section == "Stopwords":
    st.header("πŸ›‘ Stopwords")
    st.write(
        """
**Stopwords** are common words (e.g., "and", "is", "the") that add little meaning to text and are often removed in NLP tasks.

Removing stopwords helps focus on the essential words in a text.  
For example:  
**Input**: "This is an example of stopwords removal."  
**Output**: ["example", "stopwords", "removal"]
"""
    )
    if text_input.strip():
        st.subheader("Remove Stopwords from Your Input Text")
        stop_words = set(stopwords.words("english"))
        words = word_tokenize(text_input)
        filtered_words = [word for word in words if word.lower() not in stop_words]
        st.write("**Original Words**:", words)
        st.write("**Words after Stopwords Removal**:", filtered_words)

# Section 4: Lemmatization & Stemming
if selected_section == "Lemmatization & Stemming":
    st.header("🌱 Lemmatization and Stemming")
    st.write(
        """
### **Stemming**:
Reduces words to their root form by stripping common suffixes; the result is not always a real word.  
**Example**: "running" → "run", "studies" → "studi"  

### **Lemmatization**:
Returns the base (dictionary) form of a word using vocabulary and part-of-speech information.  
**Example**: "running" → "run" (as a verb), "better" → "good" (as an adjective)  

Note: without a part-of-speech hint, NLTK's WordNet lemmatizer treats every token as a noun, so some words (e.g., "running") come back unchanged.
"""
    )
    if text_input.strip():
        st.subheader("Apply Stemming and Lemmatization")
        words = word_tokenize(text_input)

        ps = PorterStemmer()
        stemmed_words = [ps.stem(word) for word in words]
        st.write("**Stemmed Words**:", stemmed_words)

        lemmatizer = WordNetLemmatizer()
        lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
        st.write("**Lemmatized Words**:", lemmatized_words)

# Section 5: Bag of Words (BoW)
if selected_section == "Bag of Words (BoW)":
    st.header("πŸ“¦ Bag of Words (BoW)")
    st.write(
        """
**Bag of Words (BoW)** is a text representation technique that converts text into a vector of word frequencies.  
It ignores word order but considers the occurrence of words.

### Example:
**Input Texts**:  
1. "I love NLP."  
2. "NLP is amazing!"  

**BoW Matrix**:  
|      | I | love | NLP | is | amazing |  
|------|---|------|-----|----|---------|  
| Text1| 1 | 1    | 1   | 0  | 0       |  
| Text2| 0 | 0    | 1   | 1  | 1       |
"""
    )
    if text_input.strip():
        st.subheader("Generate BoW for Your Input Text")
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform([text_input])
        st.write("**BoW Matrix**:")
        st.write(X.toarray())
        st.write("**Feature Names (Words):**")
        st.write(vectorizer.get_feature_names_out())
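
        # Illustrative sketch mirroring the two-sentence example in the text
        # above. Note that CountVectorizer's default tokenizer keeps only
        # tokens of two or more characters, so the single-letter word "I"
        # does not appear as a feature.
        demo_corpus = ["I love NLP.", "NLP is amazing!"]
        demo_vectorizer = CountVectorizer()
        demo_bow = demo_vectorizer.fit_transform(demo_corpus)
        st.write("**BoW for the Two Example Sentences**:")
        st.write(demo_vectorizer.get_feature_names_out())
        st.write(demo_bow.toarray())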

# Section 6: TF-IDF
if selected_section == "TF-IDF":
    st.header("πŸ“Š TF-IDF (Term Frequency-Inverse Document Frequency)")
    st.write(
        """
**TF-IDF** is a statistical measure that evaluates how important a word is to a document in a collection of documents.  
It balances the frequency of a word with its rarity across documents.

### Formula:
- **Term Frequency (TF)**: How often a word appears in a document.
- **Inverse Document Frequency (IDF)**: The logarithm of the total number of documents divided by the number of documents that contain the word.
- **TF-IDF(t, d) = TF(t, d) × IDF(t)**: a word scores high when it appears often in one document but rarely across the collection.

**Example**:
- "NLP is amazing."
- "I love NLP."

TF-IDF assigns higher weights to rare but significant words.
"""
    )
    if text_input.strip():
        st.subheader("Generate TF-IDF for Your Input Text")
        tfidf_vectorizer = TfidfVectorizer()
        tfidf_matrix = tfidf_vectorizer.fit_transform([text_input])
        st.write("**TF-IDF Matrix**:")
        st.write(tfidf_matrix.toarray())
        st.write("**Feature Names (Words):**")
        st.write(tfidf_vectorizer.get_feature_names_out())
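
        # Illustrative sketch: with a single input document every term gets the
        # same IDF, so the weights mostly reflect term frequency. Fitting on the
        # two example sentences from the text above shows how a shared word
        # ("nlp") is down-weighted relative to words unique to one document.
        demo_docs = ["NLP is amazing.", "I love NLP."]
        demo_tfidf = TfidfVectorizer()
        demo_matrix = demo_tfidf.fit_transform(demo_docs)
        st.write("**TF-IDF for the Two Example Sentences**:")
        st.write(demo_tfidf.get_feature_names_out())
        st.write(demo_matrix.toarray().round(3))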

# Footer
st.markdown("---")
st.markdown(
    """
<center>
    <p style='font-size:14px;'>Β© 2024 NLP Basics App. All Rights Reserved.</p>
</center>
""",
    unsafe_allow_html=True,
)