pranayreddy316 committed
Commit 5905397 · verified · 1 Parent(s): e31e08c

Upload The NLP_Steps.py

Files changed (1)
  1. pages/The NLP_Steps.py +382 -0
pages/The NLP_Steps.py ADDED
@@ -0,0 +1,382 @@
import streamlit as st
import pandas as pd
import re


def main_page():
    # Title of the app
    st.title("Important Steps in an NLP Project")

    # Introduction
    st.write("""
    In our **ZERO TO HERO IN ML** app, we have already learned about the first two steps of an NLP project:
    1. **Problem Statement**
    2. **Data Collection**

    On this page, we will explore the next three main steps specific to an NLP project. These steps are essential for processing and understanding text data.
    """)

    # Highlight the steps
    st.header("Three Main Steps in an NLP Project")

    # Step 1: Simple EDA of Text
    st.subheader("1. Simple EDA of Text")
    st.write("""
    **Exploratory Data Analysis (EDA)** helps you understand the structure and quality of the text data.
    Some key actions in EDA for text include:
    - Checking for missing values
    - Examining data distribution
    - Identifying patterns like URLs, mentions (@, #), and numeric data
    - Understanding the case format and punctuation
    - Spotting special characters, HTML/XML tags, and emojis
    """)
    if st.button("Know More About Simple EDA"):
        st.session_state.page = "simple_eda_app"
        st.rerun()  # rerun immediately so the navigation logic at the bottom shows the new page

    st.markdown("---")

    # Step 2: Pre-Processing of Text
    st.subheader("2. Pre-Processing of Text")
    st.write("""
    **Pre-processing** prepares the raw text data for analysis by:
    - Converting text to lowercase (Case Normalization)
    - Removing special characters, punctuation, and numbers
    - Eliminating stopwords (e.g., "the", "and", "in")
    - Expanding contractions (e.g., "can't" to "cannot")
    - Handling URLs, emails, mentions, and hashtags
    - Using Stemming or Lemmatization to reduce words to their base forms
    - Converting emojis into textual descriptions or removing them
    """)
    if st.button("Know More About Pre-Processing"):
        st.session_state.page = "pre_processing"
        st.rerun()  # rerun immediately so the navigation logic at the bottom shows the new page

    st.markdown("---")

    # Step 3: Feature Engineering of Text
    st.subheader("3. Feature Engineering of Text")
    st.write("""
    **Feature Engineering** involves extracting meaningful features from text data, such as:
    - **Bag of Words (BoW)**: Converting text to word counts
    - **TF-IDF (Term Frequency-Inverse Document Frequency)**: Highlighting important terms
    - **Word Embeddings**: Representing words as numerical vectors (e.g., Word2Vec, GloVe, FastText)
    - **N-grams**: Generating word sequences for richer context
    - **Custom Features**: Length of the text, sentiment scores, and more (see the sketch below)
    """)

    st.markdown("---")

    # Note
    st.markdown("""
    **Note:** These three steps are explained in the context of NLP projects that primarily deal with **text data**.
    - Do not confuse them with the general roadmap of a machine learning project; they are tailored to NLP-specific tasks.
    """)


# Define the main EDA function
def simple_eda_app():
    # Title and Introduction
    st.title("Simple EDA for Text Data in NLP")
    st.write("""
    This page demonstrates the steps involved in Simple EDA (Exploratory Data Analysis) for text data.
    These steps help assess the quality and structure of the collected text data, which is crucial for successful preprocessing and NLP projects.
    """)

    # Sample dataset
    data = pd.DataFrame({
        "Review": [
            "I ❤️ programming with Python",
            "Contact us at support@python.org",
            "Debugging <i>errors</i> is tedious",
            "@John loves Python",
            "AI has grown exponentially in 2023",
            "Visit https://www.github.com/",
            "Coding is fun!",
            "Learning AI is exciting",
            "Learn AI in 12/05/2023"
        ]
    })

    # Display dataset
    st.write("Below is the sample dataset we will use:")
    st.dataframe(data)

    # Step selection dropdown
    selected_option = st.selectbox(
        "Choose a step to explore:",
        [
            "Introduction to Simple EDA",
            "Check Case Format",
            "Detect HTML/XML Tags",
            "Detect Mentions (@, #)",
            "Detect Numeric Data",
            "Detect URLs",
            "Detect Punctuation & Special Characters",
            "Detect Emojis (Code Only)",
            "Detect Dates and Times",
            "Detect Emails"
        ]
    )

    # Perform actions based on the selected option
    if selected_option == "Introduction to Simple EDA":
        st.header("Introduction to Simple EDA for Text Data")
        st.write("""
        Exploratory Data Analysis (EDA) for text data helps examine, visualize, and summarize unstructured datasets.
        These analyses reveal patterns, outliers, and inconsistencies, enabling better preprocessing and model accuracy.
        """)

    elif selected_option == "Check Case Format":
        st.header("Step 1: Check Case Format")
        code = """
data["Case Format"] = data["Review"].apply(
    lambda x: "Lower/Upper" if x.islower() or x.isupper() else "Mixed"
)
"""
        st.code(code, language="python")
        data["Case Format"] = data["Review"].apply(
            lambda x: "Lower/Upper" if x.islower() or x.isupper() else "Mixed"
        )
        st.write("Identified case formats in the dataset:")
        st.dataframe(data)

    elif selected_option == "Detect HTML/XML Tags":
        st.header("Step 2: Detect HTML/XML Tags")
        code = """
data["Contains Tags"] = data["Review"].apply(lambda x: bool(re.search(r"<.*?>", x)))
"""
        st.code(code, language="python")
        data["Contains Tags"] = data["Review"].apply(lambda x: bool(re.search(r"<.*?>", x)))
        st.write("Rows with HTML/XML tags detected:")
        st.dataframe(data)

    elif selected_option == "Detect Mentions (@, #)":
        st.header("Step 3: Detect Mentions (@, #)")
        code = """
data["Contains Mentions"] = data["Review"].apply(lambda x: bool(re.search(r"\\B[@#]\\S+", x)))
"""
        st.code(code, language="python")
        data["Contains Mentions"] = data["Review"].apply(lambda x: bool(re.search(r"\B[@#]\S+", x)))
        st.write("Rows with mentions identified:")
        st.dataframe(data)

    elif selected_option == "Detect Numeric Data":
        st.header("Step 4: Detect Numeric Data")
        code = """
data["Contains Numeric"] = data["Review"].apply(lambda x: bool(re.search(r"\\d+", x)))
"""
        st.code(code, language="python")
        data["Contains Numeric"] = data["Review"].apply(lambda x: bool(re.search(r"\d+", x)))
        st.write("Rows containing numeric data:")
        st.dataframe(data)

    elif selected_option == "Detect URLs":
        st.header("Step 5: Detect URLs")
        code = """
data["Contains URL"] = data["Review"].apply(lambda x: bool(re.search(r"https?://\\S+", x)))
"""
        st.code(code, language="python")
        data["Contains URL"] = data["Review"].apply(lambda x: bool(re.search(r"https?://\S+", x)))
        st.write("Rows containing URLs:")
        st.dataframe(data)

    elif selected_option == "Detect Punctuation & Special Characters":
        st.header("Step 6: Detect Punctuation & Special Characters")
        code = """
data["Contains Punctuation"] = data["Review"].apply(
    lambda x: bool(re.search(r'[!"#$%&\\'()*+,-./:;<=>?@[\\]^_`{|}~]', x))
)
"""
        st.code(code, language="python")
        data["Contains Punctuation"] = data["Review"].apply(
            lambda x: bool(re.search(r'[!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~]', x))
        )
        st.write("Rows with punctuation or special characters identified:")
        st.dataframe(data)

    elif selected_option == "Detect Emojis (Code Only)":
        st.header("Step 7: Detect Emojis (Code Only)")
        st.write("""
        Here is the code for detecting emojis in text data using the third-party `emoji` package:
        """)
        code = """
import emoji

data["Contains Emojis"] = data["Review"].apply(lambda x: bool(emoji.emoji_count(x)))
"""
        st.code(code, language="python")
        st.write("Emojis add meaning and emotion to text. Handle them based on your project needs.")

    elif selected_option == "Detect Dates and Times":
        st.header("Step 8: Detect Dates and Times")
        code = """
data["Contains Date/Time"] = data["Review"].apply(
    lambda x: bool(re.search(r"\\d{1,2}/\\d{1,2}/\\d{4}|\\d{4}/\\d{1,2}/\\d{1,2}", x))
)
"""
        st.code(code, language="python")
        data["Contains Date/Time"] = data["Review"].apply(
            lambda x: bool(re.search(r"\d{1,2}/\d{1,2}/\d{4}|\d{4}/\d{1,2}/\d{1,2}", x))
        )
        st.write("Rows with date and time information detected:")
        st.dataframe(data)

    elif selected_option == "Detect Emails":
        st.header("Step 9: Detect Emails")
        code = """
data["Contains Email"] = data["Review"].apply(lambda x: bool(re.search(r"\\S+@\\S+", x)))
"""
        st.code(code, language="python")
        data["Contains Email"] = data["Review"].apply(lambda x: bool(re.search(r"\S+@\S+", x)))
        st.write("Rows containing emails:")
        st.dataframe(data)


def preprocessing():
    # Set up the Streamlit layout
    st.title("Text Preprocessing in NLP")
    st.write("""
    Preprocessing in Natural Language Processing (NLP) transforms raw, unstructured text data
    into a clean format suitable for modeling. The following steps help standardize the data,
    remove unwanted elements, and extract meaningful information.
    """)

    # Example Data
    data = pd.DataFrame({
        "Review": [
            "I love Hyderabad Biryani!",
            "I hate other places Biryani.",
            "I like the Cooking process! 😊",
            "Follow us on #Instagram @foodies. http://example.com"
        ]
    })

    st.subheader("Original Data:")
    st.dataframe(data)

    # Step 1: Case Normalization
    st.subheader("Step 1: Case Normalization")
    st.write("Convert all text to lowercase to ensure consistency.")
    st.code("""
data['Review'] = data['Review'].str.lower()
""")
    data["Review"] = data["Review"].str.lower()
    st.write("Updated Data (Lowercase Text):")
    st.dataframe(data)

    st.markdown("---")

    # Step 2: Removing Noise (HTML tags, URLs, emails, mentions/hashtags)
    st.subheader("Step 2: Removing Noise")
    st.write("Remove unwanted special characters, HTML/XML tags, URLs, email addresses, mentions, and hashtags.")
    st.code("""
# Removing HTML tags
data['Review'] = data['Review'].apply(lambda x: re.sub(r'<.*?>', ' ', x))

# Removing URLs
data['Review'] = data['Review'].apply(lambda x: re.sub(r'https?://\\S+', ' ', x))

# Removing Emails
data['Review'] = data['Review'].apply(lambda x: re.sub(r'\\S+@\\S+', ' ', x))

# Removing Mentions and Hashtags
data['Review'] = data['Review'].apply(lambda x: re.sub(r'\\B[@#]\\S+', ' ', x))
""")
    data["Review"] = data["Review"].apply(lambda x: re.sub(r'<.*?>', ' ', x))
    data["Review"] = data["Review"].apply(lambda x: re.sub(r'https?://\S+', ' ', x))
    data["Review"] = data["Review"].apply(lambda x: re.sub(r'\S+@\S+', ' ', x))
    data["Review"] = data["Review"].apply(lambda x: re.sub(r'\B[@#]\S+', ' ', x))
    st.write("Updated Data (After Noise Removal):")
    st.dataframe(data)

    st.markdown("---")

    # Step 3: Emoji Handling
    st.subheader("Step 3: Emoji Handling")
    st.write("Convert emojis to descriptive text or remove them.")
    st.code("""
# Option A (needs the third-party `emoji` package): convert emojis to descriptive text
data['Review'] = data['Review'].apply(lambda x: emoji.demojize(x, delimiters=(' ', ' ')))

# Option B (no extra dependency): replace non-ASCII characters with the placeholder 'EMOJI'
data['Review'] = data['Review'].apply(lambda x: re.sub(r'[^\\x00-\\x7F]+', 'EMOJI', x))
""")
    # Option B is what actually runs here, so this page needs no `emoji` dependency
    data["Review"] = data["Review"].apply(lambda x: re.sub(r'[^\x00-\x7F]+', 'EMOJI', x))
    st.write("Updated Data (After Emoji Handling):")
    st.dataframe(data)

    st.markdown("---")

    # Step 4: Removing Stopwords (without NLTK)
    st.subheader("Step 4: Removing Stopwords")
    st.write("Remove common words like 'and' or 'is' that add little meaning.")
    st.code("""
stopwords = ["and", "the", "is", "in", "to", "for", "on"]
data['Review'] = data['Review'].apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords]))
""")
    stopwords = ["and", "the", "is", "in", "to", "for", "on"]  # Example stopwords list
    data["Review"] = data["Review"].apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords]))
    st.write("Updated Data (After Stopwords Removal):")
    st.dataframe(data)

    st.markdown("---")

    # Step 5: Removing Punctuation and Digits
    st.subheader("Step 5: Removing Punctuation and Digits")
    st.write("Remove punctuation marks and digits when they carry no meaning.")
    st.code("""
# Removing Punctuation
data['Review'] = data['Review'].apply(lambda x: re.sub(r'[^\\w\\s]', ' ', x))

# Removing Digits
data['Review'] = data['Review'].apply(lambda x: re.sub(r'\\d+', '', x))
""")
    data["Review"] = data["Review"].apply(lambda x: re.sub(r'[^\w\s]', ' ', x))
    data["Review"] = data["Review"].apply(lambda x: re.sub(r'\d+', '', x))
    st.write("Updated Data (After Removing Punctuation and Digits):")
    st.dataframe(data)

    st.markdown("---")

    # Step 6: Fixing Contractions
    # Note: in a real pipeline, expand contractions *before* removing
    # punctuation (Step 5), or the apostrophes are already gone by this point.
    st.subheader("Step 6: Fixing Contractions")
    st.write("Expand contractions like 'can't' to 'cannot'.")
    st.code("""
contractions_dict = {"can't": "cannot", "won't": "will not", "i'm": "i am", "you're": "you are"}
data['Review'] = data['Review'].apply(lambda x: ' '.join([contractions_dict.get(word, word) for word in x.split()]))
""")
    # Keys are lowercase because the text was lowercased in Step 1
    contractions_dict = {"can't": "cannot", "won't": "will not", "i'm": "i am", "you're": "you are"}  # Example contraction dictionary
    data["Review"] = data["Review"].apply(lambda x: ' '.join([contractions_dict.get(word, word) for word in x.split()]))
    st.write("Updated Data (After Fixing Contractions):")
    st.dataframe(data)

    st.markdown("---")

    # Step 7: Handling Dates and Times
    # Note: on this sample the slashes were already stripped in Step 5, so the
    # pattern below is shown for completeness and matches nothing here.
    st.subheader("Step 7: Handling Dates and Times")
    st.write("Standardize dates and times into a uniform format.")
    st.code("""
# Example: Replacing date-like patterns with 'DATE'
data['Review'] = data['Review'].apply(lambda x: re.sub(r'\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b', 'DATE', x))
""")
    data["Review"] = data["Review"].apply(lambda x: re.sub(r'\b\d{1,2}/\d{1,2}/\d{4}\b', 'DATE', x))
    st.write("Updated Data (After Handling Dates and Times):")
    st.dataframe(data)

    st.markdown("---")

    # Display the final cleaned data
    st.subheader("Final Cleaned Data:")
    st.dataframe(data)


# Initialize the page state on the first run
if 'page' not in st.session_state:
    st.session_state.page = 'main'

# Navigation logic
if st.session_state.page == 'main':
    main_page()
elif st.session_state.page == 'simple_eda_app':
    simple_eda_app()
elif st.session_state.page == 'pre_processing':
    preprocessing()
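
# To try this page on its own locally (assuming Streamlit is installed):
#   streamlit run "pages/The NLP_Steps.py"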