pranayreddy316 committed
Commit 43af36a · verified · 1 parent: 888390e

Upload The NLP Basic_Terminologies.py

Files changed (1): pages/The NLP Basic_Terminologies.py (added, +119 lines)
import streamlit as st

# Streamlit App Title and Introduction
st.title("Basic Terminology in NLP")

st.write(
    """
    Before diving deep into the concepts of NLP, it's crucial to understand the basic terminologies frequently used in this domain.
    These terms lay the foundation for exploring more advanced NLP topics.
    """
)

# Section: Key Terminologies in NLP
st.header("1. Key Terminologies in NLP")
st.write(
    """
    - **Corpus**: A collection of text documents.
      Example: {d1, d2, d3, ...}
    - **Document**: A single unit of text (e.g., a sentence, paragraph, or article).
    - **Paragraph**: A collection of sentences.
    - **Sentence**: A collection of words forming a meaningful expression.
    - **Word**: A collection of characters.
    - **Character**: A basic unit such as a letter, digit, or special symbol.
    """
)
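The corpus → document → word → character hierarchy above can be sketched in plain Python (the two-document corpus is illustrative):

```python
corpus = ["I love biryani.", "I love chocolate."]  # corpus: a collection of documents
document = corpus[0]                  # document: one unit of text
words = document.rstrip(".").split()  # word tokens of the document
characters = list(words[-1])          # characters of a single word

print(words)       # ['I', 'love', 'biryani']
print(characters)  # ['b', 'i', 'r', 'y', 'a', 'n', 'i']
```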

# Section: Tokenization
st.header("2. Tokenization")
st.write(
    """
    Tokenization is the process of splitting text into smaller units, called tokens.

    Types of Tokenization:
    - **Sentence Tokenization**: Splitting text into sentences.
      Example: "I love ice-cream. I love chocolate." → ["I love ice-cream", "I love chocolate"]
    - **Word Tokenization**: Splitting sentences into words.
      Example: "I love biryani" → ["I", "love", "biryani"]
    - **Character Tokenization**: Splitting words into characters.
      Example: "Love" → ["L", "o", "v", "e"]
    """
)

if st.button("Try Tokenization Example"):
    text = "Streamlit makes NLP visualization interactive."
    st.write(f"Original Text: {text}")
    st.write(f"Word Tokens: {text.split()}")

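Sentence tokenization can be sketched with a simple regular expression (a naive split on sentence-ending punctuation; library tokenizers such as NLTK's `sent_tokenize` also handle edge cases like abbreviations):

```python
import re

text = "Is it raining? I love ice-cream. Let's go!"

# Naive sentence tokenization on ., !, and ? — this would wrongly
# split on abbreviations such as "Dr.", which real tokenizers handle.
sentence_tokens = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

print(sentence_tokens)  # ['Is it raining', 'I love ice-cream', "Let's go"]
```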
# Section: Stop Words
st.header("3. Stop Words")
st.write(
    """
    Stop words are commonly used words in a language that are ignored during text processing because they contribute little to the overall meaning.

    Example:
    - Sentence: "In Hyderabad, we can eat famous biryani."
    - Stop words: ["in", "we", "can"]
    """
)

if st.button("View Processed Text without Stop Words"):
    text = "In Hyderabad, we can eat famous biryani."
    stop_words = ["in", "we", "can"]
    filtered_text = " ".join([word for word in text.split() if word.lower() not in stop_words])
    st.write(f"Processed Text: {filtered_text}")

# Section: Vectorization
st.header("4. Vectorization")
st.write(
    """
    Vectorization converts text data into numerical formats for machine learning models, enabling text processing and analysis.

    Types of Vectorization:
    - **One-Hot Encoding**: Represents each word as a binary vector.
    - **Bag of Words (BoW)**: Represents text based on word frequencies.
    - **TF-IDF (Term Frequency-Inverse Document Frequency)**: Weights a word's frequency in a document by how rare the word is across the corpus.
    - **Word2Vec**: Embeds words in a vector space using a shallow neural network.
    - **GloVe**: Uses global co-occurrence statistics for embedding.
    - **FastText**: Similar to Word2Vec but includes subword information.
    """
)
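A minimal sketch of one-hot and bag-of-words vectors in plain Python (the tiny corpus is illustrative; libraries such as scikit-learn's `CountVectorizer` do this at scale):

```python
corpus = ["i love biryani", "i love chocolate"]

# Build a sorted vocabulary over the whole corpus.
vocab = sorted({word for doc in corpus for word in doc.split()})
# vocab == ['biryani', 'chocolate', 'i', 'love']

def one_hot(word):
    # One-hot encoding: a binary vector with a single 1 at the word's index.
    return [1 if w == word else 0 for w in vocab]

def bow(doc):
    # Bag of words: per-document count of each vocabulary word.
    tokens = doc.split()
    return [tokens.count(w) for w in vocab]

print(one_hot("love"))  # [0, 0, 0, 1]
print(bow(corpus[0]))   # [1, 0, 1, 1]
print(bow(corpus[1]))   # [0, 1, 1, 1]
```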

# Section: Stemming
st.header("5. Stemming")
st.write(
    """
    Stemming reduces words to their base or root form by chopping off prefixes or suffixes. It is a rule-based heuristic process
    and can produce words that may not be valid in the language.

    Example:
    - Original Words: "running", "runner", "runs"
    - Stemmed Form: "run"
    """
)

if st.button("Try Stemming Example"):
    words = ["running", "runner", "runs"]

    # Toy suffix-stripping stemmer for illustration; real stemmers
    # (e.g., NLTK's PorterStemmer) apply much more careful rules.
    def crude_stem(word):
        for suffix in ("ning", "ner", "ing", "er", "s"):
            if word.endswith(suffix):
                return word[: -len(suffix)]
        return word

    stemmed_words = [crude_stem(word) for word in words]
    st.write("Original Words:", words)
    st.write("Stemmed Words:", stemmed_words)

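To make the caveat above concrete, a toy suffix-stripper (an illustration, not a real stemmer) readily produces outputs that are not valid words:

```python
def toy_stem(word):
    # Strip a few common suffixes; with no dictionary check, the
    # result is not guaranteed to be a real word.
    for suffix in ("ies", "ing", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(toy_stem("studies"))  # 'stud' — not a valid English word
print(toy_stem("caring"))   # 'car' — a real word, but not the lemma of "caring"
```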
# Section: Lemmatization
st.header("6. Lemmatization")
st.write(
    """
    Lemmatization reduces words to their dictionary or base form, called a lemma, while considering the context of the word in a sentence.

    Example:
    - Original Words: "studying", "better", "carrying"
    - Lemmatized Form: "study", "good", "carry"

    Lemmatization is more accurate than stemming but computationally more intensive, as it requires a language dictionary.
    """
)

if st.button("Try Lemmatization Example"):
    words = ["studying", "better", "carrying"]
    # Hardcoded lemma lookup for illustration; a real lemmatizer
    # (e.g., NLTK's WordNetLemmatizer) uses a full lexicon and POS tags.
    lemmas = {"studying": "study", "better": "good", "carrying": "carry"}
    lemmatized_words = [lemmas.get(word, word) for word in words]
    st.write("Original Words:", words)
    st.write("Lemmatized Words:", lemmatized_words)
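The lookup pattern behind the demo above can be generalized into a small standalone function (the three-entry dictionary is illustrative; a real lemmatizer consults a full lexicon and the word's context):

```python
# Tiny illustrative lemma dictionary; real lemmatizers (e.g., NLTK's
# WordNetLemmatizer) consult a full lexicon plus part-of-speech tags.
lemma_dict = {"studying": "study", "better": "good", "carrying": "carry"}

def lemmatize(word):
    # Fall back to the word itself when it is not in the dictionary.
    return lemma_dict.get(word, word)

print([lemmatize(w) for w in ["studying", "better", "carrying", "run"]])
# ['study', 'good', 'carry', 'run']
```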